Refactor `store` package for performance and simplicity by Anton-Kalpakchiev · Pull Request #633 · uber/kraken

Anton-Kalpakchiev · 2026-06-06T14:38:32Z

TLDR

This PR refactors the store package to 1) be more performant (transition from a TTI to an LRU eviction policy). and 2) be much simpler (go from 8K to <3K LOC, remove unnecessary abstractions, etc.). Please review each commit and its description separately (the PR itself will become quite large).

NOTE - this PR is not plugged into production code, that will be done in a separate PR.

Justification

Why transition from a TTI to an LRU eviction policy (copied from this ERD)

Each origin replica in the hashring has a disk and a memory cache. Both caches are severely underutilized, bleeding performance. The disk cache has 30% util, while the mem cache has <10% util. By refactoring both caches to be LRU-based, we will achieve 100% utilization, significantly improving kraken-origin performance. There is absolutely 0 reason not to do this, as Kraken’s data is immutable, therefore it is optimal to evict data only when necessary, i.e. when making room to cache newer data.

Why refactor the code

The store package has balooned to over ~8k lines with several layers of abstraction. I believe this is way beyond the package's inherent complexity. Changing the package, as proposed in this PR is non-trivial. At this point, it's easier to rewrite it from scratch, reusing existing parts wherever possible. I expect this to 1) significantly simplify the package's interface and 2) reduce the package's internal complexity (from ~8K LOC to <3K).

Implementation (copied from the ERD)

Currently, the cache works as follows:

If the blob is in either caches, serve it to the user.
Otherwise, download it in the memory cache, serve it to the user, replicate it to the disk cache, and evict it from the mem cache.
Data is evicted from disk using a 1h TTI.
Data is evicted from memory as soon as it’s replicated to disk.
If the memory cache is full and a new download request comes in, we fallback to disk (the mem cache is not used for the download).

I propose the following changes:

We only evict data from memory using an LRU to free space for more data.
We only evict data from disk using an LRU to free space for more data.
We cannot evict blobs from the mem cache that are not yet replicated to the disk cache

The new flow is illustrated below:

The overall structure of the new cache will be a tiered store that utilizes both a disk and a mem cache. First, I'll implement the LRU disk store, after which I'll reuse the existing memory cache (maybe slightly tweaked) to create the tiered store. This commit designs the client-facing interface of the disk store.

This commit establishes the data structures and design for the LRU disk store and implements the basic methods needed to write and read from the store. I'll add more tests later.

This commit implements more APIs for the DiskStore and adds extra testing. Once I implement all APIs, I'll focus on the testing suite more. I ended up deciding we don't need a separate `CancelWrite` API and `Delete` can be reused, which should simplify the interface. I also fix a small bug in `Get` where I was using a read lock instead of a write lock.

The "commit" mechanic the store previously exposed assumed that blobs are never read before being fully written to the store. However, this assumption turned out wrong - the agent specifically can seed blobs before they are fully downloaded locally. This commit refactors the "commit" mechanic into a "complete" mechanic. Previously, blobs were either fully in the store or not in the store: - blobs could not be read before they are committed - read APIs ignored uncommitted blobs - uncommitted blobs cannot get automatically evicted by the LRU Now: - blobs may be read before being fully written. - read APIs have an arg `ignoreIncomplete` incase they still want to ignore incomplete blobs (the origin needs this) I also rename some APIs after removing the commit mechanic and add more tests.

- Implement the BanEviction and UnbanEviction APIs of the store that stop a blob from getting evicted - Finish other APIs that were half-implemented due to unevictable blobs not being implemented. - Change the API of Delete to now be able to delete evictable blobs. I looked through the current use of `Delete` for the ca_store and the only time it checked if it was trying to delete persisted blobs was during cleanup. Now that the store's APIs themselves do cleanup, there is no reason for Delete to respect whether a blob is evictable or not. - Add **extensive** testing for 1) the eviction logic and 2) other APIs. - Make some actions no-ops instead of returning errors (e.g. calling MarkComplete on an already complete blob) to make the interface more forgiving. The next commits will: - implement Metadata operations - add logic to recover state from disk after a crash.

thijmv · 2026-06-11T12:45:58Z

+
+// Create adds a new, incomplete blob to the store and reserves space for it.
+// Incomplete entries cannot be automatically evicted. MarkComplete must be called once the blob is complete.
+// DiskStore does not ever check/use the real size of the blob and only uses `sizeBytes` for its eviction logic.


Do you think it is feasible to have rwImpl enforce that no more bytes are written than allocated?

We could, but I'm not sure we want that - clients usually call Stat on a blob before putting in the store and the Stat implementation for the GCS client doesn't always return the actual size of the blob - sometimes it slightly underreports the size. This should be ok, as the difference should be only a few bytes, which shd be negligible when the store's capacity is 1+ TB.

On a similar note, we actually had a problem in the mem cache where Stat was underreporting blob sizes by a bit, e.g. a 101 byte blob was being reported as 100 bytes. So during the reservation logic, we did size += 101 and then when deleting the blob, we did size -= len(data) which essentially did size -= 101. This caused integer underflow and led to bugs. This store implementation only ever uses the client-provided size to avoid this exact scenario.

thijmv · 2026-06-11T12:52:48Z

+	// TODO - emit latency to reserve space for a blob.
+	for s.size+space > s.capacity {
+		if s.evictQueue.Len() == 0 {
+			return errors.New("cannot evict enough, the unevictable/incomplete blobs are using up all the space")


Is this scenario something that may happen in production during high loads? Should we enforce a TTI (e.g. on inactivity or a hard deadline) on incomplete/unevictable blobs to prevent the store from being stuck because it is filled only with incomplete/unevictable blobs?

Is this scenario something that may happen in production during high loads?

If the load is so high that we need to start evicting blobs before they are even fully written to the store, then we would essentially never be able to complete additions to the store, which would render the whole store useless. This error is here to prevent that. In practice, I'm not sure if that would actually happen.

Should we enforce a TTI (e.g. on inactivity or a hard deadline) on incomplete/unevictable blobs to prevent the store from being stuck because it is filled only with incomplete/unevictable blobs?

This would only be a concern if a client doesn't call MarkComplete, leaking the blob. I thought about it (and even left a TODO in the code to consider it), but decided to trust the clients not to leak files for now. If we see in prod that files are getting leaked, we can implement the TTI, but for now, I'd prefer to avoid adding extra complexity that is not proven to be needed.

thijmv · 2026-06-11T12:54:45Z

+
+	if _, ok := s.blobs[key]; ok {
+		// TODO - consider whether we need public errors for these cases.
+		return nil, errors.New("blob is already in store")


nit: Move the errors.New(...)s to var sentinels at the top of the file.

after refactoring, we don't actually repeat the same error message twice. Do you still think we should move the errors.New to the top of the file?

thijmv · 2026-06-11T13:38:16Z

+	t.Skip("TODO")
+}
+
+func TestDelete(t *testing.T) {


(same elsewhere): All of these tests, except their last subtest, use a similar structure. Do you think it would be better moving them to a table test?

not sure - id need to have a few if statements in the general test case, which in general is not good practice for table tests as it makes them harder to read (you need to consider all possible scenarios). That's why I decided to just duplicate some test logic. WDYT?

Previously, the store used 3 maps to differentiate between different blob types: - complete blobs - unevictable blobs - incomplete blobs This was not a clean solution, as now all key-value APIs now needed to check all 3 maps for the existence of the key. Also APIs that transitioned a blob from one blob type to another required moving the blob between the maps. To remove this complexity, now we use a single map to manage all blobs. Instead of determining the blob's type by checking which map it belongs to, we create a `blob` struct with the blob's necessary data and store that in the map.

Anton-Kalpakchiev requested review from iawakimjan and sambhav-jain-16 June 6, 2026 14:38

github-actions Bot added the size/m label Jun 6, 2026

Anton-Kalpakchiev changed the title ~~Refactor store package to be simpler and more performant~~ Refactor store package for performance and simplicity Jun 6, 2026

Anton-Kalpakchiev requested a review from thijmv June 6, 2026 14:45

github-actions Bot added the size/xl label Jun 6, 2026

Anton-Kalpakchiev force-pushed the lru-cache branch from 6eafd5e to 3c297e5 Compare June 6, 2026 19:35

Anton-Kalpakchiev added 2 commits June 6, 2026 21:40

feat(DiskStore): implement StartWrite, CommitWrite, and Get

9c7805a

This commit establishes the data structures and design for the LRU disk store and implements the basic methods needed to write and read from the store. I'll add more tests later.

Anton-Kalpakchiev force-pushed the lru-cache branch from 3c297e5 to 9c7805a Compare June 6, 2026 19:43

Anton-Kalpakchiev force-pushed the lru-cache branch from d3ffa2a to 17bb3fc Compare June 8, 2026 14:01

Anton-Kalpakchiev added 3 commits June 8, 2026 19:58

feat(DiskStore): implement storing blob metadata

d6ac925

thijmv reviewed Jun 11, 2026

View reviewed changes

feat(DiskStore): reboot state from disk after crash

b7fe164

Anton-Kalpakchiev force-pushed the lru-cache branch from 03c5d91 to b7fe164 Compare June 11, 2026 18:48

Anton-Kalpakchiev force-pushed the lru-cache branch 3 times, most recently from fea05e7 to 01cd3d2 Compare June 12, 2026 15:29

fix(DiskStore): Fix tests and improve style

2683553

Anton-Kalpakchiev force-pushed the lru-cache branch from 01cd3d2 to 2683553 Compare June 12, 2026 15:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor `store` package for performance and simplicity#633

Refactor `store` package for performance and simplicity#633
Anton-Kalpakchiev wants to merge 9 commits into
masterfrom
lru-cache

Anton-Kalpakchiev commented Jun 6, 2026 •

edited

Loading

Uh oh!

Uh oh!

thijmv Jun 11, 2026

Uh oh!

Anton-Kalpakchiev Jun 11, 2026

Uh oh!

thijmv Jun 11, 2026

Uh oh!

Anton-Kalpakchiev Jun 11, 2026 •

edited

Loading

Uh oh!

thijmv Jun 11, 2026

Uh oh!

Anton-Kalpakchiev Jun 11, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thijmv Jun 11, 2026

Uh oh!

Anton-Kalpakchiev Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Anton-Kalpakchiev commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TLDR

Justification

Why transition from a TTI to an LRU eviction policy (copied from this ERD)

Why refactor the code

Implementation (copied from the ERD)

Uh oh!

Uh oh!

thijmv Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Anton-Kalpakchiev Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

thijmv Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Anton-Kalpakchiev Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

thijmv Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Anton-Kalpakchiev Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thijmv Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Anton-Kalpakchiev Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Anton-Kalpakchiev commented Jun 6, 2026 •

edited

Loading

Anton-Kalpakchiev Jun 11, 2026 •

edited

Loading