Skip to content

Refactor store package for performance and simplicity#633

Open
Anton-Kalpakchiev wants to merge 9 commits into
masterfrom
lru-cache
Open

Refactor store package for performance and simplicity#633
Anton-Kalpakchiev wants to merge 9 commits into
masterfrom
lru-cache

Conversation

@Anton-Kalpakchiev

@Anton-Kalpakchiev Anton-Kalpakchiev commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

TLDR

This PR refactors the store package to 1) be more performant (transition from a TTI to an LRU eviction policy). and 2) be much simpler (go from 8K to <3K LOC, remove unnecessary abstractions, etc.). Please review each commit and its description separately (the PR itself will become quite large).

NOTE - this PR is not plugged into production code, that will be done in a separate PR.

Justification

Why transition from a TTI to an LRU eviction policy (copied from this ERD)

Each origin replica in the hashring has a disk and a memory cache. Both caches are severely underutilized, bleeding performance. The disk cache has 30% util, while the mem cache has <10% util. By refactoring both caches to be LRU-based, we will achieve 100% utilization, significantly improving kraken-origin performance. There is absolutely 0 reason not to do this, as Kraken’s data is immutable, therefore it is optimal to evict data only when necessary, i.e. when making room to cache newer data.

Why refactor the code

The store package has balooned to over ~8k lines with several layers of abstraction. I believe this is way beyond the package's inherent complexity. Changing the package, as proposed in this PR is non-trivial. At this point, it's easier to rewrite it from scratch, reusing existing parts wherever possible. I expect this to 1) significantly simplify the package's interface and 2) reduce the package's internal complexity (from ~8K LOC to <3K).

Implementation (copied from the ERD)

Currently, the cache works as follows:

  1. If the blob is in either caches, serve it to the user.
  2. Otherwise, download it in the memory cache, serve it to the user, replicate it to the disk cache, and evict it from the mem cache.
  3. Data is evicted from disk using a 1h TTI.
  4. Data is evicted from memory as soon as it’s replicated to disk.
  5. If the memory cache is full and a new download request comes in, we fallback to disk (the mem cache is not used for the download).

I propose the following changes:

  1. We only evict data from memory using an LRU to free space for more data.
  2. We only evict data from disk using an LRU to free space for more data.
  3. We cannot evict blobs from the mem cache that are not yet replicated to the disk cache

The new flow is illustrated below:
image

@github-actions github-actions Bot added the size/m label Jun 6, 2026
@Anton-Kalpakchiev Anton-Kalpakchiev changed the title Refactor store package to be simpler and more performant Refactor store package for performance and simplicity Jun 6, 2026
@Anton-Kalpakchiev Anton-Kalpakchiev requested a review from thijmv June 6, 2026 14:45
The overall structure of the new cache will be a tiered store that
utilizes both a disk and a mem cache. First, I'll implement the LRU
disk store, after which I'll reuse the existing memory cache (maybe
slightly tweaked) to create the tiered store. This commit designs the
client-facing interface of the disk store.
This commit establishes the data structures and design for the LRU
disk store and implements the basic methods needed to write and read
from the store. I'll add more tests later.
This commit implements more APIs for the DiskStore and adds extra testing.
Once I implement all APIs, I'll focus on the testing suite more.

I ended up deciding we don't need a separate `CancelWrite` API and
`Delete` can be reused, which should simplify the interface.

I also fix a small bug in `Get` where I was using a read lock instead of
a write lock.
The "commit" mechanic the store previously exposed assumed that blobs
are never read before being fully written to the store. However, this
assumption turned out wrong - the agent specifically can seed blobs
before they are fully downloaded locally.

This commit refactors the "commit" mechanic into a "complete" mechanic.
Previously, blobs were either fully in the store or not in the store:
 - blobs could not be read before they are committed
 - read APIs ignored uncommitted blobs
 - uncommitted blobs cannot get automatically evicted by the LRU

Now:
 - blobs may be read before being fully written.
 - read APIs have an arg `ignoreIncomplete` incase they still want
to ignore incomplete blobs (the origin needs this)

I also rename some APIs after removing the commit mechanic and add more
tests.
 - Implement the BanEviction and UnbanEviction APIs of the store that
 stop a blob from getting evicted
 - Finish other APIs that were half-implemented due to unevictable
 blobs not being implemented.
 - Change the API of Delete to now be able to delete evictable blobs.
 I looked through the current use of `Delete` for the ca_store and the
 only time it checked if it was trying to delete persisted blobs was
 during cleanup. Now that the store's APIs themselves do cleanup, there
 is no reason for Delete to respect whether a blob is evictable or not.
 - Add **extensive** testing for 1) the eviction logic and 2) other APIs.
 - Make some actions no-ops instead of returning errors (e.g. calling
 MarkComplete on an already complete blob) to make the interface more
 forgiving.

The next commits will:
 - implement Metadata operations
 - add logic to recover state from disk after a crash.
Comment thread lib/store/disk_store.go Outdated
Comment thread lib/store/disk_store.go

// Create adds a new, incomplete blob to the store and reserves space for it.
// Incomplete entries cannot be automatically evicted. MarkComplete must be called once the blob is complete.
// DiskStore does not ever check/use the real size of the blob and only uses `sizeBytes` for its eviction logic.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is feasible to have rwImpl enforce that no more bytes are written than allocated?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but I'm not sure we want that - clients usually call Stat on a blob before putting in the store and the Stat implementation for the GCS client doesn't always return the actual size of the blob - sometimes it slightly underreports the size. This should be ok, as the difference should be only a few bytes, which shd be negligible when the store's capacity is 1+ TB.

On a similar note, we actually had a problem in the mem cache where Stat was underreporting blob sizes by a bit, e.g. a 101 byte blob was being reported as 100 bytes. So during the reservation logic, we did size += 101 and then when deleting the blob, we did size -= len(data) which essentially did size -= 101. This caused integer underflow and led to bugs. This store implementation only ever uses the client-provided size to avoid this exact scenario.

Comment thread lib/store/disk_store.go
// TODO - emit latency to reserve space for a blob.
for s.size+space > s.capacity {
if s.evictQueue.Len() == 0 {
return errors.New("cannot evict enough, the unevictable/incomplete blobs are using up all the space")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this scenario something that may happen in production during high loads? Should we enforce a TTI (e.g. on inactivity or a hard deadline) on incomplete/unevictable blobs to prevent the store from being stuck because it is filled only with incomplete/unevictable blobs?

@Anton-Kalpakchiev Anton-Kalpakchiev Jun 11, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this scenario something that may happen in production during high loads?

If the load is so high that we need to start evicting blobs before they are even fully written to the store, then we would essentially never be able to complete additions to the store, which would render the whole store useless. This error is here to prevent that. In practice, I'm not sure if that would actually happen.

Should we enforce a TTI (e.g. on inactivity or a hard deadline) on incomplete/unevictable blobs to prevent the store from being stuck because it is filled only with incomplete/unevictable blobs?

This would only be a concern if a client doesn't call MarkComplete, leaking the blob. I thought about it (and even left a TODO in the code to consider it), but decided to trust the clients not to leak files for now. If we see in prod that files are getting leaked, we can implement the TTI, but for now, I'd prefer to avoid adding extra complexity that is not proven to be needed.

Comment thread lib/store/disk_store.go Outdated

if _, ok := s.blobs[key]; ok {
// TODO - consider whether we need public errors for these cases.
return nil, errors.New("blob is already in store")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Move the errors.New(...)s to var sentinels at the top of the file.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after refactoring, we don't actually repeat the same error message twice. Do you still think we should move the errors.New to the top of the file?

Comment thread lib/store/disk_store_test.go Outdated
Comment thread lib/store/disk_store_test.go Outdated
Comment thread lib/store/disk_store_test.go Outdated
Comment thread lib/store/disk_store_test.go Outdated
Comment thread lib/store/disk_store_test.go Outdated
t.Skip("TODO")
}

func TestDelete(t *testing.T) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same elsewhere): All of these tests, except their last subtest, use a similar structure. Do you think it would be better moving them to a table test?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure - id need to have a few if statements in the general test case, which in general is not good practice for table tests as it makes them harder to read (you need to consider all possible scenarios). That's why I decided to just duplicate some test logic. WDYT?

Previously, the store used 3 maps to differentiate between different
blob types:
 - complete blobs
 - unevictable blobs
 - incomplete blobs

This was not a clean solution, as now all key-value APIs now needed
to check all 3 maps for the existence of the key. Also APIs that
transitioned a blob from one blob type to another required moving the
blob between the maps.

To remove this complexity, now we use a single map to manage all blobs.
Instead of determining the blob's type by checking which map it belongs
to, we create a `blob` struct with the blob's necessary data and store
that in the map.
@Anton-Kalpakchiev Anton-Kalpakchiev force-pushed the lru-cache branch 3 times, most recently from fea05e7 to 01cd3d2 Compare June 12, 2026 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants