Refactor store package for performance and simplicity#633
Refactor store package for performance and simplicity#633Anton-Kalpakchiev wants to merge 9 commits into
store package for performance and simplicity#633Conversation
store package to be simpler and more performantstore package for performance and simplicity
6eafd5e to
3c297e5
Compare
The overall structure of the new cache will be a tiered store that utilizes both a disk and a mem cache. First, I'll implement the LRU disk store, after which I'll reuse the existing memory cache (maybe slightly tweaked) to create the tiered store. This commit designs the client-facing interface of the disk store.
This commit establishes the data structures and design for the LRU disk store and implements the basic methods needed to write and read from the store. I'll add more tests later.
3c297e5 to
9c7805a
Compare
This commit implements more APIs for the DiskStore and adds extra testing. Once I implement all APIs, I'll focus on the testing suite more. I ended up deciding we don't need a separate `CancelWrite` API and `Delete` can be reused, which should simplify the interface. I also fix a small bug in `Get` where I was using a read lock instead of a write lock.
d3ffa2a to
17bb3fc
Compare
The "commit" mechanic the store previously exposed assumed that blobs are never read before being fully written to the store. However, this assumption turned out wrong - the agent specifically can seed blobs before they are fully downloaded locally. This commit refactors the "commit" mechanic into a "complete" mechanic. Previously, blobs were either fully in the store or not in the store: - blobs could not be read before they are committed - read APIs ignored uncommitted blobs - uncommitted blobs cannot get automatically evicted by the LRU Now: - blobs may be read before being fully written. - read APIs have an arg `ignoreIncomplete` incase they still want to ignore incomplete blobs (the origin needs this) I also rename some APIs after removing the commit mechanic and add more tests.
- Implement the BanEviction and UnbanEviction APIs of the store that stop a blob from getting evicted - Finish other APIs that were half-implemented due to unevictable blobs not being implemented. - Change the API of Delete to now be able to delete evictable blobs. I looked through the current use of `Delete` for the ca_store and the only time it checked if it was trying to delete persisted blobs was during cleanup. Now that the store's APIs themselves do cleanup, there is no reason for Delete to respect whether a blob is evictable or not. - Add **extensive** testing for 1) the eviction logic and 2) other APIs. - Make some actions no-ops instead of returning errors (e.g. calling MarkComplete on an already complete blob) to make the interface more forgiving. The next commits will: - implement Metadata operations - add logic to recover state from disk after a crash.
|
|
||
| // Create adds a new, incomplete blob to the store and reserves space for it. | ||
| // Incomplete entries cannot be automatically evicted. MarkComplete must be called once the blob is complete. | ||
| // DiskStore does not ever check/use the real size of the blob and only uses `sizeBytes` for its eviction logic. |
There was a problem hiding this comment.
Do you think it is feasible to have rwImpl enforce that no more bytes are written than allocated?
There was a problem hiding this comment.
We could, but I'm not sure we want that - clients usually call Stat on a blob before putting in the store and the Stat implementation for the GCS client doesn't always return the actual size of the blob - sometimes it slightly underreports the size. This should be ok, as the difference should be only a few bytes, which shd be negligible when the store's capacity is 1+ TB.
On a similar note, we actually had a problem in the mem cache where Stat was underreporting blob sizes by a bit, e.g. a 101 byte blob was being reported as 100 bytes. So during the reservation logic, we did size += 101 and then when deleting the blob, we did size -= len(data) which essentially did size -= 101. This caused integer underflow and led to bugs. This store implementation only ever uses the client-provided size to avoid this exact scenario.
| // TODO - emit latency to reserve space for a blob. | ||
| for s.size+space > s.capacity { | ||
| if s.evictQueue.Len() == 0 { | ||
| return errors.New("cannot evict enough, the unevictable/incomplete blobs are using up all the space") |
There was a problem hiding this comment.
Is this scenario something that may happen in production during high loads? Should we enforce a TTI (e.g. on inactivity or a hard deadline) on incomplete/unevictable blobs to prevent the store from being stuck because it is filled only with incomplete/unevictable blobs?
There was a problem hiding this comment.
Is this scenario something that may happen in production during high loads?
If the load is so high that we need to start evicting blobs before they are even fully written to the store, then we would essentially never be able to complete additions to the store, which would render the whole store useless. This error is here to prevent that. In practice, I'm not sure if that would actually happen.
Should we enforce a TTI (e.g. on inactivity or a hard deadline) on incomplete/unevictable blobs to prevent the store from being stuck because it is filled only with incomplete/unevictable blobs?
This would only be a concern if a client doesn't call MarkComplete, leaking the blob. I thought about it (and even left a TODO in the code to consider it), but decided to trust the clients not to leak files for now. If we see in prod that files are getting leaked, we can implement the TTI, but for now, I'd prefer to avoid adding extra complexity that is not proven to be needed.
|
|
||
| if _, ok := s.blobs[key]; ok { | ||
| // TODO - consider whether we need public errors for these cases. | ||
| return nil, errors.New("blob is already in store") |
There was a problem hiding this comment.
nit: Move the errors.New(...)s to var sentinels at the top of the file.
There was a problem hiding this comment.
after refactoring, we don't actually repeat the same error message twice. Do you still think we should move the errors.New to the top of the file?
| t.Skip("TODO") | ||
| } | ||
|
|
||
| func TestDelete(t *testing.T) { |
There was a problem hiding this comment.
(same elsewhere): All of these tests, except their last subtest, use a similar structure. Do you think it would be better moving them to a table test?
There was a problem hiding this comment.
not sure - id need to have a few if statements in the general test case, which in general is not good practice for table tests as it makes them harder to read (you need to consider all possible scenarios). That's why I decided to just duplicate some test logic. WDYT?
03c5d91 to
b7fe164
Compare
Previously, the store used 3 maps to differentiate between different blob types: - complete blobs - unevictable blobs - incomplete blobs This was not a clean solution, as now all key-value APIs now needed to check all 3 maps for the existence of the key. Also APIs that transitioned a blob from one blob type to another required moving the blob between the maps. To remove this complexity, now we use a single map to manage all blobs. Instead of determining the blob's type by checking which map it belongs to, we create a `blob` struct with the blob's necessary data and store that in the map.
fea05e7 to
01cd3d2
Compare
01cd3d2 to
2683553
Compare
TLDR
This PR refactors the
storepackage to 1) be more performant (transition from a TTI to an LRU eviction policy). and 2) be much simpler (go from 8K to <3K LOC, remove unnecessary abstractions, etc.). Please review each commit and its description separately (the PR itself will become quite large).NOTE - this PR is not plugged into production code, that will be done in a separate PR.
Justification
Why transition from a TTI to an LRU eviction policy (copied from this ERD)
Each origin replica in the hashring has a disk and a memory cache. Both caches are severely underutilized, bleeding performance. The disk cache has 30% util, while the mem cache has <10% util. By refactoring both caches to be LRU-based, we will achieve 100% utilization, significantly improving kraken-origin performance. There is absolutely 0 reason not to do this, as Kraken’s data is immutable, therefore it is optimal to evict data only when necessary, i.e. when making room to cache newer data.
Why refactor the code
The
storepackage has balooned to over ~8k lines with several layers of abstraction. I believe this is way beyond the package's inherent complexity. Changing the package, as proposed in this PR is non-trivial. At this point, it's easier to rewrite it from scratch, reusing existing parts wherever possible. I expect this to 1) significantly simplify the package's interface and 2) reduce the package's internal complexity (from ~8K LOC to <3K).Implementation (copied from the ERD)
Currently, the cache works as follows:
I propose the following changes:
The new flow is illustrated below:
