cache: replace read/dirty with a concurrent hash-trie by twmb · Pull Request #6 · twmb/go-cache

twmb · 2026-04-21T23:49:16Z

Summary

Replaces the read/dirty/promote backing of cache.Cache with a concurrent
hash-trie modeled on Go's internal/sync.HashTrieMap, implemented entirely in
userspace (hash/maphash + unsafe.Sizeof(uintptr(0)); no internal/abi or
internal/goarch). Public API is unchanged; all documented semantics are
preserved, with two behavior shifts noted below.

The port is followed by fifteen hardening commits: an adversarial audit run
until it converged (two consecutive passes with zero behavioral findings)
fixed 25 behavioral bugs — every one reproduced red against the pre-fix code —
plus a series of doc inaccuracies. The broad strokes are below; each commit
message is self-sufficient.

Why

The read/dirty design amortizes a write-map promotion across operations; the
per-op cost is decent on average but adversarial workloads pay heavily for
the copy. The trie spreads cost across per-bucket locks and eliminates the
promotion cycle.
Upstream sync.Map moved to a hash trie in Go 1.24, so the old inspiration
in this package's docs was stale.
Singleflight, stale values, TTLs, idle-age, and error caching all live above
the storage layer and carry over unchanged.

The port (first six commits)

56c981f bump to go 1.24, fix stale-at-epoch, tighten Clean — baseline
cleanup. Go modules had claimed 1.18 but the code required 1.19+
(atomic.Pointer[T], atomic.Int64, atomic.Uint32). newStale was
producing a stale born at the Unix epoch when the main entry's expiry was
<= 0 — now bases on now(). Clean calls e.del() directly instead of
c.Delete(k) since c.each has already promoted.
d6e9fc3 expand test suite; fix data race in Get's post-miss return —
drives the old implementation to 100% statement coverage first, so the
port has a behavioral oracle. The new race-targeted tests caught a real
data race in the old code: Cache.Get read l.v with no happens-before
edge when a concurrent Swap had replaced e.p with its own loading.
8e06ee5 add concurrent hash-trie in isolation — new cache/trie.go +
unit tests. Lock-free loads; LoadOrStore/Delete under a per-bucket
mutex; bottom-up hand-over-hand pruning; root atomically swapped on
Clear; overflow chains for full hash collisions.
10c3ee7 back Cache with the hash-trie; retire read/dirty machinery —
Cache holds a trie[K, loading[V]] instead of r/mu/dirty/
misses/pd. Delete physically removes via deleteEntryIf; keys stay
promptly GC-eligible (TestIssue40999).
7c63a66 README: rewrite for the trie-backed cache — new design blurb,
fresh benchmark numbers (table below).
134488b port the stdlib sync.Map tests we were missing —
TestConcurrentClear, TestMapClearOneAllocation,
TestMapRangeNoAllocations, BenchmarkClear.

Hardening (remaining fifteen commits)

Trie/port races (db5c03d): delete-vs-expand silently skipping a delete,
tombstone resurrection losing a stored value, and Clean evicting newborn
entries — all hammer-reproduced; plus CompareAndDelete leaking tombstones
and NewSet/NewItem never starting the autoclean goroutine.
Expiry-word protocol (2f95c76, be44fed, b200842): a concurrent idle
extension could silently overwrite Expire or resurrect expired values;
Clean could discard a racing extension, evict entries whose stale was
still servable, or — through its claim window — resurrect a certified-dead
value as a fresh Stale. Fixed with a CAS-based claim protocol and a
dedicated claim sentinel; expiry math now saturates and the clock is
monotonic.
Stale-window state machine (be30581, the two High findings): the
stale field conflated the refresh companion with a value's own post-expiry
grace window, so TryGet and Get disagreed at expiry, older-than-freshest
values could be served, and born-expired re-anchoring served values of
unbounded age as Stale. Windows are now derived at read time from the
expires word, so every read, refresh, and Clean agree by construction; and
Expire can no longer move an expiry forward (the invalidation API could
extend data lifetime).
Linearizable reads and CAS (1afccb4, 0617c23, ac4d8d3): every liveness
judgment now uses a coherent (expiry word, clock) pair, and
CompareAndSwap/CompareAndDelete re-validate liveness immediately before
publishing — closing windows where a read could misclassify a
continuously-live entry or a CAS could succeed on a value TryGet already
certified as gone. The irreducible instructions-wide residue is documented
on CAS/CAD as best effort.
Docs (the rest): Range's no-snapshot contract, the Get family's
stale-return and miss-run conditions, MaxErrorAge's default under
MaxIdleAge, Clear's in-flight-load semantics, and the CAS/CAD caveats.
Two behaviors were documented as deliberate rather than changed:
Delete/Swap may report a concurrently finalizing load as the removed
value, and MaxErrorAge(0) waiters re-drive serially ("errors are never
cached").

Semantic shifts vs. read/dirty

Both are documented in the code itself:

Range loses its promote-snapshot flavor. Previously each promoted
under c.mu and ranged a fixed key set. Now Range is a lock-free walk;
the contract is sync.Map.Range's, stated on Cache.Range.
Clean no longer serializes on c.mu. Iteration is lock-free; per-key
physical removal takes only the trie's bucket lock.

Verification

Linearizability checker: concurrent Set/Swap/Delete/TryGet/CAS/
CAD batches against single keys, verified explainable by a
real-time-consistent sequential order. Expiry-adjacent findings are pinned
by deterministic scheduling hooks instead (unexported, nil outside tests).
Trie structural validator (hash placement, chain purity, parent pointers,
pruning completeness) at the quiesce point of every concurrent trie test.
Every behavioral fix has a regression test confirmed red against the
pre-fix code; hammer rates for the probabilistic ones are in the commit
messages.
CI: -race on amd64 and arm64, plus linux/386 without -race for the
32-bit hash-budget path. Coverage ~98.6–99%; the residue is documented
defensive panics.
At HEAD: go test -race ./... green; go vet clean on native and
linux/386.

Benchmarks

Apple M1, Go 1.26, -count=5 -benchtime=1s, benchstat comparing the
pre-trie baseline to the port commit. (Measured at 7c63a66; the audit rounds
since then added a pre-publish revalidation and paired clock reads on the
CAS/extension paths — worth a benchstat re-run before merge if those paths
matter to a consumer.)

Time per op — geomean -35.24%.

Largest wins:

Benchmark	Pre-trie	Trie	Δ
AdversarialAlloc	350.8 ns	11.4 ns	-96.74%
AdversarialDelete	132.2 ns	6.5 ns	-95.10%
SwapMostlyMisses	1032 ns	154 ns	-85.11%
LoadOrStoreUnique	1449 ns	592 ns	-59.17%
LoadAndDeleteBalanced	10.7 ns	4.4 ns	-58.66%
LoadAndDeleteUnique	4.3 ns	1.8 ns	-58.59%
LoadOrStoreBalanced	702 ns	318 ns	-54.77%
CompareAndDeleteMostlyHits	131 ns	91 ns	-30.63%
DeleteCollision	5.1 ns	3.6 ns	-29.23%
SwapCollision	338 ns	242 ns	-28.47%
CompareAndSwapNoExistingKey	2.5 ns	1.9 ns	-22.97%

Regressions:

Benchmark	Pre-trie	Trie	Δ
LoadAndDeleteCollision	7.1 ns	67.6 ns	+853.35%
CompareAndSwapCollision	10.1 ns	17.6 ns	+73.45%
LoadMostlyMisses	3.4 ns	4.6 ns	+33.40%

The *Collision regressions are intrinsic: the trie physically removes on
Delete (required to keep keys GC-eligible per TestIssue40999) so every
Delete takes the bucket lock; pre-trie deferred physical removal to the
promotion cycle. LoadMostlyMisses pays a small fixed cost for the hash walk
on every miss.

Allocations. AdversarialAlloc and AdversarialDelete drop to 0 B/op
(pre-trie: 55 and 36 — the cost of dirty-map promotion copying).
SwapMostlyMisses drops from 4 allocs/op to 2. Remaining per-op allocations
on the insert/swap paths are intrinsic: each operation may install a fresh
loading[V] carrying the singleflight machinery (Mutex, WaitGroup,
atomic counters) that a plain sync.Map does not need.

- Bump go.mod from 1.18 to 1.24. The code has required at least 1.19 since the switch to generic atomic types (atomic.Pointer[T], atomic.Int64, atomic.Uint32). 1.24 also unlocks maps.Copy, which replaces the hand-rolled copy loop in promote. - newStale: when the main entry's expiry is 0 (infinite cache) or -1 (immediate expiry), base the stale's lifespan on now() instead of producing a stale born at the Unix epoch. This was a latent bug: Get paths typically regenerated the stale via maybeNewStale before it was observed, but the initial stale set by Swap/Set was wrong. - Clean: call e.del() directly instead of c.Delete(k). c.each already promotes when necessary, so every entry the callback sees lives in the read map and needs no dirty-map bookkeeping or per-key relock. - Document that Expire is a no-op on in-flight loads for Cache, Item, and Set.

Test coverage climbs from 84.3% to 100% of statements. The new tests either target concretely uncovered behavior (Item/Set wrappers, autoclean lifecycle, Clean no-op on MaxStaleAge<0) or target race windows in the existing read/dirty implementation (Swap-vs-in-flight-Get, Delete-during-load, Expire-no-op-during-load, maxAge=0 collapsing, the promote CAS-retry path, Swap observing the promotingDelete sentinel, Get's unlocked-miss / locked-hit interleaving, CompareAndSwap against a just-promoted entry). Single -count=1 -race runs land 100% coverage reliably on this machine; the stress test driving the tightest windows runs for ~2s against many goroutines. The Swap-vs-in-flight-Get test caught a real data race (reported by -race). After Get's final e.get, the original code returned l.v without synchronizing with any writer: concurrent Swap's defer writes l.v under l.mu and calls l.wg.Done, but Get was only waiting on whatever loading e.p pointed to at the time, which could be Swap's replacement (a different loading). Add l.wg.Wait() before returning l.v to establish the happens-before edge with whoever finalized l (either setve or Swap's defer). New test files: - wrappers_test.go: every Item and Set method - autoclean_test.go: AutoCleanInterval goroutine lifecycle and StopAutoClean idempotency - concurrency_test.go: race-targeted tests and the miss/stale correctness properties called out in the trie port plan

Introduces cache/trie.go: a generic concurrent hash-trie map modeled on Go's internal/sync.HashTrieMap, but implemented entirely in userspace (hash/maphash for the runtime's typed hasher, unsafe.Sizeof(uintptr(0)) for pointer-size, no internal/abi or internal/goarch). The trie is not yet wired into Cache; that happens in a later commit. Design: - Root is a lazily-allocated indirect node, atomically swapped on Clear so readers holding the old root keep operating on it. - Interior nodes hold 16 atomic child pointers; leaves hold a key, an atomic value slot (*V), and an atomic overflow pointer for hash-collision chains. - Load is lock-free; LoadOrStore and Delete take the parent's per-bucket mutex. Delete prunes empty parents bottom-up while holding locks hand-over-hand. - expand splits a leaf into a subtree on hash-prefix collision; full hash collisions chain through overflow. Tests (cache/trie_test.go) cover: zero-value trie, basic store/load, load-or-store semantics with nil-value initialization, single and collision-chain delete (head and mid-chain), large-key stress (20k keys) through expand, full hash collisions via a test-only hashFn hook, concurrent disjoint keys, concurrent contention on the same key, concurrent delete-prune, range-during-mutation, and the deleteEntry lock-retry paths under sustained delete contention. Coverage is 99.1%. The five uncovered statements are defensive panics at invariant-violation points (tree deeper than the hash has bits, node type discriminator corrupted, etc.) that cannot fire without memory corruption; each is documented inline with the reason it is unreachable. Also: - Scale TestMaxIdleAge sleep windows up by ~4x so the timing-based assertions tolerate -count=3 -race load. The previous 30ms-within- 50ms-window spacing had no margin for scheduler delay. - Fix TestSwap_CancelsInFlightGet: synchronize on the miss goroutine actually completing (via a missDone channel) before asserting the miss ran. Previously the test asserted a flag set by a goroutine that hadn't been scheduled to run yet under heavy load. - Modernize remaining int-range for-loops.

The read/dirty/promote structure is replaced by the concurrent hash-trie from the previous commit. Public API, stale/TTL semantics, singleflight collapsing, and CompareAndSwap/Delete behavior are preserved. Shape of the rewrite: - Cache holds a single `trie[K, loading[V]]` instead of r/mu/dirty/ misses/pd. - An `ent[K, V]` type alias (Go 1.24 generic aliases) names the trie's concrete entry type so callers can read `*ent[K, V]` instead of `*trieEntry[K, loading[V]]` everywhere. All entry-manipulating logic (entGet, entTryGet, entDel, entLoad, entMaybeNewStale) is free functions over that alias, since Go generics do not permit methods on instantiation-specific aliases. - Delete physically removes the entry via the trie's deleteEntryIf primitive (with a "still tombstoned under lock" predicate to avoid clobbering a concurrent Swap resurrection). This lets keys and their referents be GC'd, preserving TestIssue40999. - Clean walks the trie, tombstones expired entries, and then calls deleteEntryIf for each tombstone so the trie's memory footprint does not grow unboundedly with delete churn. - Range iterates via the trie walk; no snapshot (matches sync.Map and the existing Range docstring). - Swap's fast path CAS on an existing entry's value slot is preserved without any global lock; the slow path goes through loadOrStoreEntry and resolves races on the trie's per-bucket mutex. Added to the trie for Cache's benefit: - deleteEntryIf(k, pred): removes only if pred returns true under the parent bucket's lock. Callers use this to avoid the resurrected-during-delete race. - Inline chain removal in deleteEntryIf, retiring the separate trieRemoveFromChain helper (it had a "not found" return path that became dead after the pred-and-locate restructure). Two new tests beyond the existing suite: - TestGet_ConcurrentStaleDuringInFlight: a second Get observes a stale-returning loading that is already in flight from the first Get and returns the stale without attempting its own miss. - TestGet_StaleRefreshCASRace: 16 concurrent Gets on an expired-with- stale key exercise the slow-path CAS-retry that replaces a finalized prev loading with a fresh loading carrying a stale snapshot. - TestClean_SkipsInFlightLoads: Clean must skip entries whose load has not finalized. - TestTrie_DeleteRaceAgainstExpand: deleteEntryIf's "slot changed to non-entry after lock" branch, triggered by concurrent insert on hash-prefix-colliding keys forcing an expand during delete. Full suite passes under -race; coverage settles at 98.6%-99.0% of statements across independent runs. The remaining uncovered lines are documented defensive panics (invariant violations that require corruption to fire) and two tightly-timed race branches that need scheduler interleavings we cannot force from userspace without test hooks. Benchmark comparison against the pre-trie baseline is deferred to the cleanup commit.

Update the design blurb (no more read/dirty promotion), the coverage footnote (97% → ~99% plus a note on the defensive-panic residue), and the microbenchmark section (fresh numbers from the current code on Apple M1; the old table compared sync.Map to the pre-trie cache and is no longer meaningful). Flag that per-op allocations on the insert/swap paths are intrinsic — the singleflight machinery (loading[V] with a Mutex, WaitGroup, and atomic counters) is something sync.Map doesn't carry, and carrying it costs a single heap allocation per inserted key.

Add three tests and one benchmark that exist in sync/map_test.go and sync/map_bench_test.go but were not in our copy: - TestConcurrentClear: 10 writers, 10 readers, and 10 Clear goroutines run concurrently. Correctness is no panic and no phantom keys (keys we never Stored appearing in Range after the dust settles). - TestMapClearOneAllocation: asserts Clear allocates ≤1 time. Cache's Clear replaces the trie root with one fresh indirect node, matching the sync.Map guarantee. - TestMapRangeNoAllocations: asserts Range does not allocate. The benchmarks already showed 0 B/op but an explicit AllocsPerRun test guards against a regression in the Range closure wiring. - BenchmarkClear: Clear throughput. Skipped from stdlib (intentionally): - TestMapMatchesDeepCopy: a second mapInterface oracle (DeepCopyMap). Our TestCacheMatchesRWMutex already covers the semantics; the stdlib keeps a second oracle mainly to catch sync.Map-specific promotion bugs we no longer have. - TestMapMatchesHashTrieMap: stdlib's internal hash trie as the oracle. We are the hash trie. - TestHashTrieMap{BadHash,TruncHash}: pathological-hash tests. Our TestTrie_FullHashCollision{,DeleteHead} already exercises full collisions via the hashFn hook. - BenchmarkHashTrieMapLoad[Small|Large], LoadOrStore[Large]: trie- level microbenchmarks. Redundant with the cache-level benchmarks since Cache is thin over trie.

An audit against internal/sync.HashTrieMap (which trie.go is modeled on) and the prior read/dirty implementation found three interleaving bugs, each reproduced with hammer tests before fixing: 1) trie.deleteEntryIf gave up (returned nil) when the slot it walked to was expanded into an indirect node between its unlocked walk and its lock, silently skipping a delete of a present key (3,109/200k rounds). Upstream's find() retries from the root in this case; we now do too. 2) Tombstoned entries (nil value slot) could be resurrected by Swap/Get via a bare CAS(nil, loading) that nothing serialized against Delete's and Clean's pred-then-unlink, so a concurrent Set/Swap could land in an entry that was then physically unlinked: the swap reported success (Miss = "stored") and the value vanished (66/200k rounds). The old read/dirty code was immune via the promotingDelete sentinel, which made tombstone removal a single CAS; the port lost that atomicity. A nil value slot is now terminal: no code path ever stores over it. Writers that find a tombstone unlink the dead entry and re-create the key as a fresh entry, which makes deleteEntryIf's nil check stable under the bucket lock. Delete also now derives its return value from the loading it actually displaced, so the value reported deleted is the value removed. 3) Get created its entry with loadOrStoreEntry(k, nil) and CAS'd the loading in afterwards; in that window a fresh, never-expired entry was indistinguishable from a tombstone and Clean would evict it (1/500k rounds), also breaking request collapsing for that key. Get now creates entries with the loading already in place; a linked entry never has a nil slot except as a tombstone. Also: CompareAndDelete now physically unlinks like Delete (tombstones no longer leak until Clean); NewSet/NewItem now actually honor AutoCleanInterval (Set.StopAutoClean previously stopped a goroutine that was never started) and Item gains Clean/StopAutoClean; onlyFinalized is collapsed into finalized (the third state it guarded against retired with the read/dirty machinery); stale comments describing promote/read/ dirty are rewritten; TestTrie_DeleteRaceAgainstExpand now asserts the delete succeeds (it previously only asserted no-crash, exercising bug 1 without catching it); regression tests added for 2 and 3.

The expand-race test's hash literals overflowed uintptr on 32-bit platforms, so 'go vet' / compiling the tests failed there (caught by a GOARCH=386 cross-vet; the library itself was fine). Build the constants with shifts off ptrBits so the test exercises the same top-4-bits collision shape at either word size.

Expire could be silently overwritten by a concurrent idle extension: the extension stored a fresh expiry computed from a clock sampled BEFORE loading the expiry, so an Expire whose now()-1 landed between the two samples looked unexpired and was extended over. Extension now (a) loads the expiry before sampling the clock, so any Expire value it observes is necessarily in its past, and (b) CASes against the observed expiry, so an Expire landing between the load and the write wins. Notably, fix (b) alone was not enough — the regression hammer caught the residual (a) window at ~1/2000 rounds. CompareAndSwap/CompareAndDelete now agree with TryGet about what exists: expired entries and errored entries (whose v is just the miss function's placeholder return) no longer match. Get's docs now state the miss function's panic behavior (it runs on an internal goroutine; a panic crashes the process) and the same-key re-entrancy deadlock. New verification layers: - A linearizability checker: batches of concurrent Set/Swap/Delete/ TryGet/CAS/CAD against single keys, verified to be explained by some real-time-consistent sequential ordering. This catches atomicity bugs generically rather than per-interleaving (the prior tombstone- resurrection bug class fails this checker directly). - A trie structural validator (hash placement, overflow chain purity, parent pointers, dead flags, pruning completeness) run at the quiesce point of every concurrent trie test. - GitHub Actions CI: -race on amd64 and arm64 (weaker memory ordering surfaces reordering bugs x86's TSO hides), plus linux/386 without -race (unsupported there) to exercise the 32-bit hash-budget path. The README test-suite paragraph is rewritten to describe the actual layers and the measured 98% coverage.

Implements the second audit round (independent session), verified by a review pass with full interleaving traces and mutation-tested regressions: - now() is monotonic (time.Since of a process epoch, padded 1s clear of the 0 sentinel). This makes the load-expiry-then-sample-clock ordering argument airtight rather than dependent on wall-clock sanity, and immunizes TTLs against NTP steps. Trade-off (suspend pauses aging) documented on epoch. - Idle extension can no longer resurrect expired values: the extension CAS is gated on !expired, fixing the waited-expired path where a MaxAge(0) collapse value was revived for the idle window; and the driving Get's tail no longer extends (it publicly returns Miss, not a Hit, and extension was silently rewriting the MaxAge-born TTL). - Clean and idle extension now contend on the expiry word: Clean claims it (CAS to the expired sentinel) before tombstoning, so a racing extension either wins (entry kept) or loses the same way it loses to Expire, instead of being silently thrown away by the eviction. - Clean no longer evicts entries whose stale is still being served: for born-expired entries (MaxAge(0)/MaxErrorAge(0) store -1) the grace window anchored at the epoch rather than the entry. Pre-existing bug found while verifying the Clean change. - Get's replace decision is re-derived from prev itself, and the stale piggyback is pinned to the slot it read, so a load that finalizes live in the window between Get's slot read and entGet's cannot be demoted to a stale and redundantly re-driven. - Delete and Swap classify the displaced value with a clock sampled at the displacing CAS, not after, so a value removed live is not misreported as expired. - All expiry arithmetic saturates at MaxInt64 (satAdd), so huge Durations mean "effectively forever" instead of wrapping negative into born-expired (or, in Clean, evict-everything). - Doc fixes: NaN / non-comparable key behavior, MaxErrorAge's no-throttling-with-stales caveat, MaxIdleAge best-effort note, MaxStaleAge without MaxAge, autoclean pinning the cache, terminal- tombstone contract on trieEntry, StopAutoClean/Set wording. All four new regression tests fail against the reverted implementation (the redrive test by double-close panic); the full -race suite, the linearizability checker, and the prior round's hammers pass on amd64, with vet clean on linux/386.

Third audit round, independently verified (interleaving traces of every expires-word consumer against the new sentinel; mutation check: the regression tests fail against the reverted implementation): - Clean now claims an entry's expiry word with a dedicated sentinel (expiredClaimed, -2) instead of -1. -1 is the born-expired sentinel, which newStale deliberately re-anchors at now — so a Get snapshotting a stale in the window between Clean's claim and its tombstone CAS would resurrect the certified-dead value as a fresh Stale for up to another MaxStaleAge. entMaybeNewStale treats a claimed word as "no stale". Pinned deterministically by performing Clean's claim by hand (TestClean_ClaimWindowDoesNotResurrectDeadStale) plus a probabilistic hammer against the real Clean. - Born-expired entries (MaxAge(0) / MaxErrorAge(0) store -1) are now reclaimable as soon as their stale dies: their grace window was computed as satAdd(-1, maxStaleAge), anchored at the process epoch, so stale-less dead entries (error churn under MaxErrorAge(0)) could not be reclaimed until process uptime exceeded MaxStaleAge — and never, for huge stale ages. Reclaim is gated on TryGet visibility: value expired and stale dead. - Idle extension re-samples the clock immediately before publishing, shrinking the revive-after-expiry window from "whole deschedule since the expiry check" to a few instructions; the residual window is documented as best effort on MaxIdleAge. - Doc accuracy: Swap/Set "load is canceled" -> "load's result is discarded (the miss function is not interrupted)"; TryGet docs now cover expired-with-valid-stale returning Stale; MaxStaleAge accepts any negative as "forever" and documents the store-anchored stale window of Expired-but-unrefreshed keys; Item/Set Clean docs match Cache.Clean's actual rules.

…orward Fourth audit round (expiry/stale state machine), independently verified: all seven of the audit's repro tests were confirmed red against b200842 and are green now; eleven regression tests are ported into the suite. The round found three root causes composing into five bugs, two High: - The stale field conflated two meanings: the refresh companion (the previous generation, served while a replacement loads or errors) and a value's own post-expiry grace window. Only Set/Swap values got a self window (finalizedLoading aliased the companion to the value); Get- loaded values had none — TryGet missed at the instant of expiry while Get served a stale for the same state — or carried the previous generation, letting TryGet serve an older value than the freshest dead one. Idle extensions moved the expiry but not the frozen window, so long-lived entries died with their promised window already closed. The self window is now derived at read time from the expires word (staleOpen: anchor + MaxStaleAge, where anchor is the expiry, the birth for born-expired values, or the Expire timestamp), used identically by entGet, loadingTryGet, Swap's classification, and Clean's reclaim gate, so every read and refresh agrees by construction. The stale field reverts to a pure refresh companion (in-flight and errored reads only); finalizedLoading stores none. - Born-expired words (-1) destroyed the only datum that could anchor their window: entMaybeNewStale re-anchored a certified-dead value's stale at every Get-driven refresh, serving values of unbounded age as Stale (TryGet: Miss, Get: Stale, no race required). The word now carries the negated birth (-now(); the epoch pad keeps it clear of expiredClaimed), and the window anchors at the birth everywhere. - Expire blindly stored now()-1, moving already-expired words forward — re-anchoring dead stale windows (the invalidation API extending data lifetime) and overwriting Clean's claim sentinel inside its claim window. Expire now CASes and only ever moves the word backward; expiring an expired entry is a no-op, documented. Documented, not changed: Delete/Swap may report a concurrently finalizing load as the removed value (closing it costs more than the return triple is worth), and under MaxErrorAge(0) collapsed waiters on an erroring load re-drive serially (the literal "errors are never cached" contract; sharing completed errors is a design change deserving its own decision). Plus doc accuracy: MaxStaleAge's uniform anchoring, MaxIdleAge shortening on access, errors-cached-forever default, and the expires-word grammar at expiredClaimed.

Fifth audit round (the read-time stale window machinery), one Low: - A read assembled its classification from two separate loads: the expires word, then the clock. A reader descheduled between them while an idle extension landed classified the dead pre-extension word at a fresh clock — TryGet reported Stale (with MaxStaleAge) or Miss (without) for an entry that was live at every instant of its call, and a later TryGet returned Hit with no intervening write, a history no sequential order explains. Get healed itself by re-deriving in its slow loop; TryGet surfaced it. expiresNow now pairs the word with the clock coherently: load, sample, confirm the word unchanged, re-deriving on movement. Each retry re-loads before re-sampling, preserving the ordering that classifies a concurrent Expire as expired rather than extending over it. Both readers use it; Range's and Delete's caller-pinned clocks predate the word load, so that inverted pairing is already conservative. The repros hold the deschedule window open via expiresPairingHook, an unexported nil-by-default test hook (the trie hashFn precedent), and were confirmed red against the pre-fix code. - The finalized classification was hand-mirrored across loadingTryGet, entGet, and Swap's displaced-value report — a divergence class that already bit once. classifyFinalized is now the one shared ladder; entGet's waited-courtesy precheck stays at its call site, and the idle-extension epilogue keys off the classified state. Doc precision: an Expire racing the entry's natural death can land just after it and trim (never grow) the dead value's remaining stale window; non-positive MaxIdleAge is ignored entirely. Plus five pins for cells the suite did not assert: Swap displacing a born-expired value (both sides of the birth-anchored window), Expire on an expired-but-in-window word (byte-identical no-op, positive and negated-birth words), Clean keeping an expired error whose companion is still servable, exact-nanosecond window boundaries across staleOpen, newStale, and stale.expired, and the safe (backward) direction of the paired read.

Sixth audit round (full re-read of cache.go and trie.go plus every test file, all six scenario classes re-traced), one Low: - tryCAS sampled the matched value's liveness once, then allocated the replacement loading, then published with the pointer CAS — the whole allocation (the widest deschedule risk in the call) sat between the liveness sample and the publish. A CAS caller parked there while the value's TTL lapsed still published: TryGet had already certified Miss, and no sequential order explains TryGet=Miss followed by CAS=true (CAS-first requires the TryGet to see the freshly installed value; TryGet-first requires the CAS to fail on an expired match). The window cannot be closed outright: a successful swap takes effect at the pointer CAS, and the clock cannot be read at that exact instant — pre-claiming the expiry word would instead delay observable expiry, and post-CAS rollback can race a reader that already observed the new value. So this follows the MaxIdleAge precedent: tryCAS now re-validates immediately before every CAS attempt, shrinking the window from "alloc + deschedule" to a few instructions, and the residue is documented on CompareAndSwap and CompareAndDelete (the mismatch direction stays exact). The repro holds the deschedule open via casPublishHook (nil-by-default, the expiresPairingHook precedent) and was confirmed red against the pre-fix ordering for both CAS and CAD. The linearizability checker deliberately runs an expiry-free config (expiry is a clock transition, not an operation), so the hook pin is the vehicle for this finding, as in rounds three through five. Doc accuracy: AutoCleanInterval now states that with a negative MaxStaleAge the goroutine is not started at all (Clean would be a no-op; pinned by TestAutoClean_DisabledWhenStaleAgeNegative); Item's CompareAndSwap/CompareAndDelete point at the Cache caveat. Everything else re-verified clean this round: tombstone terminality under the bucket lock, the claim sentinel against every expires-word consumer, derived stale windows being absolute intervals (a late snapshot cannot extend one), prune lock ordering (child linked implies parent not dead), Get's pin-and-re-derive slow path, and the WaitGroup/mutex finalize-once protocol.

Seventh audit round (full re-read of cache.go, trie.go, and every test file; all six scenario classes re-traced), two Lows, both doc accuracy on the same caveat surface: - The sixth round's commit message claims Item's CompareAndSwap and CompareAndDelete "point at the Cache caveat", but that change never made it into the diff: the Item docs carried neither the stale-values-not-considered sentence nor the best-effort expiry caveat. Both now point at the Cache method docs, following the Item.Get "See Cache.Get" precedent. - The Cache caveat described the best-effort residue as "a value whose TTL lapses while the swap is in flight"; the same instructions-wide window also lets a concurrent Expire's kill slip through, and that direction is otherwise unexplainable (Expire-first requires the CAS to fail on the dead match; CAS-first requires the Expire to kill the swapped-in value, which a later TryGet=Hit contradicts). The window is irreducible for Expire exactly as for TTL lapse — Expire writes the expiry word while the publish is a pointer CAS, and the sixth round already weighed and rejected pre-claiming the word and post-CAS rollback — so the caveat now says expiry generally: TTL or Expire. The wide (allocation) side of the Expire direction is exact, not best effort: the pre-publish re-validation loads the expiry word fresh, so an Expire landing while the caller is parked there is caught. Pinned by TestTryCAS_RevalidationCatchesExpire via casPublishHook (the TTL sibling only lapses a deadline), confirmed red against the pre-revalidation ordering for both CAS and CAD. Everything else re-verified clean this round: the claim sentinel against Swap/Delete/Expire/extension (including a benign by-value ABA on the expires word: a claim landing after extend-then-Expire returns the word to its evaluated value is semantically identical to the no-race claim), tombstone-shell unlink-then-recreate under every writer path (never two live entries for one key), walk's single-visit-per-key claim under expand and prune, Clear against in-flight writers on the old root, the error/companion classification ladder, and Clean's conservative companion gate for err-free values (provably never outlives the derived self-window gate).

Eighth audit round (full re-read of cache.go, trie.go, and every test file; all six scenario classes re-traced), one Low: - casMatches sampled the clock before loading the expiry word (!l.expired(now())): now() was evaluated as the argument, then expired loaded the word. A CAS caller descheduled between the two judged a fresh word against a stale clock, so a kill that fully completed while it was parked went unseen: for Expire, the stored word n_e-1 compares above the stale clock; for a lapsing TTL, the fixed word never falls below a clock sampled before the lapse. Either way the publish lands against a value TryGet already certifies as Miss, and no sequential order explains kill, TryGet=Miss, CAS=true, TryGet=Hit(new) (a CAS-first order requires the final TryGet to miss the freshly swapped value; kill-first orders require the CAS to fail on the dead match). The sixth round closed the allocation-side window with the pre-publish re-validation, but the re-validation itself kept this instructions-wide clock-vs-word gap open on both kill modes. casMatches now takes its (word, clock) pair from expiresNow, exactly as every read path does: the clock postdates the word it judges, so any kill that completes before the pair is taken is always seen, and an idle extension racing the pair re-derives instead of spuriously failing a continuously-live match (the round-five misclassification class, which is why the simpler unconfirmed word-then-clock ordering was not an option). The residue is now exactly the instructions between the re-validation and the pointer CAS, which is what the CompareAndSwap / CompareAndDelete docs already document as best effort. Pinned by TestTryCAS_RevalidationPairsWordAndClock via a counting expiresPairingHook (parks the re-validation's paired read, the call's second), covering {CompareAndSwap, CompareAndDelete} x {TTL lapse, Expire}; all four subtests confirmed red against the pre-fix ordering with the hook held at the old deschedule point. The linearizability checker deliberately runs expiry-free, so the hook pin is the vehicle, as in rounds three through seven. Everything else re-verified clean this round: every other expiry-word read is either paired (entGet, TryGet), CAS-guarded (Expire, Clean's claim, idle extension), pinned to a deliberately conservative clock (Range, Delete, Swap's displaced-value report), or feeds a retry loop that re-derives before acting (Get's replace gate); the claim sentinel, tombstone terminality, walk single-visit, and prune lock ordering all re-traced clean. Full -race suite green; vet clean on native and linux/386.

Doc accuracy, from the eighth audit round: Cache.Range promised "calls fn for every cached value" with no concurrency caveat. The trie walk takes no snapshot — the old read/dirty implementation ranged a promoted read map, which did fix the key set at the start of the call, so the port silently weakened Range's consistency while the doc stayed one line. State the actual contract, in sync.Map.Range's terms: no key is visited more than once (pinned by TestConcurrentRange's seen-twice check under churn), a concurrently mutated or expiring key may be reflected from any point during the call (Range classifies every entry at one clock pinned at entry), and Range does not block other methods — fn may re-enter the cache (pinned by TestMapRangeNestedCall). Set.Range points at the Cache caveat, following the Item "See Cache.Get" precedent.

Doc accuracy, from the ninth audit round: the Get family's stale sentence covered only the errored case — 'if stale values are enabled, the currently cached value has an error, and there is an unexpired stale value, this returns the stale value and no error' — wording inherited verbatim from the read/dirty implementation. The third round taught TryGet's doc that an expired value with an open stale window is returned as Stale, and the fourth round made that window a first-class read state (derived from the expiry word, agreed on by every read and refresh), but Get's doc never learned it: a Get on an expired value with an open window returns the value as Stale immediately while driving the refresh, and it re-runs the miss function for an expired key in the first place — neither of which the doc stated. KeyState's Stale doc and MaxStaleAge already describe both kill modes, so Get was the one surface left claiming the narrower contract. Get, Item.Get, and Set.Get now say it the way TryGet does: the miss runs if the key is not yet cached or the cached value has expired, and an unexpired stale — the value errored, or expired with its window open — is returned with no error. Behavior unchanged and already pinned: TestExpires asserts Get returns (v, nil, Stale) for an expired-with-window value, and TestGet_ConcurrentStaleDuringInFlight pins the piggybacked flavor.

Doc accuracy, from the ninth audit round: MaxErrorAge claimed 'with neither option set, a load error is cached forever and never retried'. The fourth round wrote that sentence weighing only MaxAge; it never weighed MaxIdleAge, whose initial-TTL rule in newExpires applies to every entry — the err branch falls through to the same ttl derivation, so with only MaxIdleAge set a load error is cached for the idle age and then retried, not cached forever (verified live: error Hit within the window, Miss after, next Get re-drives). The behavior is the sensible one — an idle-config cache should not pin errors permanently — so the doc moves to the code: the default is MaxAge, or the idle age when only MaxIdleAge is set (errors share the initial TTL but are never idle-extended; extension is gated on err == nil everywhere). The forever claim is rescoped to neither-MaxAge-nor-MaxIdleAge. Pinned by TestMaxErrorAge_DefaultsToIdleInitialTTL: cached error is a Hit within the idle window, Miss after it lapses, and the next Get drives a fresh load. The no-extension half was already pinned for the MaxAge config by TestMaxIdleAge/no_extend_errors; the new test's early probe runs immediately after the load so even a regressed extension could not outlive the post-window probe.

Doc accuracy, from the tenth audit round: Get's first sentence claimed the miss function runs "if the key is not yet cached or the cached value has expired" — an exhaustive-reading condition that misses the third trigger. A cached error whose own age has not lapsed is replaced and re-driven whenever a valid stale exists: the slow path's replacement filter passes any finalized errored loading regardless of expiry, exactly as MaxErrorAge documents ("a cached error suppresses repeat queries only when there is no stale value to return ... always drives a refresh, regardless of the error's remaining age"). The ninth round's rewrite (052940d) brought the expired-with-stale return into Get's doc but enumerated only the two run triggers, leaving Get contradicting MaxErrorAge for live errors. Get, Item.Get, and Set.Get now list the third trigger, with the MaxErrorAge rationale parenthesized on Cache.Get. Behavior unchanged and already pinned: TestExpires's stale-while-error block returns Stale from a Get whose miss function is concurrently executing (blocked on a channel) while the cached error is unexpired (no error age is configured there, so the error would otherwise be cached and served forever).

Doc accuracy, from the tenth audit round: Clear's doc said only "deletes all keys, resetting it to an empty state", leaving its interplay with in-flight loads underivable. Clear swaps the trie root, so a load in flight at the Clear keeps running against the detached tree: its result still reaches every Get waiting on it (the waiters hold the detached entry and its loading directly), but it is never cached — a later Get for the same key misses and drives a fresh, independent load. This is the per-key Delete caveat applied wholesale, but nothing said Clear behaves like a bulk Delete rather than, say, canceling loads or letting their results re-land in the emptied cache. Cache.Clear now states it; Item.Clear and Set.Clear point at it, following the Item.CompareAndSwap "see Cache.X" precedent. Pinned by TestClear_DoesNotCacheInFlightLoads, the Clear sibling of TestDelete_DuringInFlightLoad: with the miss held open across the Clear, the waiter still receives the load's value, and the post-Clear Get re-runs the miss instead of finding the detached result.

twmb added 21 commits April 21, 2026 12:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cache: replace read/dirty with a concurrent hash-trie#6

cache: replace read/dirty with a concurrent hash-trie#6
twmb wants to merge 21 commits into
mainfrom
trie-cache

twmb commented Apr 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

twmb commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

The port (first six commits)

Hardening (remaining fifteen commits)

Semantic shifts vs. read/dirty

Verification

Benchmarks

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

twmb commented Apr 21, 2026 •

edited

Loading