Picking up the deref-lock optimization you flagged as a follow-up in #210 ("your top-of-stack-cache / clear-isDynamic-on-empty suggestion is worth doing, but as a follow-up with numbers driving it"). I built one version and measured it. The decision forks two ways, and which way is right depends on let-go's real read/write ratio for dynamic vars, which you'll have a better feel for than I do.
Cost
Since isDynamic is the declaration flag, it stays set for the var's whole life: SetDynamic() is called for every ^:dynamic var at compile time (compiler.go:1700), and any push sets it too. So ExecContext.deref consults the per-context binding stack on every read of a dynamic var, taking bindingStack.mu even when the var has no active binding in this context. For *out*/*ns*/etc. that's a lock on a hot path.
That also rules out a literal "clear-isDynamic-on-empty": clearing it would break the declaration semantic (a ^:dynamic var with no current binding must still report dynamic and must still be checked, because it can be bound at any time). So the lever is the lookup, not the flag.
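For reference, the read path described above can be sketched roughly as follows. This is a simplified model, not the actual let-go source: `isDynamic`, `ExecContext.deref` and `bindingStack.mu` are named in this PR, while the `Var` layout, the `root` field and the `bindings` map shape are assumptions made for illustration.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// Var is a hypothetical stand-in for let-go's var type.
type Var struct {
	isDynamic atomic.Bool // declaration flag; set once, never cleared
	root      any         // root value (plain field here to keep the sketch short)
}

// bindingStack models one context's dynamic bindings (assumed layout).
type bindingStack struct {
	mu       sync.Mutex
	bindings map[*Var][]any // per-var stack of bound values
}

type ExecContext struct {
	bindings bindingStack
}

// push establishes a binding; as in the PR text, any push also sets isDynamic.
func (ec *ExecContext) push(v *Var, val any) {
	v.isDynamic.Store(true)
	ec.bindings.mu.Lock()
	defer ec.bindings.mu.Unlock()
	if ec.bindings.bindings == nil {
		ec.bindings.bindings = make(map[*Var][]any)
	}
	ec.bindings.bindings[v] = append(ec.bindings.bindings[v], val)
}

// pop removes the innermost binding of v, if there is one.
func (ec *ExecContext) pop(v *Var) {
	ec.bindings.mu.Lock()
	defer ec.bindings.mu.Unlock()
	s := ec.bindings.bindings[v]
	if len(s) <= 1 {
		delete(ec.bindings.bindings, v)
		return
	}
	ec.bindings.bindings[v] = s[:len(s)-1]
}

// deref takes the mutex on every read of a dynamic var, even when this
// context holds no binding for it. That lock is the cost under discussion.
func (ec *ExecContext) deref(v *Var) any {
	if !v.isDynamic.Load() {
		return v.root // non-dynamic vars never touch the stack
	}
	ec.bindings.mu.Lock()
	defer ec.bindings.mu.Unlock()
	if s := ec.bindings.bindings[v]; len(s) > 0 {
		return s[len(s)-1]
	}
	return v.root
}
```

In this model a read of an unbound `^:dynamic` var pays for a lock/unlock pair only to discover the map has no entry for it.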
Option A — copy-on-write binding map
Built and measured this. Hold the binding map behind an atomic.Pointer, immutable once published. Reads load it lock-free; writers (push/pop/setCurrent/installSnapshot) serialize on the retained mutex, copy the map, and atomic-swap. Per-context isolation is unchanged. Full suite + -race green.
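A minimal sketch of the copy-on-write idea, assuming a map-of-stacks layout; the real branch may differ. The `frame` type and the `update` and `lookup` helpers are hypothetical names, and only `push` and `pop` are shown (`setCurrent` and `installSnapshot` would go through the same writer path).

```go
package main

import (
	"sync"
	"sync/atomic"
)

// Var is a hypothetical stand-in for let-go's var type.
type Var struct {
	isDynamic atomic.Bool
	root      any
}

// frame is one published snapshot of a context's bindings.
// It is never mutated after being stored in bindingStack.current.
type frame map[*Var][]any

type bindingStack struct {
	mu      sync.Mutex            // serializes writers only
	current atomic.Pointer[frame] // readers load this without locking
}

type ExecContext struct {
	bindings bindingStack
}

// lookup is the lock-free read: one atomic load plus a map probe.
func (bs *bindingStack) lookup(v *Var) (any, bool) {
	f := bs.current.Load()
	if f == nil {
		return nil, false
	}
	s := (*f)[v]
	if len(s) == 0 {
		return nil, false
	}
	return s[len(s)-1], true
}

// update copies the published map, applies fn to the copy, and swaps it in.
// The fresh map per write is where the extra write-side allocations come from.
func (bs *bindingStack) update(fn func(frame)) {
	bs.mu.Lock()
	defer bs.mu.Unlock()
	next := frame{}
	if old := bs.current.Load(); old != nil {
		for k, s := range *old {
			next[k] = s
		}
	}
	fn(next)
	bs.current.Store(&next)
}

func (bs *bindingStack) push(v *Var, val any) {
	v.isDynamic.Store(true)
	bs.update(func(f frame) {
		s := f[v]
		// Capping capacity forces append to allocate, so a backing array
		// still visible through an older snapshot is never written to.
		f[v] = append(s[:len(s):len(s)], val)
	})
}

func (bs *bindingStack) pop(v *Var) {
	bs.update(func(f frame) {
		s := f[v]
		if len(s) <= 1 {
			delete(f, v)
			return
		}
		f[v] = s[:len(s)-1] // reslicing only; shared data is not modified
	})
}

// deref never takes the mutex, whether or not v is currently bound.
func (ec *ExecContext) deref(v *Var) any {
	if v.isDynamic.Load() {
		if val, ok := ec.bindings.lookup(v); ok {
			return val
		}
	}
	return v.root
}
```

A reader that loaded an older snapshot keeps seeing a consistent view; it simply misses bindings published after its load.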
Reads (benchstat, n=8, local M-series; treat the deltas as the signal, absolutes will differ on your box):
| Benchmark | before | after | delta | notes |
|---|---|---|---|---|
| VarDerefPreviouslyBound | 18.2 ns | 6.7 ns | -63% | declared dynamic, no active binding; the common *out* read |
| VarDerefPreviouslyBoundParallel | 98.2 ns | 11.7 ns | -88% | |
| VarDerefBound | 22.8 ns | 14.5 ns | -36% | read inside an active binding |
| VarDerefBoundParallel | 118 ns | 17.6 ns | -85% | the contention you raised |
| VarDerefRoot / RootParallel / DistinctParallel | | | unchanged | already lock-free |
| geomean | | | -57% | |
The cost lands on the write side: every binding establishment now allocates fresh maps:
| BindingPushPop | before | after | delta |
|---|---|---|---|
| time | 84 ns | 364 ns | +335% |
| bytes | 16 B | 704 B | +43x |
| allocs | 1 | 7 | |
So A fixes every read path, including the parallel-bound contention you called out, but makes (binding [...] ...) establishment more expensive. Whether that's a good trade depends on how binding-heavy real workloads are.
Option B — per-context active-binding counter
Described, not yet built. Keep isDynamic as the declaration flag, but add an atomic count of active bindings to each bindingStack, and gate the lock: if v.isDynamic.Load() && ec.bindings.count > 0. A context with no active bindings reads every dynamic var lock-free; the write path is unchanged but for one atomic add. This is closer to the "clear-on-empty" framing: it returns a context to the fast path when its stack drains.
The trade is narrower: B fixes the common unbound read (VarDerefPreviouslyBound, roughly to root speed) at ~zero write cost, but leaves reads inside an active binding (including VarDerefBoundParallel) on the lock, since the counter is non-zero there. I can build and measure it if the trade looks right to you.
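Since B is not built yet, here is only a sketch of what the gate could look like. It assumes `count` is an `atomic.Int64` maintained under the existing mutex, and reuses the same hypothetical `Var` and map-of-stacks layout as a stand-in for the real types.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// Var is a hypothetical stand-in for let-go's var type.
type Var struct {
	isDynamic atomic.Bool
	root      any
}

type bindingStack struct {
	mu       sync.Mutex
	count    atomic.Int64 // active bindings in this context (the proposed addition)
	bindings map[*Var][]any
}

type ExecContext struct {
	bindings bindingStack
}

// push is the existing locked write plus one atomic add.
func (ec *ExecContext) push(v *Var, val any) {
	v.isDynamic.Store(true)
	bs := &ec.bindings
	bs.mu.Lock()
	defer bs.mu.Unlock()
	if bs.bindings == nil {
		bs.bindings = make(map[*Var][]any)
	}
	bs.bindings[v] = append(bs.bindings[v], val)
	bs.count.Add(1)
}

// pop decrements the counter only when a binding was actually removed.
func (ec *ExecContext) pop(v *Var) {
	bs := &ec.bindings
	bs.mu.Lock()
	defer bs.mu.Unlock()
	s := bs.bindings[v]
	if len(s) == 0 {
		return
	}
	if len(s) == 1 {
		delete(bs.bindings, v)
	} else {
		bs.bindings[v] = s[:len(s)-1]
	}
	bs.count.Add(-1)
}

// deref skips the mutex whenever this context has no active bindings.
// With any binding active (for any var) it still takes the lock.
func (ec *ExecContext) deref(v *Var) any {
	if v.isDynamic.Load() && ec.bindings.count.Load() > 0 {
		ec.bindings.mu.Lock()
		defer ec.bindings.mu.Unlock()
		if s := ec.bindings.bindings[v]; len(s) > 0 {
			return s[len(s)-1]
		}
	}
	return v.root
}
```

Note that the counter is per context, not per var, so one active binding of any var puts every dynamic read in that context back on the locked path.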
The decision point
Which tradeoff fits let-go? A buys lock-free reads everywhere (and kills the bound-parallel contention) at a real write-path regression; B is cheaper and lower-risk but only covers the unbound read. My lean is B as the conservative default: don't regress the write path without evidence the reads it buys are hot. But you raised the bound-parallel contention specifically, which only A addresses, so I didn't want to pick for you.
A is on perf/bound-deref-lock on my fork if you want to check it out and run the benches. Happy to build B for a side-by-side, or to take this whichever direction you prefer once #210 lands (this stacks on it).