feat(vardiff): replace threshold-ladder with adaptive EWMA algorithm#2188
gimballock wants to merge 1 commit into
Clean extraction of the best-performing vardiff algorithm from the simulation framework in stratum-mining#2154, with all test scaffolding, traits, and alternative algorithm implementations removed.

The previous VardiffState used a fixed time-dependent threshold ladder and full retarget. This produced:
- 6.6% median settled error (p99: 30% at low SPM)
- 5–9 minute cold-start convergence (p90)
- 33% detection rate for 10% hashrate declines (thermal throttle, failing ASICs)
- 28% target overshoot during cold-start ramp (p99 at SPM 6)

The new algorithm (EWMA + adaptive boundary + accelerating partial retarget):
- Settled accuracy: <3% median error across all SPM
- Cold-start overshoot bounded to <10% (was 28%)
- Jitter: 0.03 fires/min at low SPM (was 0.06), half the unnecessary retargets
- Small-change detection: 85% reaction to -10% steps at SPM 6 (was 33%)
- Transient disconnects recover in 1–2 fires rather than requiring a full cold-start ramp (20%/fire partial retarget vs old algo's 50–67% slash)
- Asymmetric cost: loosening fires 3x faster than tightening, because loosening is free but tightening rejects in-flight shares

Breaking: adds private fields to VardiffState (previously all-pub). Requires channels_sv2 major version bump. Public constructor API (new, new_with_min) and Vardiff trait interface are unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
We discussed this PR at @TriangleBitDevs yesterday and I think we came away with some valuable feedback. Shoutout @gimballock for presenting your work!
I don't believe the asymmetric boundary cost is a useful optimization. This seems to stem from a misunderstanding of how difficulty adjustments interact with existing jobs. Difficulty updates do not invalidate existing jobs. Those jobs remain valid until an [...]

The lines of code that prove it:
I think we also exposed a similar misunderstanding of how shares rejected due to low difficulty impact profitability of a miner. They don't. Any shares rejected due to low difficulty do not represent lost value in aggregate because shares found at or above the new difficulty threshold are worth proportionally more. In other words, the shares that pass the new difficulty threshold make up for the lost value of any shares that do not pass the new threshold. Rejected shares are a usability problem because they seem to indicate an error to the human monitoring these metrics. Assumptions that these shares lead to lost value for the miner arise from misunderstanding the nuances of share value calculation.
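To make the share-value argument above concrete, here is a minimal sketch. It assumes the pool credits each accepted share in proportion to the difficulty it was accepted at; the function names are hypothetical and not taken from the codebase.

```rust
/// Probability that a share which met `old_diff` also meets the higher `new_diff`.
/// A hash below the old target is uniformly distributed under it, so the
/// fraction that also falls below the new (smaller) target is old_diff / new_diff.
pub fn pass_probability(old_diff: f64, new_diff: f64) -> f64 {
    (old_diff / new_diff).min(1.0)
}

/// Expected credit for one in-flight share after tightening.
/// ASSUMPTION: an accepted share is credited in proportion to `new_diff`.
pub fn expected_credit_after_tightening(old_diff: f64, new_diff: f64) -> f64 {
    pass_probability(old_diff, new_diff) * new_diff
}
```

Under that assumption, tightening from difficulty 1000 to 4000 rejects three in four in-flight shares, but each survivor is credited 4000, so the expected credit per in-flight share stays at 1000.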
You're right that the pool's [...]

The asymmetric boundaries aren't justified by that rationale though; they're justified by our simulation framework results. We swept [...]

What the asymmetry actually does: it suppresses false tightenings from lucky streaks, reducing steady-state jitter and preventing large upward difficulty jumps. The cost is slower convergence when tightening is genuinely needed (e.g. hardware upgrade).

One caveat we're transparent about: our fitness metric weights stability heavily (jitter + step_magnitude_safety = 50%) over reactivity (25%) and convergence (15%). Under more balanced weights (30/30/30/10), the convergence penalty gets amplified 2x while the stability gains get discounted, so a lower multiplier like 1.5–2.0 would likely rank higher, producing faster adaptation to genuine hashrate increases at the cost of more false fires and higher jitter on transient spikes.

So the 3.0 constant is optimal given our prioritization, not in any absolute sense: it optimizes for "never surprise the miner" over "track hashrate changes quickly." We're open to discussing whether the weighting reflects the right tradeoffs for the upstream project, and whether a lower multiplier (with its faster convergence but higher churn) better serves the broader community of pool operators.
Summary
Replace the threshold-ladder vardiff algorithm with an adaptive composition that addresses long-standing issues with difficulty adjustment quality.
This is the clean, production-ready extraction of the best-performing algorithm from the simulation framework in #2154. All test scaffolding, alternative implementations, and trait abstractions from that research branch have been removed; this PR is a single flat struct replacement of `VardiffState`.

Motivation
The existing vardiff was observed producing excessive jitter at moderate share rates and sluggish reaction to genuine hashrate changes at high share rates. @adammwest suggested Poisson confidence interval bounds as a statistically-grounded alternative to the fixed threshold ladder.
Investigation with the simulation framework (#2154) revealed that these issues can't be fixed by tuning the existing algorithm's parameters — they're limitations of its design:
Reactivity degrades with share rate. After the algorithm converges, it keeps averaging over an ever-growing window. When a miner's hashrate drops, the old stable data dilutes the new signal. At 60+ shares/min, only 9–16% of genuine -50% hashrate drops are detected within 5 minutes.
Jitter and reactivity are coupled. The fixed 15% threshold (used when difficulty hasn't changed for at least 5 minutes) is simultaneously too loose at high share rates (noise floor is only 3–9%, so the algorithm fires on random variance) and too tight at low share rates (genuine changes produce deviations below 15% because there just aren't enough shares to measure precisely).
Full retarget overshoots on noisy estimates. When the algorithm does fire, it jumps the full distance to its estimate, which is based on whatever shares happened to arrive since the last retarget, however noisy that sample is. During cold-start or right after a hashrate change, when the sample is always noisy, the new difficulty overshoots the true value, by as much as ~30% during cold-start.
Tightening and loosening have asymmetric costs. Increasing difficulty rejects in-flight shares (the miner already did work that no longer meets the threshold). Decreasing difficulty is free (old harder work is still valid). The old algorithm treats both directions identically.
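The noise-floor point in the list above can be illustrated with a short calculation. If share arrivals are modelled as a Poisson process, the relative standard deviation of a share count is 1/sqrt(expected shares). The sketch below assumes a 5-minute measurement window purely for illustration; the helper name is hypothetical, and the PR's own 3–9% figure may have been derived differently.

```rust
/// Relative standard deviation of a Poisson-distributed share count:
/// std / mean = sqrt(n) / n = 1 / sqrt(n), where n is the expected count.
/// Both arguments are expected to be positive.
pub fn relative_noise(shares_per_min: f64, window_mins: f64) -> f64 {
    let expected_shares = shares_per_min * window_mins;
    1.0 / expected_shares.sqrt()
}
```

With these assumptions, 60 shares/min over 5 minutes gives about 5.8% noise (well under a fixed 15% threshold), while 6 shares/min over the same window gives about 18% (above it), which is why a single fixed threshold cannot suit both regimes.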
The algorithm
The new vardiff decomposes the difficulty adjustment decision into three sequential stages:
After a fire, the estimator is notified (`on_fire`) so it can adjust its internal state to account for the new difficulty level. This is how the algorithm preserves information across retargets rather than starting from scratch.

Stage 1: EWMA Estimator
Instead of a raw cumulative average, we use an exponentially-weighted moving average with a 120-second time constant. Recent observations count more than old ones, so when hashrate genuinely changes, the old stable-period data fades out naturally rather than diluting the signal indefinitely. On fire, the EWMA rescales its internal rate by the difficulty ratio (rather than resetting to zero), preserving the smoothing history.
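As a rough illustration of the estimator described above, here is a minimal sketch of a time-constant EWMA over share rate with rescale-on-fire. The type and method names are hypothetical, not the PR's actual fields, and seeding the average with the first sample is an assumption made for this example.

```rust
/// Hypothetical EWMA share-rate estimator (illustrative, not the PR's code).
#[derive(Debug, Clone)]
pub struct EwmaRate {
    tau_secs: f64,     // time constant, e.g. 120.0
    rate: Option<f64>, // smoothed shares per second; None until first sample
}

impl EwmaRate {
    pub fn new(tau_secs: f64) -> Self {
        Self { tau_secs, rate: None }
    }

    /// Fold in `shares` observed over the last `dt_secs` seconds and return
    /// the smoothed rate. Non-positive intervals are ignored.
    pub fn observe(&mut self, shares: u32, dt_secs: f64) -> Option<f64> {
        if dt_secs <= 0.0 {
            return self.rate;
        }
        let sample = f64::from(shares) / dt_secs;
        // Weight of the new sample for an irregular tick of length dt.
        let alpha = 1.0 - (-dt_secs / self.tau_secs).exp();
        let updated = match self.rate {
            None => sample, // ASSUMPTION: seed with the first sample
            Some(prev) => prev + alpha * (sample - prev),
        };
        self.rate = Some(updated);
        self.rate
    }

    /// After a retarget, the same hashrate yields shares at a rate scaled by
    /// old/new difficulty, so rescale instead of resetting the history.
    pub fn on_fire(&mut self, old_difficulty: f64, new_difficulty: f64) {
        if let Some(rate) = self.rate.as_mut() {
            *rate *= old_difficulty / new_difficulty;
        }
    }

    pub fn rate(&self) -> Option<f64> {
        self.rate
    }
}
```

With a 120-second time constant, a 60-second tick gives the new sample a weight of 1 - e^(-0.5), roughly 0.39, so older data fades within a few minutes rather than accumulating indefinitely as in a cumulative average.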
Stage 2: Adaptive Boundary
The decision threshold adapts based on the miner's configured share rate:
Stage 3: Accelerating Partial Retarget
Rather than jumping the full distance to the estimate, each fire moves only 20% toward it. If the algorithm keeps firing in the same direction, the step size ramps to 40%, converging faster when the signal is clearly real.
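The behaviour described above might be sketched as follows. The 20% base and 40% cap come from the description; the ramp schedule (+10% per consecutive same-direction fire), the linear interpolation, and all names are assumptions made for illustration only.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Direction {
    Tighten,
    Loosen,
}

const BASE_STEP: f64 = 0.20; // from the description
const MAX_STEP: f64 = 0.40; // from the description
const STEP_INCREMENT: f64 = 0.10; // ASSUMPTION: ramp schedule is not specified

/// Hypothetical partial-retarget state (illustrative, not the PR's code).
#[derive(Debug, Default)]
pub struct PartialRetarget {
    last_direction: Option<Direction>,
    streak: u32, // consecutive prior fires in the same direction
}

impl PartialRetarget {
    /// Move `current` part of the way toward `estimate` and return the new value.
    pub fn fire(&mut self, current: f64, estimate: f64) -> f64 {
        let direction = if estimate > current {
            Direction::Tighten
        } else {
            Direction::Loosen
        };
        // Accelerate only while the direction stays the same.
        self.streak = if self.last_direction == Some(direction) {
            self.streak + 1
        } else {
            0
        };
        self.last_direction = Some(direction);
        let step = (BASE_STEP + STEP_INCREMENT * f64::from(self.streak)).min(MAX_STEP);
        current + step * (estimate - current)
    }
}
```

Starting at 100 with an estimate of 200, successive fires under this schedule land at 120, 144, and 166.4; a fire in the opposite direction drops the step back to 20%.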
Results (1000 trials/cell, deterministic seeding)
Breaking changes
- Adds private fields to `VardiffState` (previously all-pub struct). Requires `channels_sv2` major version bump.
- `shares_since_last_update` field semantics changed from "shares since last fire" to "shares since last evaluation tick" (the EWMA consumes and zeroes the counter each tick).

Public constructor API (`new`, `new_with_min`) and `Vardiff` trait interface are unchanged.

Test plan
- `cargo test -p channels_sv2`: 14 vardiff property tests
- `cargo test --verbose`: full workspace passes
- `cargo fmt --check`: clean