Skip to content

feat(vardiff): replace threshold-ladder with adaptive EWMA algorithm#2188

Open
gimballock wants to merge 1 commit into
stratum-mining:mainfrom
marafoundation:feat/vardiff-ewma-adaptive
Open

feat(vardiff): replace threshold-ladder with adaptive EWMA algorithm#2188
gimballock wants to merge 1 commit into
stratum-mining:mainfrom
marafoundation:feat/vardiff-ewma-adaptive

Conversation

@gimballock

@gimballock gimballock commented Jun 9, 2026

Copy link
Copy Markdown

Summary

Replace the threshold-ladder vardiff algorithm with an adaptive composition that addresses long-standing issues with difficulty adjustment quality.

This is the clean, production-ready extraction of the best-performing algorithm from the simulation framework in #2154. All test scaffolding, alternative implementations, and trait abstractions from that research branch have been removed — this PR is a single flat struct replacement of VardiffState.

Motivation

The existing vardiff was observed producing excessive jitter at moderate share rates and sluggish reaction to genuine hashrate changes at high share rates. @adammwest suggested Poisson confidence interval bounds as a statistically-grounded alternative to the fixed threshold ladder.

Investigation with the simulation framework (#2154) revealed that these issues can't be fixed by tuning the existing algorithm's parameters — they're limitations of its design:

  1. Reactivity degrades with share rate. After the algorithm converges, it keeps averaging over an ever-growing window. When a miner's hashrate drops, the old stable data dilutes the new signal. At 60+ shares/min, only 9–16% of genuine -50% hashrate drops are detected within 5 minutes.

  2. Jitter and reactivity are coupled. The fixed 15% threshold (used when difficulty hasn't changed for at least 5 minutes) is simultaneously too loose at high share rates (noise floor is only 3–9%, so the algorithm fires on random variance) and too tight at low share rates (genuine changes produce deviations below 15% because there just aren't enough shares to measure precisely).

  3. Full retarget overshoots on noisy estimates. When the algorithm does fire, it jumps the full distance to its estimate - which is based on whatever shares happened to arrive since the last retarget, however noisy that sample is. During cold-start or right after a hashrate change, when it's always noisy, the new difficulty overshoots the true value — Can be as bad as ~30% overshoot during cold-start.

  4. Tightening and loosening have asymmetric costs. Increasing difficulty rejects in-flight shares (the miner already did work that no longer meets the threshold). Decreasing difficulty is free (old harder work is still valid). The old algorithm treats both directions identically.

The algorithm

The new vardiff decomposes the difficulty adjustment decision into three sequential stages:

  1. Estimator — "What is happening?" Converts raw share arrivals into a smoothed hashrate belief.
  2. Boundary — "Should I act?" Decides whether the deviation from target is a real hashrate change or just random noise.
  3. Update — "How much should I move?" Computes the new difficulty when the boundary says to fire.

After a fire, the estimator is notified (on_fire) so it can adjust its internal state to account for the new difficulty level — this is how the algorithm preserves information across retargets rather than starting from scratch.

Stage 1: EWMA Estimator

Instead of a raw cumulative average, we use an exponentially-weighted moving average with a 120-second time constant. Recent observations count more than old ones, so when hashrate genuinely changes, the old stable-period data fades out naturally rather than diluting the signal indefinitely. On fire, the EWMA rescales its internal rate by the difficulty ratio (rather than resetting to zero), preserving the smoothing history.

Stage 2: Adaptive Boundary

The decision threshold adapts based on the miner's configured share rate:

  • Below 10 shares/min: PoissonCI — wide statistical confidence interval that prevents premature fires when there aren't enough shares per tick for precise measurement.
  • At 10+ shares/min: AsymmetricCUSUM — tighter sequential-testing boundary that accumulates evidence across ticks, enabling faster reaction when share data is abundant.
  • Both boundaries apply asymmetric cost: 3x more evidence required to tighten difficulty than to loosen, reflecting that tightening rejects in-flight shares while loosening is free.

Stage 3: Accelerating Partial Retarget

Rather than jumping the full distance to the estimate, each fire moves only 20% toward it. If the algorithm keeps firing in the same direction, the step size ramps to 40% —converging faster when the signal is clearly real.

Results (1000 trials/cell, deterministic seeding)

Metric Old (SPM 6) New (SPM 6) New (SPM 30)
Cold-start overshoot (p99) 28% 0% 0%
Jitter (mean fires/min) 0.033 0.003 0.002
Convergence time (p90) 9m 3m 6m
Detect -50% drop 87% 87% 100%
Detect -10% drop 29% 33% 50%
Detect +50% increase 10% 85% 79%
Transient disconnect recovery Full cold-start ramp 1–2 fires 1–2 fires

Breaking changes

  • Adds private fields to VardiffState (previously all-pub struct). Requires channels_sv2 major version bump.
  • shares_since_last_update field semantics changed from "shares since last fire" to "shares since last evaluation tick" (the EWMA consumes and zeroes the counter each tick).

Public constructor API (new, new_with_min) and Vardiff trait interface are unchanged.

Test plan

  • cargo test -p channels_sv2 — 14 vardiff property tests
  • cargo test --verbose — full workspace passes
  • cargo fmt --check — clean
  • Deployed to testnet4 with live miners via sv2-apps

@gimballock gimballock force-pushed the feat/vardiff-ewma-adaptive branch from a1871c7 to 9308b4d Compare June 10, 2026 17:11
Clean extraction of the best-performing vardiff algorithm from the
simulation framework in stratum-mining#2154, with all test scaffolding, traits, and
alternative algorithm implementations removed.

The previous VardiffState used a fixed time-dependent threshold ladder
and full retarget. This produced:

- 6.6% median settled error (p99: 30% at low SPM)
- 5–9 minute cold-start convergence (p90)
- 33% detection rate for 10% hashrate declines (thermal throttle, failing ASICs)
- 28% target overshoot during cold-start ramp (p99 at SPM 6)

The new algorithm (EWMA + adaptive boundary + accelerating partial retarget):

- Settled accuracy: <3% median error across all SPM
- Cold-start overshoot bounded to <10% (was 28%)
- Jitter: 0.03 fires/min at low SPM (was 0.06) — half the unnecessary retargets
- Small-change detection: 85% reaction to -10% steps at SPM 6 (was 33%)
- Transient disconnects recover in 1–2 fires rather than requiring a full
  cold-start ramp (20%/fire partial retarget vs old algo's 50–67% slash)
- Asymmetric cost: loosening fires 3x faster than tightening, because
  loosening is free but tightening rejects in-flight shares

Breaking: adds private fields to VardiffState (previously all-pub).
Requires channels_sv2 major version bump. Public constructor API
(new, new_with_min) and Vardiff trait interface are unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gimballock gimballock force-pushed the feat/vardiff-ewma-adaptive branch from 9308b4d to c6d1d7d Compare June 10, 2026 21:05
@vnprc

vnprc commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

We discussed this PR at @TriangleBitDevs yesterday and I think we came away with some valuable feedback. Shoutout @gimballock for presenting your work!

Both boundaries apply asymmetric cost: 3x more evidence required to tighten difficulty than to loosen, reflecting that tightening rejects in-flight shares while loosening is free.

I don't believe the asymmetric boundary cost is a useful optimization. This seems to stem from a misunderstanding of how difficulty adjustments interact with existing jobs.

Difficulty updates do not invalidate existing jobs. Those jobs remain valid until an on_set_new_prev_hash message is received, as indicated by this code comment. In other words, the old difficulty target remains valid until a new block is found.

The lines of code that prove it:

  1. Target snapshotted at job creation. self.target is captured into the map per job ID: standard extended
  2. Share validated against the per-job snapshot, not current channel target. job_target from the map is what the share hash is checked against: standard extended
  3. update_channel (vardiff path) only mutates self.target, never touches job_id_to_target, so existing jobs retain their original target: standard extended
  4. The map is wiped on new block: standard extended

I think we also exposed a similar misunderstanding of how shares rejected due to low difficulty impact profitability of a miner. They don't.

Any shares rejected due to low difficulty do not represent lost value in aggregate because shares found at or above the new difficulty threshold are worth proportionally more. In other words, the shares that pass the new difficulty threshold make up for the lost value of any shares that do not pass the new threshold. Rejected shares are a usability problem because they seem to indicate an error to the human monitoring these metrics. Assumptions that these shares lead to lost value for the miner arise from misunderstanding the nuances of share value calculation.

@gimballock

Copy link
Copy Markdown
Author

reflecting that tightening rejects in-flight shares while loosening is free.

You're right that the pool's job_id_to_target snapshot design prevents difficulty adjustments from rejecting in-flight shares — the "rejecting in-flight shares" framing in the docs is incorrect and we'll fix it. Good catch, and thanks for inviting me to speak at your BitDevs.

The asymmetric boundaries aren't justified by that rationale though — they're justified by our simulation framework results. We swept tighten_multiplier over [1.0, 1.25, 1.5, 2.0, 2.5, 3.0] across 1000 Monte Carlo trials per cell and found the multiplier 3.0 produced the highest fitness score.

What the asymmetry actually does: it suppresses false tightenings from lucky streaks, reducing steady-state jitter and preventing large upward difficulty jumps. The cost is slower convergence when tightening is genuinely needed (e.g. hardware upgrade).

One caveat we're transparent about: our fitness metric weights stability heavily (jitter + step_magnitude_safety = 50%) over reactivity (25%) and convergence (15%). Under more balanced weights (30/30/30/10), the convergence penalty gets amplified 2x while the stability gains get discounted — a lower multiplier like 1.5–2.0 would likely rank higher, producing faster adaptation to genuine hashrate increases at the cost of more false fires and higher jitter on transient spikes.

So the 3.0 constant is optimal given our prioritization, not in any absolute sense - it optimizes for "never surprise the miner" over "track hashrate changes quickly." We're open to discussing whether the weighting reflects the right tradeoffs for the upstream project, and whether a lower multiplier (with its faster convergence but higher churn) better serves the broader community of pool operators.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants