feat(vardiff): replace threshold-ladder with adaptive EWMA algorithm by gimballock · Pull Request #2188 · stratum-mining/stratum

gimballock · 2026-06-09T15:21:07Z

Summary

Replace the threshold-ladder vardiff algorithm with an adaptive composition that addresses long-standing issues with difficulty adjustment quality.

This is the clean, production-ready extraction of the best-performing algorithm from the simulation framework in #2154. All test scaffolding, alternative implementations, and trait abstractions from that research branch have been removed — this PR is a single flat struct replacement of VardiffState.

Motivation

The existing vardiff was observed producing excessive jitter at moderate share rates and sluggish reaction to genuine hashrate changes at high share rates. @adammwest suggested Poisson confidence interval bounds as a statistically-grounded alternative to the fixed threshold ladder.

Investigation with the simulation framework (#2154) revealed that these issues can't be fixed by tuning the existing algorithm's parameters — they're limitations of its design:

Reactivity degrades with share rate. After the algorithm converges, it keeps averaging over an ever-growing window. When a miner's hashrate drops, the old stable data dilutes the new signal. At 60+ shares/min, only 9–16% of genuine -50% hashrate drops are detected within 5 minutes.
Jitter and reactivity are coupled. The fixed 15% threshold (used when difficulty hasn't changed for at least 5 minutes) is simultaneously too loose at high share rates (noise floor is only 3–9%, so the algorithm fires on random variance) and too tight at low share rates (genuine changes produce deviations below 15% because there just aren't enough shares to measure precisely).
Full retarget overshoots on noisy estimates. When the algorithm does fire, it jumps the full distance to its estimate - which is based on whatever shares happened to arrive since the last retarget, however noisy that sample is. During cold-start or right after a hashrate change, when it's always noisy, the new difficulty overshoots the true value — Can be as bad as ~30% overshoot during cold-start.
Tightening and loosening have asymmetric costs. Increasing difficulty rejects in-flight shares (the miner already did work that no longer meets the threshold). Decreasing difficulty is free (old harder work is still valid). The old algorithm treats both directions identically.

The algorithm

The new vardiff decomposes the difficulty adjustment decision into three sequential stages:

Estimator — "What is happening?" Converts raw share arrivals into a smoothed hashrate belief.
Boundary — "Should I act?" Decides whether the deviation from target is a real hashrate change or just random noise.
Update — "How much should I move?" Computes the new difficulty when the boundary says to fire.

After a fire, the estimator is notified (on_fire) so it can adjust its internal state to account for the new difficulty level — this is how the algorithm preserves information across retargets rather than starting from scratch.

Stage 1: EWMA Estimator

Instead of a raw cumulative average, we use an exponentially-weighted moving average with a 120-second time constant. Recent observations count more than old ones, so when hashrate genuinely changes, the old stable-period data fades out naturally rather than diluting the signal indefinitely. On fire, the EWMA rescales its internal rate by the difficulty ratio (rather than resetting to zero), preserving the smoothing history.

Stage 2: Adaptive Boundary

The decision threshold adapts based on the miner's configured share rate:

Below 10 shares/min: PoissonCI — wide statistical confidence interval that prevents premature fires when there aren't enough shares per tick for precise measurement.
At 10+ shares/min: AsymmetricCUSUM — tighter sequential-testing boundary that accumulates evidence across ticks, enabling faster reaction when share data is abundant.
Both boundaries apply asymmetric cost: 3x more evidence required to tighten difficulty than to loosen, reflecting that tightening rejects in-flight shares while loosening is free.

Stage 3: Accelerating Partial Retarget

Rather than jumping the full distance to the estimate, each fire moves only 20% toward it. If the algorithm keeps firing in the same direction, the step size ramps to 40% —converging faster when the signal is clearly real.

Results (1000 trials/cell, deterministic seeding)

Metric	Old (SPM 6)	New (SPM 6)	New (SPM 30)
Cold-start overshoot (p99)	28%	0%	0%
Jitter (mean fires/min)	0.033	0.003	0.002
Convergence time (p90)	9m	3m	6m
Detect -50% drop	87%	87%	100%
Detect -10% drop	29%	33%	50%
Detect +50% increase	10%	85%	79%
Transient disconnect recovery	Full cold-start ramp	1–2 fires	1–2 fires

Breaking changes

Adds private fields to VardiffState (previously all-pub struct). Requires channels_sv2 major version bump.
shares_since_last_update field semantics changed from "shares since last fire" to "shares since last evaluation tick" (the EWMA consumes and zeroes the counter each tick).

Public constructor API (new, new_with_min) and Vardiff trait interface are unchanged.

Test plan

cargo test -p channels_sv2 — 14 vardiff property tests
cargo test --verbose — full workspace passes
cargo fmt --check — clean
Deployed to testnet4 with live miners via sv2-apps

Clean extraction of the best-performing vardiff algorithm from the simulation framework in stratum-mining#2154, with all test scaffolding, traits, and alternative algorithm implementations removed. The previous VardiffState used a fixed time-dependent threshold ladder and full retarget. This produced: - 6.6% median settled error (p99: 30% at low SPM) - 5–9 minute cold-start convergence (p90) - 33% detection rate for 10% hashrate declines (thermal throttle, failing ASICs) - 28% target overshoot during cold-start ramp (p99 at SPM 6) The new algorithm (EWMA + adaptive boundary + accelerating partial retarget): - Settled accuracy: <3% median error across all SPM - Cold-start overshoot bounded to <10% (was 28%) - Jitter: 0.03 fires/min at low SPM (was 0.06) — half the unnecessary retargets - Small-change detection: 85% reaction to -10% steps at SPM 6 (was 33%) - Transient disconnects recover in 1–2 fires rather than requiring a full cold-start ramp (20%/fire partial retarget vs old algo's 50–67% slash) - Asymmetric cost: loosening fires 3x faster than tightening, because loosening is free but tightening rejects in-flight shares Breaking: adds private fields to VardiffState (previously all-pub). Requires channels_sv2 major version bump. Public constructor API (new, new_with_min) and Vardiff trait interface are unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vnprc · 2026-06-11T20:54:30Z

We discussed this PR at @TriangleBitDevs yesterday and I think we came away with some valuable feedback. Shoutout @gimballock for presenting your work!

Both boundaries apply asymmetric cost: 3x more evidence required to tighten difficulty than to loosen, reflecting that tightening rejects in-flight shares while loosening is free.

I don't believe the asymmetric boundary cost is a useful optimization. This seems to stem from a misunderstanding of how difficulty adjustments interact with existing jobs.

Difficulty updates do not invalidate existing jobs. Those jobs remain valid until an on_set_new_prev_hash message is received, as indicated by this code comment. In other words, the old difficulty target remains valid until a new block is found.

The lines of code that prove it:

Target snapshotted at job creation. self.target is captured into the map per job ID: standard extended
Share validated against the per-job snapshot, not current channel target. job_target from the map is what the share hash is checked against: standard extended
update_channel (vardiff path) only mutates self.target, never touches job_id_to_target, so existing jobs retain their original target: standard extended
The map is wiped on new block: standard extended

I think we also exposed a similar misunderstanding of how shares rejected due to low difficulty impact profitability of a miner. They don't.

Any shares rejected due to low difficulty do not represent lost value in aggregate because shares found at or above the new difficulty threshold are worth proportionally more. In other words, the shares that pass the new difficulty threshold make up for the lost value of any shares that do not pass the new threshold. Rejected shares are a usability problem because they seem to indicate an error to the human monitoring these metrics. Assumptions that these shares lead to lost value for the miner arise from misunderstanding the nuances of share value calculation.

gimballock · 2026-06-12T13:50:16Z

reflecting that tightening rejects in-flight shares while loosening is free.

You're right that the pool's job_id_to_target snapshot design prevents difficulty adjustments from rejecting in-flight shares — the "rejecting in-flight shares" framing in the docs is incorrect and we'll fix it. Good catch, and thanks for inviting me to speak at your BitDevs.

The asymmetric boundaries aren't justified by that rationale though — they're justified by our simulation framework results. We swept tighten_multiplier over [1.0, 1.25, 1.5, 2.0, 2.5, 3.0] across 1000 Monte Carlo trials per cell and found the multiplier 3.0 produced the highest fitness score.

What the asymmetry actually does: it suppresses false tightenings from lucky streaks, reducing steady-state jitter and preventing large upward difficulty jumps. The cost is slower convergence when tightening is genuinely needed (e.g. hardware upgrade).

One caveat we're transparent about: our fitness metric weights stability heavily (jitter + step_magnitude_safety = 50%) over reactivity (25%) and convergence (15%). Under more balanced weights (30/30/30/10), the convergence penalty gets amplified 2x while the stability gains get discounted — a lower multiplier like 1.5–2.0 would likely rank higher, producing faster adaptation to genuine hashrate increases at the cost of more false fires and higher jitter on transient spikes.

So the 3.0 constant is optimal given our prioritization, not in any absolute sense - it optimizes for "never surprise the miner" over "track hashrate changes quickly." We're open to discussing whether the weighting reflects the right tradeoffs for the upstream project, and whether a lower multiplier (with its faster convergence but higher churn) better serves the broader community of pool operators.

gimballock force-pushed the feat/vardiff-ewma-adaptive branch from a1871c7 to 9308b4d Compare June 10, 2026 17:11

gimballock force-pushed the feat/vardiff-ewma-adaptive branch from 9308b4d to c6d1d7d Compare June 10, 2026 21:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(vardiff): replace threshold-ladder with adaptive EWMA algorithm#2188

feat(vardiff): replace threshold-ladder with adaptive EWMA algorithm#2188
gimballock wants to merge 1 commit into
stratum-mining:mainfrom
marafoundation:feat/vardiff-ewma-adaptive

gimballock commented Jun 9, 2026 •

edited

Loading

Uh oh!

vnprc commented Jun 11, 2026

Uh oh!

gimballock commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gimballock commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

The algorithm

Stage 1: EWMA Estimator

Stage 2: Adaptive Boundary

Stage 3: Accelerating Partial Retarget

Results (1000 trials/cell, deterministic seeding)

Breaking changes

Test plan

Uh oh!

vnprc commented Jun 11, 2026

Uh oh!

gimballock commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gimballock commented Jun 9, 2026 •

edited

Loading