fix(p2p): prevent underflow panic in gossip rate limiter timestamp math#1445
Merged
Merged
Conversation
The DataRequestTracker stamps per-peer timestamps from the non-monotonic
wall clock (SystemTime) and computed elapsed times by unsigned
subtraction. When a stored timestamp lands after the captured "now" —
either a concurrent insert during DashMap iteration in cleanup, or a
backward wall-clock step — the subtraction underflows. With
overflow-checks = true in the release profile this is a hard panic,
observed killing mainnet-node-1 on 2026-06-10 via POST /gossip/pull_data.
Use saturating_sub at all four subtraction sites so an out-of-order
timestamp reads as elapsed 0 ("just happened"), which is the safe and
intended behavior at each call site. Adds regression tests that panic
on the unfixed code.
Contributor
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughRate-limiting code in ChangesClock-skew defensive timing
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Contributor
JesseTheRobot
added a commit
that referenced
this pull request
Jun 11, 2026
…th (#1445) The DataRequestTracker stamps per-peer timestamps from the non-monotonic wall clock (SystemTime) and computed elapsed times by unsigned subtraction. When a stored timestamp lands after the captured "now" — either a concurrent insert during DashMap iteration in cleanup, or a backward wall-clock step — the subtraction underflows. With overflow-checks = true in the release profile this is a hard panic, observed killing mainnet-node-1 on 2026-06-10 via POST /gossip/pull_data. Use saturating_sub at all four subtraction sites so an out-of-order timestamp reads as elapsed 0 ("just happened"), which is the safe and intended behavior at each call site. Adds regression tests that panic on the unfixed code.
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What happened
mainnet-node-1 (genesis) panicked on 2026-06-10 11:34:30 UTC (running
release/mainnet/3.0.2 @ e30f336af) and was restarted by pm2:Reached from inbound gossip traffic:
POST /gossip/pull_data→handle_get_data_sync→DataRequestTracker::check_request→cleanup_expired_entries.Root cause
The tracker stamps per-peer timestamps (
window_start,last_request,last_cleanup) from the non-monotonic wall clock (SystemTime::now()as epoch millis) and computes elapsed time by unsigned subtraction (now - stored). Because the workspace setsoverflow-checks = truein[profile.release], an underflow is a hard panic in every shipped profile.A stored timestamp can exceed the captured
nowvia either mechanism:cleanup_expired_entriescapturesnow, then iterates theDashMap(not a snapshot). A concurrentcheck_requestinserts a fresh record stamped a moment after the capturednow; the iterator visits it andnow - window_startunderflows. Rare per-event, inevitable over time on a busy node.Fix
saturating_subat all four subtraction sites (is_window_expired,is_duplicate_request,cleanup_expired_entries,cleanup_if_needed). An out-of-order timestamp now reads as elapsed0— "just happened / not yet aged / not expired" — which is the safe, intended behavior at each call site. No rate-limiting thresholds, windows, or signatures change.Regression tests were written first and confirmed to panic with
attempt to subtract with overflowat the exact production lines on the unfixed code; they pass after the fix.Note: this is a surgical hotfix — the proper fix is PR #1402
#1402 (
fix: clock drift/monotonic timers) is the proper fix: it rewrites this tracker to use the monotonicstd::time::Instant(whose arithmetic saturates by design), eliminating the wall-clock dependency entirely. That PR is open againstmasterand should be revived and landed there. This PR is the minimal, low-risk hotfix appropriate for the release line so deployed nodes stop panicking now; #1402 supersedes it onmasterwhen it lands.Per the shared-code hotfix process, this fix originates on
release/3.xand must be cherry-picked up tomasterafter landing (or arrive there via #1402's rewrite).Verification
cargo test -p irys-p2p --lib rate_limiting— 9/9 pass on this branch (new tests red before the fix)cargo build -p irys-p2p --profile debug-release— the overflow-checks profile builds cleancargo fmt/cargo clippy -p irys-p2p --tests --all-targets— clean🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests