Skip to content

fix: block pool race#1417

Merged
JesseTheRobot merged 11 commits into
masterfrom
fix/block-pool-race
May 14, 2026
Merged

fix: block pool race#1417
JesseTheRobot merged 11 commits into
masterfrom
fix/block-pool-race

Conversation

@JesseTheRobot

@JesseTheRobot JesseTheRobot commented May 13, 2026

Copy link
Copy Markdown
Member

Describe the changes
Adds a new BlockStatus::InTreePendingValidation variant, which allows the node to distinguish between blocks in the tree (ProcessedButCanBeReorganized) and those in the tree, but that haven't been validated yet (the new InTreePendingValidation) - this allows us to prevent a second-order orphan child "storm" if hte parent fails validation and/or is evicted from the block tree.

Related Issue(s)
Fixes the orphan child block re-process storm when the parent is evicted from the tree.

Checklist

  • Tests have been added/updated for the changes.
  • Documentation has been updated for the changes (if applicable).
  • The code follows Rust's style guidelines.

Summary by CodeRabbit

  • Bug Fixes

    • Prevent redundant re-processing and gossip for blocks that are present in-memory but still awaiting validation; refine block-status handling.
  • Tests

    • Reworked and made deterministic a VDF validation test for stalled-peer scenarios.
    • Added tests covering in-tree pending-validation semantics and block-pool behavior.
    • Improved a proptest to assert effective URL port handling.
  • Chores

    • Enabled additional test utilities for integration tests and added a proptest regression seed file.

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Adds an InTreePendingValidation block-status, updates BlockPool to treat in-tree-but-unvalidated blocks as already handled, extends unit and integration tests (including a rewritten VDF stalled-progress integration test), and updates api-client proptest artifacts and assertions.

Changes

InTreePendingValidation handling and test updates

Layer / File(s) Summary
BlockStatus classification
crates/p2p/src/block_status_provider.rs
Adds the InTreePendingValidation variant, is_in_tree(), refactors block_status to use tree lookup and ChainState/BlockState matching, adds add_block_mock_with_state test helper, and extends unit tests.
BlockPool handling
crates/p2p/src/block_pool.rs
Updates process_block routing to use is_in_tree(), treats InTreePendingValidation as AlreadyProcessed in check_block_status, and expands the processed predicate to include in-tree statuses.
BlockPool tests
crates/p2p/src/tests/block_pool/mod.rs
Adds build_test_pool helper and three async tests verifying orphan re-pull avoidance, AlreadyProcessed rejection, and that in-tree statuses are treated as processed.
VDF stalled-progress integration test
crates/chain-tests/src/multi_node/vdf_validation_progress.rs, crates/chain-tests/Cargo.toml
Re-enables and rewrites the stalled-peer VDF test to freeze peer progress, inject a parent into the peer's tree, deliver a head via genesis.send_full_block, use a deadline-bounded polling loop for VDF failure detection, assert no local VDF step change, and enable irys-domain test-utils.
api-client proptest updates
crates/api-client/src/lib.rs, crates/api-client/proptest-regressions/lib.txt
Updates URL port assertion to use port_or_known_default() and adds a proptest regression seed file.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Irys-xyz/irys#1413: Shares the stalled VDF test path and related failure expectations in crates/chain-tests/src/multi_node/vdf_validation_progress.rs.

Suggested reviewers

  • antouhou
  • DanMacDonald
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: block pool race' directly relates to the main change: preventing orphan re-process storms in the block pool by adding InTreePendingValidation status.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/block-pool-race

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/p2p/src/block_pool.rs`:
- Around line 1240-1283: The wait_for_parent_validation loop can block
indefinitely; update the async fn wait_for_parent_validation to use a timeout
when awaiting rx.recv() (e.g., via tokio::time::timeout) and decide a bounded
policy: on timeout call return ParentValidationOutcome::Invalid (conservative)
or log a warning and continue the loop (optimistic) — ensure you reference
service_senders.subscribe_block_state_updates() and
block_status_provider.block_status(parent_height, &parent_hash) behavior so you
only apply the timeout to the recv await, handle timeout results alongside
broadcast::error::RecvError::Lagged/Closed, and keep the function's return type
ParentValidationOutcome consistent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 01a403ea-8874-4a98-9d36-92ec916ed235

📥 Commits

Reviewing files that changed from the base of the PR and between 39a69b1 and 323fbab.

📒 Files selected for processing (5)
  • crates/chain-tests/Cargo.toml
  • crates/chain-tests/src/multi_node/vdf_validation_progress.rs
  • crates/p2p/src/block_pool.rs
  • crates/p2p/src/block_status_provider.rs
  • crates/p2p/src/tests/block_pool/mod.rs

Comment thread crates/p2p/src/block_pool.rs Outdated
@JesseTheRobot JesseTheRobot force-pushed the fix/block-pool-race branch from 323fbab to bab11b3 Compare May 13, 2026 13:56
@github-actions

Copy link
Copy Markdown
Contributor

…-pull

The old `block_status_provider` collapsed every in-tree block into
`ProcessedButCanBeReorganized`, so `block_pool::process_block` could
not tell "parent in tree, validation still pending" from "parent in
tree, validated". Descendants of an unvalidated parent fell through to
`block_discovery.handle_block`, which added them to the tree, scheduled
their validation, and left the parent-wait to `validation_service`'s
`wait_for_parent_validation`. If the parent then failed validation it
was removed from the tree, and every descendant already added
cascade-failed with `Parent block not found` → `Cancelled` → invalid →
also removed. On the testnet `GogWkmVot…` divergence this produced
628 orphan events / 354 redundant validation cycles for a single
not-yet-validated parent.

Surface the parent's `ChainState` through `BlockStatus`:

- `BlockStatus::InTreePendingValidation` for `ChainState::NotOnchain`
  /`Validated` paired with `Unknown` or `ValidationScheduled`.
- `is_in_tree()` predicate covering `InTreePendingValidation |
  ProcessedButCanBeReorganized | Finalized`. `is_processed()` stays
  strict (only the validated/finalized/pruned-fork set).
- `block_status()` reads `BlockTree::get_block_and_status` and maps
  `ChainState` into the new variants.

In `block_pool::process_block`:

- Gate the orphan re-pull branch on `!is_in_tree()` instead of
  `!is_processed()`. The orphan branch now only fires when the parent
  is genuinely missing.
- When the parent is `InTreePendingValidation`, hold the child via a
  new `wait_for_parent_validation` helper. The helper subscribes to
  `service_senders.block_state_events` *before* re-reading parent
  state (so a transition between the initial check and the subscribe
  is not lost), tolerates `Lagged` events by re-reading from the
  tree, and treats `Closed` as `Invalid`.
- On `Valid` the child resumes the normal `process_block` path. On
  `Invalid` the child is removed from `blocks_cache` with a new
  `FailureReason::ParentValidationFailed` and the sync state records
  the failure. The descendant never reaches the tree, so a failed
  parent does not cascade through its children.
- `check_block_status` adds an `InTreePendingValidation` arm returning
  `Advisory::AlreadyProcessed` (gossip handlers must not re-enter
  `process_block` for an already-in-tree block).
- `is_block_processing_or_processed` includes `is_in_tree()` so gossip
  handlers also skip blocks that are in the tree but still pending.

Inline unit tests in `block_status_provider` cover the three status
transitions (in-tree pending → validated → missing).
Four functional tests against the real `BlockPool` (with a mocked
block_discovery / mempool / service_senders) plus two helpers:

- `process_block_does_not_re_pull_parent_in_tree_pending_validation`:
  adds a parent to the mock tree (default state = InTreePendingValidation),
  spawns `process_block(child)`, asserts no orphan-cascade message
  (`RequestBlockFromTheNetwork` / `AttemptReprocessingBlock`) reaches
  the sync channel during the wait and that the child has not fallen
  through to block_discovery. Then fires a `BlockStateUpdated` event
  promoting the parent to `Validated(ValidBlock)`, verifies the
  spawned future resolves with `Processed`, and that the child finally
  reaches block_discovery.
- `process_block_drops_child_when_pending_parent_validation_fails`:
  removes the parent from the tree mid-flight and broadcasts a
  discarded `BlockStateUpdated`, asserts `process_block` returns
  `BlockError(... failed validation ...)`, child is not in
  block_discovery, child is not in the block_pool cache, and no
  orphan-cascade message was emitted.
- `process_block_rejects_block_already_in_tree_pending_validation`:
  exercises `check_block_status`'s new arm — a process_block call for
  a block already in the tree as InTreePendingValidation returns
  `Advisory::AlreadyProcessed` and does not re-enter block_discovery.
- `is_block_processing_or_processed_true_for_in_tree_pending_validation`:
  exercises the gossip-handler gate — before adding to the tree the
  predicate is false; after, it is true.

`build_test_pool` collapses the 8-line per-test setup, and
`is_orphan_cascade_message` is the narrow predicate the wait-path
tests use against the sync channel (allowing the expected
`BlockProcessedByThePool` notification to pass through).
…ction

`heavy_test_vdf_progress_check_fails_stalled_peer` was added in
e9bca30 as `#[ignore]`d documentation, pending a way to deliver a
block to the peer whose parent is in the tree but whose VDF steps
were never fast-forwarded into the peer's `vdf_state`. The original
"gossip the head only" setup cannot exercise the progress check on
its own: when the head's parent is `NotProcessed`, `block_pool` runs
the orphan-fetch cascade, each fetched ancestor's
`ensure_vdf_is_valid` calls `fast_forward_validated_steps`, and by the
time the head reaches VDF validation the gap has been bridged.

Rework the test to use direct tree injection:

- Enable `irys-domain`'s `test-utils` feature in `chain-tests/Cargo.toml`
  so `BlockTreeReadGuard::write()` is available outside `#[cfg(test)]`.
- Mine the head's parent privately on genesis, then drop it straight
  into peer's `block_tree` via `add_block` + `mark_block_as_validation_scheduled`
  + `mark_block_as_valid`. Snapshots are inherited from the grandparent
  (peer's height-2 tip) — valid for a no-commitment, non-epoch block.
  Crucially, this skips block_discovery / block_tree_service /
  validation_service entirely for the parent, so its VDF steps are
  never fast-forwarded into peer's `vdf_state`.
- Mine the head privately, then deliver it via `send_full_block`. That
  call bypasses `block_pool` and the orphan cascade; block_discovery
  finds the parent already in the tree, schedules `ValidateBlock` for
  the head, and `ensure_vdf_is_valid` enters Stage A
  (`wait_for_step(prev_output_step)`). Peer cannot reach that step
  (mining is stopped, no fast-forward source for the intermediate
  steps), and after `progress_timeout_secs` the wait bails with
  `WaitForStepError::Stalled`.

Update the assertion to match the current consensus-safety contract:
a `Stalled` wait must panic per the never-mislabel rule (see
`active_validations.rs:150` and `design/docs/vdf-validation-stall-detection.md`).
The cleanest in-test signal is the validation_service task's `mpsc`
receiver being dropped when it unwinds — poll
`service_senders.validation_service.is_closed()`. The test also
subscribes to `block_state_events` as a regression detector: a `Valid`
event for the head would mean fast-forward bridged the gap somehow,
and is treated as a loud test failure. As defense-in-depth, the test
also asserts peer's `global_step` did not advance during the wait.

Stable: 5/5 passes in ~9s each.
@JesseTheRobot JesseTheRobot force-pushed the fix/block-pool-race branch from 7a36fdd to db79f11 Compare May 14, 2026 08:31

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
crates/p2p/src/block_pool.rs (1)

1262-1305: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bound the parent-validation wait.

This loop still has no timeout. If the parent stays stuck in InTreePendingValidation, every descendant remains is_processing = true forever and accumulates in the pool behind it.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/p2p/src/block_pool.rs` around lines 1262 - 1305, The
wait_for_parent_validation loop in async fn wait_for_parent_validation can block
forever when a parent remains InTreePendingValidation; wrap the receive/loop in
a bounded wait using tokio::time::timeout (or equivalent) with a configured
duration constant (e.g. MAX_PARENT_VALIDATION_WAIT) and on timeout log a warning
including parent_hash and return ParentValidationOutcome::Invalid (or a new
Timeout variant if you prefer); ensure you still subscribe via
service_senders.subscribe_block_state_updates() and re-check
block_status(parent_height, &parent_hash) after each successful recv, and treat
timeout like a failure to avoid descendants being stuck in is_processing
forever.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/p2p/src/block_pool.rs`:
- Around line 128-136: ParentValidationOutcome currently conflates
peer-rejection and local failures; split it into two variants (e.g.,
InvalidRejected and InvalidLocal) in the ParentValidationOutcome enum and adjust
the wait/handler logic that matches on ParentValidationOutcome (the match site
that currently converts every Invalid into a `"parent ... failed validation"`
BlockError) so that only InvalidRejected is mapped to a BlockError and penalizes
the peer, while InvalidLocal (e.g., event channel closed, shutdown, cache/prune)
is handled locally without attributing a BlockError to the peer. Update all
match arms/usages of ParentValidationOutcome (the places around the existing
Invalid -> BlockError conversion and the other handling sites referenced in the
review) to branch on the new variants accordingly.

In `@crates/p2p/src/tests/block_pool/mod.rs`:
- Around line 979-980: Replace the plain #[tokio::test] on async tests with the
tracing-enabled attribute #[test_log::test(tokio::test)]; specifically change
the test declaration for
process_block_does_not_re_pull_parent_in_tree_pending_validation to use
#[test_log::test(tokio::test)] instead of #[tokio::test], and make the same
replacement for the other async tests in this file that currently use
#[tokio::test] so they retain the tracing harness during execution.

---

Duplicate comments:
In `@crates/p2p/src/block_pool.rs`:
- Around line 1262-1305: The wait_for_parent_validation loop in async fn
wait_for_parent_validation can block forever when a parent remains
InTreePendingValidation; wrap the receive/loop in a bounded wait using
tokio::time::timeout (or equivalent) with a configured duration constant (e.g.
MAX_PARENT_VALIDATION_WAIT) and on timeout log a warning including parent_hash
and return ParentValidationOutcome::Invalid (or a new Timeout variant if you
prefer); ensure you still subscribe via
service_senders.subscribe_block_state_updates() and re-check
block_status(parent_height, &parent_hash) after each successful recv, and treat
timeout like a failure to avoid descendants being stuck in is_processing
forever.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 81dce36f-0b60-452c-b880-66b64ae49904

📥 Commits

Reviewing files that changed from the base of the PR and between 7a36fdd and db79f11.

📒 Files selected for processing (7)
  • crates/api-client/proptest-regressions/lib.txt
  • crates/api-client/src/lib.rs
  • crates/chain-tests/Cargo.toml
  • crates/chain-tests/src/multi_node/vdf_validation_progress.rs
  • crates/p2p/src/block_pool.rs
  • crates/p2p/src/block_status_provider.rs
  • crates/p2p/src/tests/block_pool/mod.rs

Comment thread crates/p2p/src/block_pool.rs Outdated
Comment thread crates/p2p/src/tests/block_pool/mod.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/p2p/src/block_pool.rs`:
- Line 1327: Replace the unchecked runtime cast for waited_ms (currently using
started.elapsed().as_millis() as u64) with a checked conversion: call
u64::try_from(started.elapsed().as_millis()) (or
started.elapsed().as_millis().try_into()) and handle the Result to avoid
lossy/undefined behavior—e.g., on Err either cap to u64::MAX or return/log an
error—so replace the direct `as u64` on the started.elapsed().as_millis()
expression with a safe try-into conversion and explicit error/overflow handling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4a3e0b05-9069-47e1-9cac-215f5fd52dec

📥 Commits

Reviewing files that changed from the base of the PR and between db79f11 and 62a4cf3.

📒 Files selected for processing (2)
  • crates/p2p/src/block_pool.rs
  • crates/p2p/src/metrics.rs

Comment thread crates/p2p/src/block_pool.rs Outdated
… path

Fills two of the four gaps called out in review.md P2:

- Two unit tests for `block_status`: locally-produced blocks inserted as
  `Validated(Unknown)` and `Validated(ValidationScheduled)` must map to
  `InTreePendingValidation`, the new branch added in b0b8db8.
- One integration test that deterministically drives
  `broadcast::RecvError::Lagged` inside `wait_for_parent_validation` by
  burst-sending 201 events synchronously into the 100-cap channel before
  promoting the parent to valid — verifies the wait loop re-reads tree
  state on lag and still exits cleanly.

Adds `BlockStatusProvider::add_block_mock_with_state` test helper that
wraps `BlockTree::add_common` so tests can plant arbitrary `ChainState`.

Two gaps deferred: `RecvError::Closed` (Sender lives inside Arc held by
the running BlockPool; not testable as black-box without DI refactor)
and the subscribe-vs-initial-check race window (µs-scale; safety is
structural ordering in source).
Replaces "Fix 2" / "commit b0b8db8" / numbered-issue references with
self-contained summaries describing what the code actually does. The
external post-mortem doc those numbers were keyed to lives outside this
repo, so the citations would rot. Comment-only; no behavior change.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
crates/p2p/src/block_pool.rs (1)

133-142: ⚠️ Potential issue | 🟠 Major

Still applies: split local wait failures from parent-rejection outcomes.

Line 1313 still turns RecvError::Closed into ParentValidationOutcome::Invalid, and Lines 879-885 still translate every Invalid into a BlockError that says the parent failed validation. That continues to misattribute local shutdown/validation-service death to the peer that sent the child.

Also applies to: 866-885, 1313-1318

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/p2p/src/block_pool.rs` around lines 133 - 142, The code currently maps
RecvError::Closed to ParentValidationOutcome::Invalid and then treats every
Invalid as a parent-validation failure when creating a BlockError, which
misattributes local shutdowns to the peer; change ParentValidationOutcome (add a
new variant like WaitFailed or ChannelClosed) so RecvError::Closed maps to that
new variant instead of Invalid, and update all match sites that convert
ParentValidationOutcome::Invalid into a BlockError (the mapping around
BlockError creation) to only produce the “parent failed validation” BlockError
for the genuine Invalid variant while producing a different handling/error (or
silently drop) for the new WaitFailed/ChannelClosed variant so local
validation-service shutdowns aren’t blamed on the sender. Ensure you update all
uses of ParentValidationOutcome (including InTreePendingValidation wait logic
and the block-rejection path) to handle the new variant.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/chain-tests/src/multi_node/vdf_validation_progress.rs`:
- Around line 249-301: The loop currently treats any closure of
validation_service as a success even if we never observed a head-specific event;
add a guard to ensure the service-close path only counts after we've seen an
event for head_hash. Introduce a boolean (e.g., saw_head_update) initialized
false, set it true whenever an event with event.block_hash == head_hash is
processed (before handling ValidationResult cases), and change the closure check
to: if peer.node_ctx.service_senders.validation_service.is_closed() &&
saw_head_update { peer_validation_dead = true; break; } so unrelated prior
panics won't make the test pass.

---

Duplicate comments:
In `@crates/p2p/src/block_pool.rs`:
- Around line 133-142: The code currently maps RecvError::Closed to
ParentValidationOutcome::Invalid and then treats every Invalid as a
parent-validation failure when creating a BlockError, which misattributes local
shutdowns to the peer; change ParentValidationOutcome (add a new variant like
WaitFailed or ChannelClosed) so RecvError::Closed maps to that new variant
instead of Invalid, and update all match sites that convert
ParentValidationOutcome::Invalid into a BlockError (the mapping around
BlockError creation) to only produce the “parent failed validation” BlockError
for the genuine Invalid variant while producing a different handling/error (or
silently drop) for the new WaitFailed/ChannelClosed variant so local
validation-service shutdowns aren’t blamed on the sender. Ensure you update all
uses of ParentValidationOutcome (including InTreePendingValidation wait logic
and the block-rejection path) to handle the new variant.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b80f2ee7-f968-45bc-a929-ed7604e09562

📥 Commits

Reviewing files that changed from the base of the PR and between db79f11 and 7a2ec4e.

📒 Files selected for processing (5)
  • crates/chain-tests/src/multi_node/vdf_validation_progress.rs
  • crates/p2p/src/block_pool.rs
  • crates/p2p/src/block_status_provider.rs
  • crates/p2p/src/metrics.rs
  • crates/p2p/src/tests/block_pool/mod.rs

Comment thread crates/chain-tests/src/multi_node/vdf_validation_progress.rs
@coderabbitai

coderabbitai Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

Caution

Review failed

An error occurred during the review process. Please try again later.

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • ✅ Review completed - (🔄 Check again to review again)
📝 Walkthrough

Walkthrough

Adds an InTreePendingValidation BlockStatus, refactors block-status mapping, makes BlockPool wait for parent validation (dropping children on invalid outcome), records a parent-wait metric, adds unit and integration tests, and enables test-utils for chain-tests.

Changes

InTreePendingValidation Orphan-Cascade Prevention

Layer / File(s) Summary
BlockStatus state detection and InTreePendingValidation
crates/p2p/src/block_status_provider.rs
BlockStatus now includes InTreePendingValidation; adds is_in_tree(); block_status refactored to use get_block_and_status + ChainState matching to distinguish in-tree-but-unvalidated blocks from processed/reorg-eligible blocks; unit tests added.
Block pool parent-validation waiting and child processing
crates/p2p/src/block_pool.rs, crates/p2p/src/metrics.rs
Adds ParentValidationOutcome, wait_for_parent_validation (subscribes to block_state_events and re-checks parent status), handles InTreePendingValidation in process_block by waiting for parent and dropping child with FailureReason::ParentValidationFailed on invalid outcome; updates gating (is_block_processing_or_processed, check_block_status), warning/log intervals, and records PARENT_WAIT_DURATION_MS.
Block pool InTreePendingValidation test suite
crates/p2p/src/tests/block_pool/mod.rs
Adds build_test_pool and is_orphan_cascade_message helpers and async tests verifying: no orphan-cascade re-pull while parent pending, child removal on parent validation failure, AlreadyProcessed advisory for in-tree pending blocks, and is_block_processing_or_processed true for in-tree pending blocks.
VDF validation progress integration test rewrite
crates/chain-tests/Cargo.toml, crates/chain-tests/src/multi_node/vdf_validation_progress.rs
Re-enables and rewrites the integration test to freeze peer VDF progress, inject a validated parent into the peer's block_tree, deliver the head via send_full_block, poll for stalled validation (service closure or Invalid outcome), assert no local VDF step advancement, and enable irys-domain test-utils feature.
Misc: api-client test and proptest seed
crates/api-client/src/lib.rs, crates/api-client/proptest-regressions/lib.txt
Use Url::port_or_known_default() in a proptest; add a proptest regression seed file with one saved seed.

Sequence Diagram(s)

sequenceDiagram
  participant ComponentA
  participant ComponentB
  ComponentA->>ComponentB: observable interaction
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Irys-xyz/irys#1413: Touches the same VDF stall-detection area and similarly adjusts validation-service stall handling; related to the integration test update.

Suggested reviewers

  • DanMacDonald
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'fix: block pool race' is highly generic and vague, lacking specificity about what race condition is being addressed or how it's fixed. Provide a more descriptive title that clarifies the specific race condition being fixed, such as 'fix: prevent orphan-cascade race in block pool parent validation' or similar.
✅ Passed checks (4 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/block-pool-race

Comment @coderabbitai help to get the list of available commands and usage tips.

@JesseTheRobot

Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (4)
crates/chain-tests/src/multi_node/vdf_validation_progress.rs (1)

255-294: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require a head-specific signal before treating service shutdown as success.

is_closed() on Line 291 can still make this test pass after an unrelated prior validation_service failure, even if head_hash never entered the path under test. Track whether you've observed a state update for head_hash first, and only accept the closed-service branch after that.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/chain-tests/src/multi_node/vdf_validation_progress.rs` around lines
255 - 294, The test currently treats validation_service.is_closed() as success
even if we've never observed any event for the target head_hash; add a boolean
flag (e.g., seen_head_event) initialized false before the loop, set it true when
you handle an event with event.block_hash == head_hash (the branch that logs and
inspects event.validation_result from event_rx), and change the closed-service
check (peer.node_ctx.service_senders.validation_service.is_closed()) to only set
peer_validation_dead and break if seen_head_event is true; keep all other logic
around event_rx, head_hash, peer_validation_dead, and ValidationResult
unchanged.
crates/p2p/src/tests/block_pool/mod.rs (1)

982-983: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Switch these async tests to test_log.

These new tests drive tracing-heavy block-pool paths, so plain #[tokio::test] drops the harness this repo expects for debugging flakes. Use #[test_log::test(tokio::test)] on all five.

As per coding guidelines, **/*.rs: Use #[test_log::test(tokio::test)] for async tests with tracing output.

Also applies to: 1075-1076, 1162-1163, 1204-1205, 1298-1299

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/p2p/src/tests/block_pool/mod.rs` around lines 982 - 983, Replace the
plain async test attribute with the tracing-enabled harness: change
#[tokio::test] to #[test_log::test(tokio::test)] on the async test function
process_block_does_not_re_pull_parent_in_tree_pending_validation and the four
other async tests referenced (the ones at the other noted locations). Ensure
each test function retains its async signature but uses
#[test_log::test(tokio::test)] so tracing is captured by the repository's test
harness.
crates/p2p/src/block_pool.rs (2)

1324-1328: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Use a checked conversion for waited_ms.

started.elapsed().as_millis() is a runtime u128; the as u64 cast on Line 1327 bypasses the codebase's checked-conversion rule for production Rust. Use u64::try_from(...) and clamp or handle overflow explicitly.

Based on learnings, in this codebase’s production Rust code, do not use as numeric casts for runtime numeric conversion; prefer try_from/try_into with explicit overflow handling.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/p2p/src/block_pool.rs` around lines 1324 - 1328, The warn! call that
logs waited_ms currently uses a direct cast from started.elapsed().as_millis()
as u64 which bypasses checked conversions; replace that with a fallible
conversion such as u64::try_from(started.elapsed().as_millis()) and handle
overflow explicitly (e.g., clamp to u64::MAX or map Err to a sentinel value)
before passing waited_ms into the warn! macro, keeping the same metadata keys
(parent.hash / parent.height / waited_ms) so the log call in the BlockPool code
uses a checked conversion instead of an `as` cast.

136-141: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Split parent rejection from local wait failures.

ParentValidationOutcome::Invalid currently bundles actual parent rejection together with local failure modes like block_state_events closing. Lines 866-885 then convert every one of those into "parent ... failed validation" BlockError, which misattributes local shutdown/cache lifecycle failures to the peer.

Also applies to: 866-885, 1313-1318

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/p2p/src/block_pool.rs` around lines 136 - 141, The
ParentValidationOutcome enum conflates peer-side rejections and local
failures—split Invalid into two variants (e.g., Rejected and LocalFailure) on
ParentValidationOutcome, update the code paths that produce the outcome
(including where block_state_events closing is detected) to emit the correct
variant, and then change the downstream handling that currently converts every
Invalid into a "parent ... failed validation" BlockError so that Rejected maps
to the existing BlockError::ParentFailedValidation (or equivalent) while
LocalFailure is either propagated as an internal/shutdown/cache lifecycle error
or handled separately (do this for both places noted where Invalid is converted
into the parent-failed BlockError).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@crates/chain-tests/src/multi_node/vdf_validation_progress.rs`:
- Around line 255-294: The test currently treats validation_service.is_closed()
as success even if we've never observed any event for the target head_hash; add
a boolean flag (e.g., seen_head_event) initialized false before the loop, set it
true when you handle an event with event.block_hash == head_hash (the branch
that logs and inspects event.validation_result from event_rx), and change the
closed-service check
(peer.node_ctx.service_senders.validation_service.is_closed()) to only set
peer_validation_dead and break if seen_head_event is true; keep all other logic
around event_rx, head_hash, peer_validation_dead, and ValidationResult
unchanged.

In `@crates/p2p/src/block_pool.rs`:
- Around line 1324-1328: The warn! call that logs waited_ms currently uses a
direct cast from started.elapsed().as_millis() as u64 which bypasses checked
conversions; replace that with a fallible conversion such as
u64::try_from(started.elapsed().as_millis()) and handle overflow explicitly
(e.g., clamp to u64::MAX or map Err to a sentinel value) before passing
waited_ms into the warn! macro, keeping the same metadata keys (parent.hash /
parent.height / waited_ms) so the log call in the BlockPool code uses a checked
conversion instead of an `as` cast.
- Around line 136-141: The ParentValidationOutcome enum conflates peer-side
rejections and local failures—split Invalid into two variants (e.g., Rejected
and LocalFailure) on ParentValidationOutcome, update the code paths that produce
the outcome (including where block_state_events closing is detected) to emit the
correct variant, and then change the downstream handling that currently converts
every Invalid into a "parent ... failed validation" BlockError so that Rejected
maps to the existing BlockError::ParentFailedValidation (or equivalent) while
LocalFailure is either propagated as an internal/shutdown/cache lifecycle error
or handled separately (do this for both places noted where Invalid is converted
into the parent-failed BlockError).

In `@crates/p2p/src/tests/block_pool/mod.rs`:
- Around line 982-983: Replace the plain async test attribute with the
tracing-enabled harness: change #[tokio::test] to #[test_log::test(tokio::test)]
on the async test function
process_block_does_not_re_pull_parent_in_tree_pending_validation and the four
other async tests referenced (the ones at the other noted locations). Ensure
each test function retains its async signature but uses
#[test_log::test(tokio::test)] so tracing is captured by the repository's test
harness.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 94305360-2df4-42bc-9ceb-5094bf4dda66

📥 Commits

Reviewing files that changed from the base of the PR and between 4a08aa3 and 7a2ec4e.

📒 Files selected for processing (8)
  • crates/api-client/proptest-regressions/lib.txt
  • crates/api-client/src/lib.rs
  • crates/chain-tests/Cargo.toml
  • crates/chain-tests/src/multi_node/vdf_validation_progress.rs
  • crates/p2p/src/block_pool.rs
  • crates/p2p/src/block_status_provider.rs
  • crates/p2p/src/metrics.rs
  • crates/p2p/src/tests/block_pool/mod.rs

@JesseTheRobot

Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Full review triggered.

…kStateUpdated

The prior `saw_head_update` gate required a `BlockStateUpdated{block_hash:
head_hash}` event before accepting `validation_service.is_closed()` as
success. That event is never emitted in the Stalled-panic path this test
exercises: `BlockStateUpdated` only fires from `on_block_validation_finished`,
but the VDF-wait panic in `active_validations.rs` re-unwinds through the
validation_service select loop via `std::panic::resume_unwind` (per the
"never mislabel" rule), so validation never "finishes". The gate therefore
sat unsatisfied until the 23s deadline and the test asserted failure.

Replace with a `block_tree.get_block(&head_hash).is_some()` check — proves
prevalidation reached peer's tree (i.e. the head was scheduled into
validation_service) before declaring its closure attributable to our
scenario. Addresses the reviewer's underlying concern (rule out unrelated
panics) using a signal that actually exists in this flow.
@JesseTheRobot

Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
crates/p2p/src/tests/block_pool/mod.rs (1)

972-972: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use test_log attributes on these async tests.

#[tokio::test] on Line 972, Line 1016, and Line 1054 should be #[test_log::test(tokio::test)] so tracing output is preserved for these block-pool flows.

Proposed change
-#[tokio::test]
+#[test_log::test(tokio::test)]
 async fn process_block_does_not_re_pull_parent_in_tree_pending_validation() {

-#[tokio::test]
+#[test_log::test(tokio::test)]
 async fn process_block_rejects_block_already_in_tree_pending_validation() {

-#[tokio::test]
+#[test_log::test(tokio::test)]
 async fn is_block_processing_or_processed_true_for_in_tree_pending_validation() {

As per coding guidelines, **/*.rs: Use #[test_log::test(tokio::test)] for async tests with tracing output.

Also applies to: 1016-1016, 1054-1054

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/p2p/src/tests/block_pool/mod.rs` at line 972, Replace the plain
#[tokio::test] attributes on the async test functions in this file with
#[test_log::test(tokio::test)] so tracing output is preserved; update the three
async tests currently annotated with #[tokio::test] (the block-pool test
functions around the occurrences at lines noted) to use
test_log::test(tokio::test) instead, keeping the rest of each test function
(name and body) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@crates/p2p/src/tests/block_pool/mod.rs`:
- Line 972: Replace the plain #[tokio::test] attributes on the async test
functions in this file with #[test_log::test(tokio::test)] so tracing output is
preserved; update the three async tests currently annotated with #[tokio::test]
(the block-pool test functions around the occurrences at lines noted) to use
test_log::test(tokio::test) instead, keeping the rest of each test function
(name and body) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 722eacde-59ef-4965-a4d8-01561dc1950e

📥 Commits

Reviewing files that changed from the base of the PR and between 4a08aa3 and 6cfa01f.

📒 Files selected for processing (7)
  • crates/api-client/proptest-regressions/lib.txt
  • crates/api-client/src/lib.rs
  • crates/chain-tests/Cargo.toml
  • crates/chain-tests/src/multi_node/vdf_validation_progress.rs
  • crates/p2p/src/block_pool.rs
  • crates/p2p/src/block_status_provider.rs
  • crates/p2p/src/tests/block_pool/mod.rs

@JesseTheRobot

Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
crates/p2p/src/tests/block_pool/mod.rs (1)

972-973: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Use #[test_log::test(tokio::test)] on the new async tests.

These tests exercise tracing-heavy block-pool paths, and plain #[tokio::test] drops the harness this repo expects for async test diagnostics.

Suggested change
-#[tokio::test]
+#[test_log::test(tokio::test)]
 async fn process_block_does_not_re_pull_parent_in_tree_pending_validation() {

-#[tokio::test]
+#[test_log::test(tokio::test)]
 async fn process_block_rejects_block_already_in_tree_pending_validation() {

-#[tokio::test]
+#[test_log::test(tokio::test)]
 async fn is_block_processing_or_processed_true_for_in_tree_pending_validation() {

As per coding guidelines, Use #[test_log::test(tokio::test)] for async tests with tracing output.

Also applies to: 1016-1017, 1054-1055

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/p2p/src/tests/block_pool/mod.rs` around lines 972 - 973, Replace plain
#[tokio::test] on async tests with the tracing-enabled wrapper
#[test_log::test(tokio::test)] so the test harness preserves tracing output;
specifically update the test attribute on
process_block_does_not_re_pull_parent_in_tree_pending_validation (and the other
async tests referenced around the same file ranges) by changing their attribute
lines from #[tokio::test] to #[test_log::test(tokio::test)].
crates/chain-tests/src/multi_node/vdf_validation_progress.rs (1)

224-228: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard the closed-service success path against a pre-existing dead validator.

head_reached_tree && head_still_pending only proves prevalidation inserted the head. If validation_service had already panicked before send_full_block, the head can still land in the tree, sit pending for SHORT_PROGRESS_TIMEOUT_SECS, and this branch will pass without ever exercising the stalled wait path.

Suggested change
+    let validation_service_was_open_at_delivery =
+        !peer.node_ctx.service_senders.validation_service.is_closed();
+    assert!(
+        validation_service_was_open_at_delivery,
+        "validation_service unexpectedly closed before head delivery",
+    );
     genesis
         .send_full_block(&peer, &head_header, head_payload, head_txs)
         .await?;
@@
-        if peer.node_ctx.service_senders.validation_service.is_closed()
+        if validation_service_was_open_at_delivery
+            && peer.node_ctx.service_senders.validation_service.is_closed()
             && head_reached_tree
             && stall_window_elapsed
             && head_still_pending
         {

Also applies to: 331-335

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/chain-tests/src/multi_node/vdf_validation_progress.rs` around lines
224 - 228, The success branch that currently checks only head_reached_tree &&
head_still_pending can false-positive if validation_service already panicked
before send_full_block; update the guard to also confirm the validation service
is still running (e.g., check the validation_service/JoinHandle state or a
shared alive flag) before taking the closed-service success path. Concretely,
extend the condition to require validation_service_is_alive (or
!validation_service_handle.is_finished()/has_panicked) alongside
head_reached_tree and head_still_pending, and apply the same additional check at
the other occurrence around lines 331-335 so a pre-existing dead validator
cannot short-circuit the stalled-wait path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/p2p/src/block_status_provider.rs`:
- Around line 91-99: The block_status() logic currently returns index-only
statuses (Finalized / PartOfAPrunedFork) as soon as the index contains the given
height, which prevents checking the in-tree state via
get_block_and_status(block_hash); update block_status() to first call
get_block_and_status(block_hash) and, if that returns a present tree entry, use
its status (InTreePendingValidation / ProcessedButCanBeReorganized) before
falling back to index-only checks; apply the same ordering fix to the other
similar branch around the 119-138 region so ChainState entries for a hash at an
indexed height are not ignored.

---

Duplicate comments:
In `@crates/chain-tests/src/multi_node/vdf_validation_progress.rs`:
- Around line 224-228: The success branch that currently checks only
head_reached_tree && head_still_pending can false-positive if validation_service
already panicked before send_full_block; update the guard to also confirm the
validation service is still running (e.g., check the
validation_service/JoinHandle state or a shared alive flag) before taking the
closed-service success path. Concretely, extend the condition to require
validation_service_is_alive (or
!validation_service_handle.is_finished()/has_panicked) alongside
head_reached_tree and head_still_pending, and apply the same additional check at
the other occurrence around lines 331-335 so a pre-existing dead validator
cannot short-circuit the stalled-wait path.

In `@crates/p2p/src/tests/block_pool/mod.rs`:
- Around line 972-973: Replace plain #[tokio::test] on async tests with the
tracing-enabled wrapper #[test_log::test(tokio::test)] so the test harness
preserves tracing output; specifically update the test attribute on
process_block_does_not_re_pull_parent_in_tree_pending_validation (and the other
async tests referenced around the same file ranges) by changing their attribute
lines from #[tokio::test] to #[test_log::test(tokio::test)].
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 415f8483-b112-4d2c-84b2-65e794db1213

📥 Commits

Reviewing files that changed from the base of the PR and between 4a08aa3 and d0e4185.

📒 Files selected for processing (7)
  • crates/api-client/proptest-regressions/lib.txt
  • crates/api-client/src/lib.rs
  • crates/chain-tests/Cargo.toml
  • crates/chain-tests/src/multi_node/vdf_validation_progress.rs
  • crates/p2p/src/block_pool.rs
  • crates/p2p/src/block_status_provider.rs
  • crates/p2p/src/tests/block_pool/mod.rs

Comment thread crates/p2p/src/block_status_provider.rs
@JesseTheRobot JesseTheRobot merged commit 47923f4 into master May 14, 2026
33 checks passed
@JesseTheRobot JesseTheRobot deleted the fix/block-pool-race branch May 14, 2026 14:16
DanMacDonald added a commit that referenced this pull request May 19, 2026
Single conflict in block_status_provider.rs:
- Kept our doc comment for PartOfAPrunedFork (describes deep reorg semantics)
- Took master's granular ChainState match for non-indexed blocks (block pool race fix #1417)
- Inlined is_block_in_the_tree() call for deep reorg check (replaces removed local variable)

Merges: MDBX metrics, VDF progress/stall detection, preval perf,
efficient-sampling, block pool race fix, Axiom removal, body-pull
circuit-breaker fix, metrics followup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant