fix: block pool race#1417
Conversation
📝 WalkthroughWalkthroughAdds an InTreePendingValidation block-status, updates BlockPool to treat in-tree-but-unvalidated blocks as already handled, extends unit and integration tests (including a rewritten VDF stalled-progress integration test), and updates api-client proptest artifacts and assertions. ChangesInTreePendingValidation handling and test updates
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/p2p/src/block_pool.rs`:
- Around line 1240-1283: The wait_for_parent_validation loop can block
indefinitely; update the async fn wait_for_parent_validation to use a timeout
when awaiting rx.recv() (e.g., via tokio::time::timeout) and decide a bounded
policy: on timeout call return ParentValidationOutcome::Invalid (conservative)
or log a warning and continue the loop (optimistic) — ensure you reference
service_senders.subscribe_block_state_updates() and
block_status_provider.block_status(parent_height, &parent_hash) behavior so you
only apply the timeout to the recv await, handle timeout results alongside
broadcast::error::RecvError::Lagged/Closed, and keep the function's return type
ParentValidationOutcome consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 01a403ea-8874-4a98-9d36-92ec916ed235
📒 Files selected for processing (5)
crates/chain-tests/Cargo.tomlcrates/chain-tests/src/multi_node/vdf_validation_progress.rscrates/p2p/src/block_pool.rscrates/p2p/src/block_status_provider.rscrates/p2p/src/tests/block_pool/mod.rs
323fbab to
bab11b3
Compare
|
Benchmark results: https://irys-xyz.github.io/irys/dev/bench/fix%2Fblock-pool-race/index.html |
…-pull The old `block_status_provider` collapsed every in-tree block into `ProcessedButCanBeReorganized`, so `block_pool::process_block` could not tell "parent in tree, validation still pending" from "parent in tree, validated". Descendants of an unvalidated parent fell through to `block_discovery.handle_block`, which added them to the tree, scheduled their validation, and left the parent-wait to `validation_service`'s `wait_for_parent_validation`. If the parent then failed validation it was removed from the tree, and every descendant already added cascade-failed with `Parent block not found` → `Cancelled` → invalid → also removed. On the testnet `GogWkmVot…` divergence this produced 628 orphan events / 354 redundant validation cycles for a single not-yet-validated parent. Surface the parent's `ChainState` through `BlockStatus`: - `BlockStatus::InTreePendingValidation` for `ChainState::NotOnchain` /`Validated` paired with `Unknown` or `ValidationScheduled`. - `is_in_tree()` predicate covering `InTreePendingValidation | ProcessedButCanBeReorganized | Finalized`. `is_processed()` stays strict (only the validated/finalized/pruned-fork set). - `block_status()` reads `BlockTree::get_block_and_status` and maps `ChainState` into the new variants. In `block_pool::process_block`: - Gate the orphan re-pull branch on `!is_in_tree()` instead of `!is_processed()`. The orphan branch now only fires when the parent is genuinely missing. - When the parent is `InTreePendingValidation`, hold the child via a new `wait_for_parent_validation` helper. The helper subscribes to `service_senders.block_state_events` *before* re-reading parent state (so a transition between the initial check and the subscribe is not lost), tolerates `Lagged` events by re-reading from the tree, and treats `Closed` as `Invalid`. - On `Valid` the child resumes the normal `process_block` path. On `Invalid` the child is removed from `blocks_cache` with a new `FailureReason::ParentValidationFailed` and the sync state records the failure. The descendant never reaches the tree, so a failed parent does not cascade through its children. - `check_block_status` adds an `InTreePendingValidation` arm returning `Advisory::AlreadyProcessed` (gossip handlers must not re-enter `process_block` for an already-in-tree block). - `is_block_processing_or_processed` includes `is_in_tree()` so gossip handlers also skip blocks that are in the tree but still pending. Inline unit tests in `block_status_provider` cover the three status transitions (in-tree pending → validated → missing).
Four functional tests against the real `BlockPool` (with a mocked block_discovery / mempool / service_senders) plus two helpers: - `process_block_does_not_re_pull_parent_in_tree_pending_validation`: adds a parent to the mock tree (default state = InTreePendingValidation), spawns `process_block(child)`, asserts no orphan-cascade message (`RequestBlockFromTheNetwork` / `AttemptReprocessingBlock`) reaches the sync channel during the wait and that the child has not fallen through to block_discovery. Then fires a `BlockStateUpdated` event promoting the parent to `Validated(ValidBlock)`, verifies the spawned future resolves with `Processed`, and that the child finally reaches block_discovery. - `process_block_drops_child_when_pending_parent_validation_fails`: removes the parent from the tree mid-flight and broadcasts a discarded `BlockStateUpdated`, asserts `process_block` returns `BlockError(... failed validation ...)`, child is not in block_discovery, child is not in the block_pool cache, and no orphan-cascade message was emitted. - `process_block_rejects_block_already_in_tree_pending_validation`: exercises `check_block_status`'s new arm — a process_block call for a block already in the tree as InTreePendingValidation returns `Advisory::AlreadyProcessed` and does not re-enter block_discovery. - `is_block_processing_or_processed_true_for_in_tree_pending_validation`: exercises the gossip-handler gate — before adding to the tree the predicate is false; after, it is true. `build_test_pool` collapses the 8-line per-test setup, and `is_orphan_cascade_message` is the narrow predicate the wait-path tests use against the sync channel (allowing the expected `BlockProcessedByThePool` notification to pass through).
…ction `heavy_test_vdf_progress_check_fails_stalled_peer` was added in e9bca30 as `#[ignore]`d documentation, pending a way to deliver a block to the peer whose parent is in the tree but whose VDF steps were never fast-forwarded into the peer's `vdf_state`. The original "gossip the head only" setup cannot exercise the progress check on its own: when the head's parent is `NotProcessed`, `block_pool` runs the orphan-fetch cascade, each fetched ancestor's `ensure_vdf_is_valid` calls `fast_forward_validated_steps`, and by the time the head reaches VDF validation the gap has been bridged. Rework the test to use direct tree injection: - Enable `irys-domain`'s `test-utils` feature in `chain-tests/Cargo.toml` so `BlockTreeReadGuard::write()` is available outside `#[cfg(test)]`. - Mine the head's parent privately on genesis, then drop it straight into peer's `block_tree` via `add_block` + `mark_block_as_validation_scheduled` + `mark_block_as_valid`. Snapshots are inherited from the grandparent (peer's height-2 tip) — valid for a no-commitment, non-epoch block. Crucially, this skips block_discovery / block_tree_service / validation_service entirely for the parent, so its VDF steps are never fast-forwarded into peer's `vdf_state`. - Mine the head privately, then deliver it via `send_full_block`. That call bypasses `block_pool` and the orphan cascade; block_discovery finds the parent already in the tree, schedules `ValidateBlock` for the head, and `ensure_vdf_is_valid` enters Stage A (`wait_for_step(prev_output_step)`). Peer cannot reach that step (mining is stopped, no fast-forward source for the intermediate steps), and after `progress_timeout_secs` the wait bails with `WaitForStepError::Stalled`. Update the assertion to match the current consensus-safety contract: a `Stalled` wait must panic per the never-mislabel rule (see `active_validations.rs:150` and `design/docs/vdf-validation-stall-detection.md`). The cleanest in-test signal is the validation_service task's `mpsc` receiver being dropped when it unwinds — poll `service_senders.validation_service.is_closed()`. The test also subscribes to `block_state_events` as a regression detector: a `Valid` event for the head would mean fast-forward bridged the gap somehow, and is treated as a loud test failure. As defense-in-depth, the test also asserts peer's `global_step` did not advance during the wait. Stable: 5/5 passes in ~9s each.
7a36fdd to
db79f11
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
crates/p2p/src/block_pool.rs (1)
1262-1305:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBound the parent-validation wait.
This loop still has no timeout. If the parent stays stuck in
InTreePendingValidation, every descendant remainsis_processing = trueforever and accumulates in the pool behind it.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/p2p/src/block_pool.rs` around lines 1262 - 1305, The wait_for_parent_validation loop in async fn wait_for_parent_validation can block forever when a parent remains InTreePendingValidation; wrap the receive/loop in a bounded wait using tokio::time::timeout (or equivalent) with a configured duration constant (e.g. MAX_PARENT_VALIDATION_WAIT) and on timeout log a warning including parent_hash and return ParentValidationOutcome::Invalid (or a new Timeout variant if you prefer); ensure you still subscribe via service_senders.subscribe_block_state_updates() and re-check block_status(parent_height, &parent_hash) after each successful recv, and treat timeout like a failure to avoid descendants being stuck in is_processing forever.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/p2p/src/block_pool.rs`:
- Around line 128-136: ParentValidationOutcome currently conflates
peer-rejection and local failures; split it into two variants (e.g.,
InvalidRejected and InvalidLocal) in the ParentValidationOutcome enum and adjust
the wait/handler logic that matches on ParentValidationOutcome (the match site
that currently converts every Invalid into a `"parent ... failed validation"`
BlockError) so that only InvalidRejected is mapped to a BlockError and penalizes
the peer, while InvalidLocal (e.g., event channel closed, shutdown, cache/prune)
is handled locally without attributing a BlockError to the peer. Update all
match arms/usages of ParentValidationOutcome (the places around the existing
Invalid -> BlockError conversion and the other handling sites referenced in the
review) to branch on the new variants accordingly.
In `@crates/p2p/src/tests/block_pool/mod.rs`:
- Around line 979-980: Replace the plain #[tokio::test] on async tests with the
tracing-enabled attribute #[test_log::test(tokio::test)]; specifically change
the test declaration for
process_block_does_not_re_pull_parent_in_tree_pending_validation to use
#[test_log::test(tokio::test)] instead of #[tokio::test], and make the same
replacement for the other async tests in this file that currently use
#[tokio::test] so they retain the tracing harness during execution.
---
Duplicate comments:
In `@crates/p2p/src/block_pool.rs`:
- Around line 1262-1305: The wait_for_parent_validation loop in async fn
wait_for_parent_validation can block forever when a parent remains
InTreePendingValidation; wrap the receive/loop in a bounded wait using
tokio::time::timeout (or equivalent) with a configured duration constant (e.g.
MAX_PARENT_VALIDATION_WAIT) and on timeout log a warning including parent_hash
and return ParentValidationOutcome::Invalid (or a new Timeout variant if you
prefer); ensure you still subscribe via
service_senders.subscribe_block_state_updates() and re-check
block_status(parent_height, &parent_hash) after each successful recv, and treat
timeout like a failure to avoid descendants being stuck in is_processing
forever.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 81dce36f-0b60-452c-b880-66b64ae49904
📒 Files selected for processing (7)
crates/api-client/proptest-regressions/lib.txtcrates/api-client/src/lib.rscrates/chain-tests/Cargo.tomlcrates/chain-tests/src/multi_node/vdf_validation_progress.rscrates/p2p/src/block_pool.rscrates/p2p/src/block_status_provider.rscrates/p2p/src/tests/block_pool/mod.rs
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/p2p/src/block_pool.rs`:
- Line 1327: Replace the unchecked runtime cast for waited_ms (currently using
started.elapsed().as_millis() as u64) with a checked conversion: call
u64::try_from(started.elapsed().as_millis()) (or
started.elapsed().as_millis().try_into()) and handle the Result to avoid
lossy/undefined behavior—e.g., on Err either cap to u64::MAX or return/log an
error—so replace the direct `as u64` on the started.elapsed().as_millis()
expression with a safe try-into conversion and explicit error/overflow handling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4a3e0b05-9069-47e1-9cac-215f5fd52dec
📒 Files selected for processing (2)
crates/p2p/src/block_pool.rscrates/p2p/src/metrics.rs
… path Fills two of the four gaps called out in review.md P2: - Two unit tests for `block_status`: locally-produced blocks inserted as `Validated(Unknown)` and `Validated(ValidationScheduled)` must map to `InTreePendingValidation`, the new branch added in b0b8db8. - One integration test that deterministically drives `broadcast::RecvError::Lagged` inside `wait_for_parent_validation` by burst-sending 201 events synchronously into the 100-cap channel before promoting the parent to valid — verifies the wait loop re-reads tree state on lag and still exits cleanly. Adds `BlockStatusProvider::add_block_mock_with_state` test helper that wraps `BlockTree::add_common` so tests can plant arbitrary `ChainState`. Two gaps deferred: `RecvError::Closed` (Sender lives inside Arc held by the running BlockPool; not testable as black-box without DI refactor) and the subscribe-vs-initial-check race window (µs-scale; safety is structural ordering in source).
Replaces "Fix 2" / "commit b0b8db8" / numbered-issue references with self-contained summaries describing what the code actually does. The external post-mortem doc those numbers were keyed to lives outside this repo, so the citations would rot. Comment-only; no behavior change.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
crates/p2p/src/block_pool.rs (1)
133-142:⚠️ Potential issue | 🟠 MajorStill applies: split local wait failures from parent-rejection outcomes.
Line 1313 still turns
RecvError::ClosedintoParentValidationOutcome::Invalid, and Lines 879-885 still translate everyInvalidinto aBlockErrorthat says the parent failed validation. That continues to misattribute local shutdown/validation-service death to the peer that sent the child.Also applies to: 866-885, 1313-1318
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/p2p/src/block_pool.rs` around lines 133 - 142, The code currently maps RecvError::Closed to ParentValidationOutcome::Invalid and then treats every Invalid as a parent-validation failure when creating a BlockError, which misattributes local shutdowns to the peer; change ParentValidationOutcome (add a new variant like WaitFailed or ChannelClosed) so RecvError::Closed maps to that new variant instead of Invalid, and update all match sites that convert ParentValidationOutcome::Invalid into a BlockError (the mapping around BlockError creation) to only produce the “parent failed validation” BlockError for the genuine Invalid variant while producing a different handling/error (or silently drop) for the new WaitFailed/ChannelClosed variant so local validation-service shutdowns aren’t blamed on the sender. Ensure you update all uses of ParentValidationOutcome (including InTreePendingValidation wait logic and the block-rejection path) to handle the new variant.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/chain-tests/src/multi_node/vdf_validation_progress.rs`:
- Around line 249-301: The loop currently treats any closure of
validation_service as a success even if we never observed a head-specific event;
add a guard to ensure the service-close path only counts after we've seen an
event for head_hash. Introduce a boolean (e.g., saw_head_update) initialized
false, set it true whenever an event with event.block_hash == head_hash is
processed (before handling ValidationResult cases), and change the closure check
to: if peer.node_ctx.service_senders.validation_service.is_closed() &&
saw_head_update { peer_validation_dead = true; break; } so unrelated prior
panics won't make the test pass.
---
Duplicate comments:
In `@crates/p2p/src/block_pool.rs`:
- Around line 133-142: The code currently maps RecvError::Closed to
ParentValidationOutcome::Invalid and then treats every Invalid as a
parent-validation failure when creating a BlockError, which misattributes local
shutdowns to the peer; change ParentValidationOutcome (add a new variant like
WaitFailed or ChannelClosed) so RecvError::Closed maps to that new variant
instead of Invalid, and update all match sites that convert
ParentValidationOutcome::Invalid into a BlockError (the mapping around
BlockError creation) to only produce the “parent failed validation” BlockError
for the genuine Invalid variant while producing a different handling/error (or
silently drop) for the new WaitFailed/ChannelClosed variant so local
validation-service shutdowns aren’t blamed on the sender. Ensure you update all
uses of ParentValidationOutcome (including InTreePendingValidation wait logic
and the block-rejection path) to handle the new variant.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: b80f2ee7-f968-45bc-a929-ed7604e09562
📒 Files selected for processing (5)
crates/chain-tests/src/multi_node/vdf_validation_progress.rscrates/p2p/src/block_pool.rscrates/p2p/src/block_status_provider.rscrates/p2p/src/metrics.rscrates/p2p/src/tests/block_pool/mod.rs
|
Caution Review failedAn error occurred during the review process. Please try again later. Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an InTreePendingValidation BlockStatus, refactors block-status mapping, makes BlockPool wait for parent validation (dropping children on invalid outcome), records a parent-wait metric, adds unit and integration tests, and enables ChangesInTreePendingValidation Orphan-Cascade Prevention
Sequence Diagram(s)sequenceDiagram
participant ComponentA
participant ComponentB
ComponentA->>ComponentB: observable interaction
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
♻️ Duplicate comments (4)
crates/chain-tests/src/multi_node/vdf_validation_progress.rs (1)
255-294:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRequire a head-specific signal before treating service shutdown as success.
is_closed()on Line 291 can still make this test pass after an unrelated priorvalidation_servicefailure, even ifhead_hashnever entered the path under test. Track whether you've observed a state update forhead_hashfirst, and only accept the closed-service branch after that.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/chain-tests/src/multi_node/vdf_validation_progress.rs` around lines 255 - 294, The test currently treats validation_service.is_closed() as success even if we've never observed any event for the target head_hash; add a boolean flag (e.g., seen_head_event) initialized false before the loop, set it true when you handle an event with event.block_hash == head_hash (the branch that logs and inspects event.validation_result from event_rx), and change the closed-service check (peer.node_ctx.service_senders.validation_service.is_closed()) to only set peer_validation_dead and break if seen_head_event is true; keep all other logic around event_rx, head_hash, peer_validation_dead, and ValidationResult unchanged.crates/p2p/src/tests/block_pool/mod.rs (1)
982-983: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winSwitch these async tests to
test_log.These new tests drive tracing-heavy block-pool paths, so plain
#[tokio::test]drops the harness this repo expects for debugging flakes. Use#[test_log::test(tokio::test)]on all five.As per coding guidelines,
**/*.rs: Use#[test_log::test(tokio::test)]for async tests with tracing output.Also applies to: 1075-1076, 1162-1163, 1204-1205, 1298-1299
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/p2p/src/tests/block_pool/mod.rs` around lines 982 - 983, Replace the plain async test attribute with the tracing-enabled harness: change #[tokio::test] to #[test_log::test(tokio::test)] on the async test function process_block_does_not_re_pull_parent_in_tree_pending_validation and the four other async tests referenced (the ones at the other noted locations). Ensure each test function retains its async signature but uses #[test_log::test(tokio::test)] so tracing is captured by the repository's test harness.crates/p2p/src/block_pool.rs (2)
1324-1328: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winUse a checked conversion for
waited_ms.
started.elapsed().as_millis()is a runtimeu128; theas u64cast on Line 1327 bypasses the codebase's checked-conversion rule for production Rust. Useu64::try_from(...)and clamp or handle overflow explicitly.Based on learnings, in this codebase’s production Rust code, do not use
asnumeric casts for runtime numeric conversion; prefertry_from/try_intowith explicit overflow handling.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/p2p/src/block_pool.rs` around lines 1324 - 1328, The warn! call that logs waited_ms currently uses a direct cast from started.elapsed().as_millis() as u64 which bypasses checked conversions; replace that with a fallible conversion such as u64::try_from(started.elapsed().as_millis()) and handle overflow explicitly (e.g., clamp to u64::MAX or map Err to a sentinel value) before passing waited_ms into the warn! macro, keeping the same metadata keys (parent.hash / parent.height / waited_ms) so the log call in the BlockPool code uses a checked conversion instead of an `as` cast.
136-141:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSplit parent rejection from local wait failures.
ParentValidationOutcome::Invalidcurrently bundles actual parent rejection together with local failure modes likeblock_state_eventsclosing. Lines 866-885 then convert every one of those into"parent ... failed validation"BlockError, which misattributes local shutdown/cache lifecycle failures to the peer.Also applies to: 866-885, 1313-1318
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/p2p/src/block_pool.rs` around lines 136 - 141, The ParentValidationOutcome enum conflates peer-side rejections and local failures—split Invalid into two variants (e.g., Rejected and LocalFailure) on ParentValidationOutcome, update the code paths that produce the outcome (including where block_state_events closing is detected) to emit the correct variant, and then change the downstream handling that currently converts every Invalid into a "parent ... failed validation" BlockError so that Rejected maps to the existing BlockError::ParentFailedValidation (or equivalent) while LocalFailure is either propagated as an internal/shutdown/cache lifecycle error or handled separately (do this for both places noted where Invalid is converted into the parent-failed BlockError).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@crates/chain-tests/src/multi_node/vdf_validation_progress.rs`:
- Around line 255-294: The test currently treats validation_service.is_closed()
as success even if we've never observed any event for the target head_hash; add
a boolean flag (e.g., seen_head_event) initialized false before the loop, set it
true when you handle an event with event.block_hash == head_hash (the branch
that logs and inspects event.validation_result from event_rx), and change the
closed-service check
(peer.node_ctx.service_senders.validation_service.is_closed()) to only set
peer_validation_dead and break if seen_head_event is true; keep all other logic
around event_rx, head_hash, peer_validation_dead, and ValidationResult
unchanged.
In `@crates/p2p/src/block_pool.rs`:
- Around line 1324-1328: The warn! call that logs waited_ms currently uses a
direct cast from started.elapsed().as_millis() as u64 which bypasses checked
conversions; replace that with a fallible conversion such as
u64::try_from(started.elapsed().as_millis()) and handle overflow explicitly
(e.g., clamp to u64::MAX or map Err to a sentinel value) before passing
waited_ms into the warn! macro, keeping the same metadata keys (parent.hash /
parent.height / waited_ms) so the log call in the BlockPool code uses a checked
conversion instead of an `as` cast.
- Around line 136-141: The ParentValidationOutcome enum conflates peer-side
rejections and local failures—split Invalid into two variants (e.g., Rejected
and LocalFailure) on ParentValidationOutcome, update the code paths that produce
the outcome (including where block_state_events closing is detected) to emit the
correct variant, and then change the downstream handling that currently converts
every Invalid into a "parent ... failed validation" BlockError so that Rejected
maps to the existing BlockError::ParentFailedValidation (or equivalent) while
LocalFailure is either propagated as an internal/shutdown/cache lifecycle error
or handled separately (do this for both places noted where Invalid is converted
into the parent-failed BlockError).
In `@crates/p2p/src/tests/block_pool/mod.rs`:
- Around line 982-983: Replace the plain async test attribute with the
tracing-enabled harness: change #[tokio::test] to #[test_log::test(tokio::test)]
on the async test function
process_block_does_not_re_pull_parent_in_tree_pending_validation and the four
other async tests referenced (the ones at the other noted locations). Ensure
each test function retains its async signature but uses
#[test_log::test(tokio::test)] so tracing is captured by the repository's test
harness.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 94305360-2df4-42bc-9ceb-5094bf4dda66
📒 Files selected for processing (8)
crates/api-client/proptest-regressions/lib.txtcrates/api-client/src/lib.rscrates/chain-tests/Cargo.tomlcrates/chain-tests/src/multi_node/vdf_validation_progress.rscrates/p2p/src/block_pool.rscrates/p2p/src/block_status_provider.rscrates/p2p/src/metrics.rscrates/p2p/src/tests/block_pool/mod.rs
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
…kStateUpdated
The prior `saw_head_update` gate required a `BlockStateUpdated{block_hash:
head_hash}` event before accepting `validation_service.is_closed()` as
success. That event is never emitted in the Stalled-panic path this test
exercises: `BlockStateUpdated` only fires from `on_block_validation_finished`,
but the VDF-wait panic in `active_validations.rs` re-unwinds through the
validation_service select loop via `std::panic::resume_unwind` (per the
"never mislabel" rule), so validation never "finishes". The gate therefore
sat unsatisfied until the 23s deadline and the test asserted failure.
Replace with a `block_tree.get_block(&head_hash).is_some()` check — proves
prevalidation reached peer's tree (i.e. the head was scheduled into
validation_service) before declaring its closure attributable to our
scenario. Addresses the reviewer's underlying concern (rule out unrelated
panics) using a signal that actually exists in this flow.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
crates/p2p/src/tests/block_pool/mod.rs (1)
972-972:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse
test_logattributes on these async tests.
#[tokio::test]on Line 972, Line 1016, and Line 1054 should be#[test_log::test(tokio::test)]so tracing output is preserved for these block-pool flows.Proposed change
-#[tokio::test] +#[test_log::test(tokio::test)] async fn process_block_does_not_re_pull_parent_in_tree_pending_validation() { -#[tokio::test] +#[test_log::test(tokio::test)] async fn process_block_rejects_block_already_in_tree_pending_validation() { -#[tokio::test] +#[test_log::test(tokio::test)] async fn is_block_processing_or_processed_true_for_in_tree_pending_validation() {As per coding guidelines,
**/*.rs: Use#[test_log::test(tokio::test)]for async tests with tracing output.Also applies to: 1016-1016, 1054-1054
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/p2p/src/tests/block_pool/mod.rs` at line 972, Replace the plain #[tokio::test] attributes on the async test functions in this file with #[test_log::test(tokio::test)] so tracing output is preserved; update the three async tests currently annotated with #[tokio::test] (the block-pool test functions around the occurrences at lines noted) to use test_log::test(tokio::test) instead, keeping the rest of each test function (name and body) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@crates/p2p/src/tests/block_pool/mod.rs`:
- Line 972: Replace the plain #[tokio::test] attributes on the async test
functions in this file with #[test_log::test(tokio::test)] so tracing output is
preserved; update the three async tests currently annotated with #[tokio::test]
(the block-pool test functions around the occurrences at lines noted) to use
test_log::test(tokio::test) instead, keeping the rest of each test function
(name and body) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 722eacde-59ef-4965-a4d8-01561dc1950e
📒 Files selected for processing (7)
crates/api-client/proptest-regressions/lib.txtcrates/api-client/src/lib.rscrates/chain-tests/Cargo.tomlcrates/chain-tests/src/multi_node/vdf_validation_progress.rscrates/p2p/src/block_pool.rscrates/p2p/src/block_status_provider.rscrates/p2p/src/tests/block_pool/mod.rs
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
crates/p2p/src/tests/block_pool/mod.rs (1)
972-973: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winUse
#[test_log::test(tokio::test)]on the new async tests.These tests exercise tracing-heavy block-pool paths, and plain
#[tokio::test]drops the harness this repo expects for async test diagnostics.Suggested change
-#[tokio::test] +#[test_log::test(tokio::test)] async fn process_block_does_not_re_pull_parent_in_tree_pending_validation() { -#[tokio::test] +#[test_log::test(tokio::test)] async fn process_block_rejects_block_already_in_tree_pending_validation() { -#[tokio::test] +#[test_log::test(tokio::test)] async fn is_block_processing_or_processed_true_for_in_tree_pending_validation() {As per coding guidelines,
Use #[test_log::test(tokio::test)] for async tests with tracing output.Also applies to: 1016-1017, 1054-1055
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/p2p/src/tests/block_pool/mod.rs` around lines 972 - 973, Replace plain #[tokio::test] on async tests with the tracing-enabled wrapper #[test_log::test(tokio::test)] so the test harness preserves tracing output; specifically update the test attribute on process_block_does_not_re_pull_parent_in_tree_pending_validation (and the other async tests referenced around the same file ranges) by changing their attribute lines from #[tokio::test] to #[test_log::test(tokio::test)].crates/chain-tests/src/multi_node/vdf_validation_progress.rs (1)
224-228:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard the closed-service success path against a pre-existing dead validator.
head_reached_tree && head_still_pendingonly proves prevalidation inserted the head. Ifvalidation_servicehad already panicked beforesend_full_block, the head can still land in the tree, sit pending forSHORT_PROGRESS_TIMEOUT_SECS, and this branch will pass without ever exercising the stalled wait path.Suggested change
+ let validation_service_was_open_at_delivery = + !peer.node_ctx.service_senders.validation_service.is_closed(); + assert!( + validation_service_was_open_at_delivery, + "validation_service unexpectedly closed before head delivery", + ); genesis .send_full_block(&peer, &head_header, head_payload, head_txs) .await?; @@ - if peer.node_ctx.service_senders.validation_service.is_closed() + if validation_service_was_open_at_delivery + && peer.node_ctx.service_senders.validation_service.is_closed() && head_reached_tree && stall_window_elapsed && head_still_pending {Also applies to: 331-335
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/chain-tests/src/multi_node/vdf_validation_progress.rs` around lines 224 - 228, The success branch that currently checks only head_reached_tree && head_still_pending can false-positive if validation_service already panicked before send_full_block; update the guard to also confirm the validation service is still running (e.g., check the validation_service/JoinHandle state or a shared alive flag) before taking the closed-service success path. Concretely, extend the condition to require validation_service_is_alive (or !validation_service_handle.is_finished()/has_panicked) alongside head_reached_tree and head_still_pending, and apply the same additional check at the other occurrence around lines 331-335 so a pre-existing dead validator cannot short-circuit the stalled-wait path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/p2p/src/block_status_provider.rs`:
- Around line 91-99: The block_status() logic currently returns index-only
statuses (Finalized / PartOfAPrunedFork) as soon as the index contains the given
height, which prevents checking the in-tree state via
get_block_and_status(block_hash); update block_status() to first call
get_block_and_status(block_hash) and, if that returns a present tree entry, use
its status (InTreePendingValidation / ProcessedButCanBeReorganized) before
falling back to index-only checks; apply the same ordering fix to the other
similar branch around the 119-138 region so ChainState entries for a hash at an
indexed height are not ignored.
---
Duplicate comments:
In `@crates/chain-tests/src/multi_node/vdf_validation_progress.rs`:
- Around line 224-228: The success branch that currently checks only
head_reached_tree && head_still_pending can false-positive if validation_service
already panicked before send_full_block; update the guard to also confirm the
validation service is still running (e.g., check the
validation_service/JoinHandle state or a shared alive flag) before taking the
closed-service success path. Concretely, extend the condition to require
validation_service_is_alive (or
!validation_service_handle.is_finished()/has_panicked) alongside
head_reached_tree and head_still_pending, and apply the same additional check at
the other occurrence around lines 331-335 so a pre-existing dead validator
cannot short-circuit the stalled-wait path.
In `@crates/p2p/src/tests/block_pool/mod.rs`:
- Around line 972-973: Replace plain #[tokio::test] on async tests with the
tracing-enabled wrapper #[test_log::test(tokio::test)] so the test harness
preserves tracing output; specifically update the test attribute on
process_block_does_not_re_pull_parent_in_tree_pending_validation (and the other
async tests referenced around the same file ranges) by changing their attribute
lines from #[tokio::test] to #[test_log::test(tokio::test)].
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 415f8483-b112-4d2c-84b2-65e794db1213
📒 Files selected for processing (7)
crates/api-client/proptest-regressions/lib.txtcrates/api-client/src/lib.rscrates/chain-tests/Cargo.tomlcrates/chain-tests/src/multi_node/vdf_validation_progress.rscrates/p2p/src/block_pool.rscrates/p2p/src/block_status_provider.rscrates/p2p/src/tests/block_pool/mod.rs
Single conflict in block_status_provider.rs: - Kept our doc comment for PartOfAPrunedFork (describes deep reorg semantics) - Took master's granular ChainState match for non-indexed blocks (block pool race fix #1417) - Inlined is_block_in_the_tree() call for deep reorg check (replaces removed local variable) Merges: MDBX metrics, VDF progress/stall detection, preval perf, efficient-sampling, block pool race fix, Axiom removal, body-pull circuit-breaker fix, metrics followup.
Describe the changes
Adds a new BlockStatus::InTreePendingValidation variant, which allows the node to distinguish between blocks in the tree (ProcessedButCanBeReorganized) and those in the tree, but that haven't been validated yet (the new InTreePendingValidation) - this allows us to prevent a second-order orphan child "storm" if hte parent fails validation and/or is evicted from the block tree.
Related Issue(s)
Fixes the orphan child block re-process storm when the parent is evicted from the tree.
Checklist
Summary by CodeRabbit
Bug Fixes
Tests
Chores