fix: validation error disambiguation by JesseTheRobot · Pull Request #1416 · Irys-xyz/irys

JesseTheRobot · 2026-05-13T13:26:30Z

Describe the changes
A clear and concise description of what the bug fix, feature, or improvement is.

Related Issue(s)
Please link to the issue(s) that will be closed with this PR.

Checklist

Tests have been added/updated for the changes.
Documentation has been updated for the changes (if applicable).
The code follows Rust's style guidelines.

Additional Context
Add any other context about the pull request here.

Summary by CodeRabbit

Bug Fixes
- Validation routing tightened: unrecoverable internal faults now cause immediate exits; soft-internal and consensus rejects are tracked and rebroadcast to allow recovery.
New Checks
- Typed detection for out-of-range ledger chunks, missing snapshots/payloads, and more explicit cancellation reasons with bounded retry behavior.
Metrics
- Added counters for soft-internal discard/recovery and concurrent-validation cancel/requeue events.
New Features
- Configurable execution-payload wait timeout and automatic reprocessing for parked blocks.
Tests
- Expanded coverage for classification, retry caps, merge priority, and recovery ordering.
Documentation
- Node must be run under an external supervisor to auto-restart on local faults.

coderabbitai · 2026-05-13T13:26:44Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
✅ Review completed - (🔄 Check again to review again)

📝 Walkthrough

Walkthrough

Separate internal/node-fault vs consensus validation outcomes; thread typed cancellations and payload-wait errors through validation tasks and services; centralize discard/broadcast with soft-internal LRU + recovery logging; switch payload-wait to watch-based notifiers; add P2P parked-block dedup/reprocess and many test/config updates.

Changes

Validation core and API

Layer / File(s)	Summary
Prevalidation & classifiers `crates/actors/src/block_validation.rs`	Adds non-consensus prevalidation variants, typed ValidationCancelReason, ExecutionPayloadUnavailable, and classification helpers (`is_node_fault` / `is_internal_failure`) plus tests.
ValidationResult wrappers & conversions `crates/actors/src/block_tree_service.rs`	Adds `InternalFailureError` and `ConsensusRejectionError`, remodels `ValidationResult`, updates metric labeling, and implements conversions from `ValidationError`/`PreValidationError`.

Validation orchestration

Layer / File(s)	Summary
BlockValidationTask stages & merging `crates/actors/src/validation_service/block_validation_task.rs`	Parent-wait reasons, abortable PoA blocking task, per-stage typed error mapping, and merge logic prioritizing node-fault InternalFailure then Invalid then cancellation/soft-internal.
ValidationService delivery & active bookkeeping `crates/actors/src/validation_service.rs`, `crates/actors/src/validation_service/active_validations.rs`	VDF parent-missing sentinel and lookup helper; `send_validation_result_via` panics only on node-fault InternalFailure; concurrent cancellation requeue/give-up cap and bookkeeping; tests for delivery and lookup.

Block tree behavior

Layer / File(s)	Summary
Discard & broadcast paths `crates/actors/src/block_tree_service.rs`	Adds `discard_and_broadcast`, bounded `recent_soft_internal_discards` LRU, `soft_internal_reason_tag` mapping, log-level split, panic on node-fault InternalFailure (including shutdown drain), and Valid-path LRU pop/recovery logging and metrics.

Shadow tx, refunds, metrics

Layer / File(s)	Summary
ShadowTx generator `crates/actors/src/shadow_tx_generator.rs`	Introduce `ShadowTxGenError` typed errors, snapshot-taint handling, map many generator failures to variants, and add regression tests.
Commitment refunds & metrics `crates/actors/src/commitment_refunds.rs`, `crates/actors/src/metrics.rs`	Add `CommitmentRefundError`, update refund function signatures, and add counters/helpers for concurrent cancel requeue/repeat and soft-internal discard/recovered labeled by reason.

Domain & cache

Layer / File(s)	Summary
BlockIndex anchored bounds `crates/domain/src/models/block_index.rs`	Add `get_block_bounds_at_height` with anchor-bounded search tolerant to late ledger introduction and tests for edge cases.
ExecutionPayloadCache watch-based `crates/domain/src/models/execution_payload_cache.rs`	Switch to per-hash `watch` notifiers, add `ExecutionPayloadWaitError` (ReceiverDisrupted/WaitTimeout), accept `wait_timeout` in constructor, and update wait APIs to return Result with tests.

P2P & block pool

Layer / File(s)	Summary
BlockPool prevalidation & dedup `crates/p2p/src/block_pool.rs`	Map wait-for-payload failures into retryable internal error, add `CachePoisoned`/`ParentNotInCache` branches, classify prevalidation via `.classify()`, add parked-block dedup state machine with `last_reprocess_at`, cooldown, `dedup_status_for_gossip`, and `dedup_and_maybe_emit_reprocess`, and tests.
Gossip/ingest call-sites `crates/p2p/src/gossip_data_handler.rs`, `crates/p2p/src/chain_sync.rs`	Replace simple processed predicate with `dedup_and_maybe_emit_reprocess` to allow reprocessing emissions for parked blocks.

Tests and call-site updates

Layer / File(s)	Summary
Chain-tests and actor-tests `crates/chain-tests/src/`, `crates/actors/tests/`	Update tests to new validation signatures (pass parent epoch/EMA snapshots), broaden matchers for concurrent rejection shapes, add PoA fallback regression test, and adjust actor test setups for BlockTree guarded calls.
ExecutionPayloadCache wiring `crates/chain/src/chain.rs`, `crates/p2p/src/tests/*`, `crates/p2p/src/tests/util.rs`, `crates/types/src/config/node.rs`	Pass configured `execution_payload_wait_timeout_millis` into `ExecutionPayloadCache::new`, add config field and defaults, and update tests/fixtures.
Docs & manifests `MAINNET_BETA.md`, `Cargo.toml`, `crates/actors/Cargo.toml`	Document mandatory external supervisor restart behavior on unrecoverable local faults; add `tracing-test` workspace dependency and crate dev-dep.

Possibly related PRs

Irys-xyz/irys#1156: Overlaps on crates/actors/src/block_tree_service.rs and BlockValidationFinished handling/tracing instrumentation.
Irys-xyz/irys#1159: Touches crates/actors/src/block_validation.rs and related prevalidation/validation signature changes.
Irys-xyz/irys#1118: Related PreValidationError variant evolution impacting validation classification.

Suggested reviewers

DanMacDonald
glottologist

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/validation-error-disamb

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/actors/src/block_validation.rs`:
- Around line 1421-1441: The code currently falls through to get_block_bounds
when latest.ledgers has no entry for ledger, causing missing-ledger to be
treated as an internal BlockBoundsLookupError; change the early disambiguation
in the block_index_guard read (the block where latest.ledgers.iter().find(...)
is used) to explicitly handle the None case for the ledger: if the find(...)
returns None, return a consensus-invalid PreValidationError (either reuse
PreValidationError::PoAChunkOffsetOutOfBlockBounds or add a new variant like
PreValidationError::PoALedgerNotPresent) instead of continuing to
get_block_bounds; keep the existing check that returns
PoAChunkOffsetOutOfBlockBounds when ledger_chunk_offset >= chain_max for
Some(chain_max).

In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 57-62: The validate_block path still re-reads the parent's epoch
snapshot from the block tree which reintroduces the eviction race; modify
validate_block to use the captured self.parent_epoch_snapshot (the
Arc<EpochSnapshot> field) everywhere it's needed (not only in
data_txs_are_valid) — specifically replace calls to
block_tree.get_epoch_snapshot()/similar lookups before PoA checks, shadow-tx
validation, and when computing cascade_active; derive cascade_active from the
captured snapshot via
config.consensus.hardforks.is_cascade_active_for_epoch(&self.parent_epoch_snapshot)
so the validator uses the same activation rule as the producer and avoids
ParentEpochSnapshotMissing races.

In `@crates/chain-tests/src/block_production/block_validation.rs`:
- Around line 436-450: The test currently computes cascade_active from
tree.canonical_epoch_snapshot(), causing divergence at epoch boundaries;
instead, derive cascade_active from the parent epoch snapshot you already
fetched: call
config_override.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch)
(or &parent_epoch_snapshot) after obtaining parent_epoch via
tree.get_epoch_snapshot(&parent_hash) and replace the use of
tree.canonical_epoch_snapshot() with this parent-based check so cascade_active
aligns with the parent_epoch_snapshot used elsewhere in the block validation
test.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 39f6ce7f-60cf-46ee-8775-7fe15e27f992

📥 Commits

Reviewing files that changed from the base of the PR and between 39a69b1 and d300e7c.

📒 Files selected for processing (6)

crates/actors/src/block_tree_service.rs
crates/actors/src/block_validation.rs
crates/actors/src/validation_service.rs
crates/actors/src/validation_service/block_validation_task.rs
crates/chain-tests/src/block_production/block_validation.rs
crates/chain-tests/src/validation/mod.rs

github-actions · 2026-05-13T13:41:10Z

Benchmark results: https://irys-xyz.github.io/irys/dev/bench/fix%2Fvalidation-error-disamb/index.html

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/actors/src/validation_service/block_validation_task.rs (1)
683-689: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't map every shadow-tx stage failure to ShadowTransactionInvalid.

shadow_transactions_are_valid() returns eyre::Result, so this branch currently turns payload-cache misses, DB/view failures, and missing parent lookups into ValidationError::ShadowTransactionInvalid. .into() then classifies all of them as consensus-invalid, which reintroduces peer-attribution for local failures.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/validation_service/block_validation_task.rs` around lines
683 - 689, The current match on shadow_tx_result treats any eyre::Result error
as ShadowTransactionInvalid; instead inspect the underlying error returned by
shadow_transactions_are_valid() (use err.downcast_ref::<...>() or pattern-match
known error enums from the payload cache / DB / parent lookup) and only map
genuine transaction-invalid errors to ValidationError::ShadowTransactionInvalid;
map cache-miss / DB/view failures / missing-parent cases to a transient/internal
validation result (e.g., ValidationError::InternalError or a transient variant
you have) so local failures are not classified as consensus-invalid. Reference
shadow_tx_result, shadow_transactions_are_valid(), and
ValidationError::ShadowTransactionInvalid when updating the match arm.

♻️ Duplicate comments (2)

crates/actors/src/validation_service/block_validation_task.rs (1)

384-400: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Reuse one captured parent snapshot set across the whole validation pass.

validate_block() still re-reads the parent epoch snapshot before PoA and again before shadow/cascade activation, and data_txs_are_valid() is still invoked without the parent snapshots/EMA. That means a valid block can still fail later with ParentEpochSnapshotMissing / Local*SnapshotMissing if the parent is evicted after scheduling. This change reclassifies the race, but it does not eliminate it.

Based on learnings: ensure the validator uses the same activation rule as the producer by deriving cascade_active from the parent_epoch_snapshot via config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot).

Also applies to: 477-499, 633-641
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/validation_service/block_validation_task.rs` around lines
384 - 400, The code currently re-reads the parent epoch snapshot multiple times
in validate_block() (the get_epoch_snapshot call that assigns
parent_epoch_snapshot) and then later before PoA and before shadow/cascade
activation, which can race if the parent is evicted; change validate_block() to
capture the parent_epoch_snapshot once (the value returned by
block_tree_guard.read().get_epoch_snapshot(&block.previous_block_hash)) and
reuse that same snapshot for all subsequent checks and for computing
cascade_active using
config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot);
also pass this captured parent_epoch_snapshot (and derived EMA/activation flag)
into data_txs_are_valid() instead of re-reading the snapshot; apply the same
single-capture-and-reuse fix to the other occurrences referenced (the similar
get_epoch_snapshot usages around the 477-499 and 633-641 regions).

crates/actors/src/block_validation.rs (1)

1480-1495: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat an absent latest-ledger entry as consensus-invalid too.

This pre-check only rejects Some(chain_max) if ledger_chunk_offset >= chain_max. If latest.ledgers has no entry for ledger, the code still falls through to get_block_bounds(...), and that error is now classified as BlockBoundsLookupError/internal. A PoA targeting a ledger that is not present on this chain will therefore be retried instead of rejected.

Suggested direction

         let bb = {
             let index = block_index_guard.read();
-            if let Some(latest) = index.get_latest_item()
-                && let Some(chain_max) = latest
-                    .ledgers
-                    .iter()
-                    .find(|l| l.ledger == ledger)
-                    .map(|l| l.total_chunks)
-                && ledger_chunk_offset >= chain_max
-            {
-                return Err(PreValidationError::PoAChunkOffsetOutOfBlockBounds);
+            if let Some(latest) = index.get_latest_item() {
+                match latest
+                    .ledgers
+                    .iter()
+                    .find(|l| l.ledger == ledger)
+                    .map(|l| l.total_chunks)
+                {
+                    Some(chain_max) if ledger_chunk_offset >= chain_max => {
+                        return Err(PreValidationError::PoAChunkOffsetOutOfBlockBounds);
+                    }
+                    None => {
+                        return Err(PreValidationError::PoAChunkOffsetOutOfBlockBounds);
+                    }
+                    _ => {}
+                }
             }
             index
                 .get_block_bounds(ledger, LedgerChunkOffset::from(ledger_chunk_offset))
                 .map_err(|e| PreValidationError::BlockBoundsLookupError(e.to_string()))?
         };

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/block_validation.rs` around lines 1480 - 1495, The check
currently only rejects when latest.ledgers contains the ledger and the offset is
out-of-bounds, but if latest.ledgers has no entry for ledger we should also
reject as consensus-invalid instead of falling through to get_block_bounds;
modify the block starting with let index = block_index_guard.read(); to
explicitly treat the case where latest exists but latest.ledgers.iter().find(|l|
l.ledger == ledger) is None as
Err(PreValidationError::PoAChunkOffsetOutOfBlockBounds) (i.e. return that error
before calling index.get_block_bounds), keeping the existing early-return when
ledger_chunk_offset >= chain_max and preserving the existing
get_block_bounds(...) call for the remaining path. Use the symbols index,
latest, latest.ledgers.iter().find(...), ledger, ledger_chunk_offset, and
PreValidationError::PoAChunkOffsetOutOfBlockBounds to locate and change the
logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/actors/src/block_tree_service.rs`:
- Around line 415-457: The InternalFailure branch in
on_block_validation_finished leaves the block marked as scheduled/validating and
never re-enqueues it, which can permanently stall the block and its descendants;
update this branch to clear the "validation scheduled" flag and re-schedule
validation for transient internal failures: after computing height/state (using
cache.get_block and cache.get_block_and_status as shown) call the routine that
clears the scheduled validation marker (e.g.,
clear_scheduled_validation(block_hash) or unset the scheduled flag in the cache
entry) and then enqueue the block for validation again (e.g.,
schedule_validation(block_hash) or push into the validation queue) so that
on_block_prevalidated will not short-circuit future retries; keep the existing
BlockStateUpdated event and logging, and ensure any cache writes use the same
locking/error handling pattern as elsewhere to avoid deadlocks.

---

Outside diff comments:
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 683-689: The current match on shadow_tx_result treats any
eyre::Result error as ShadowTransactionInvalid; instead inspect the underlying
error returned by shadow_transactions_are_valid() (use err.downcast_ref::<...>()
or pattern-match known error enums from the payload cache / DB / parent lookup)
and only map genuine transaction-invalid errors to
ValidationError::ShadowTransactionInvalid; map cache-miss / DB/view failures /
missing-parent cases to a transient/internal validation result (e.g.,
ValidationError::InternalError or a transient variant you have) so local
failures are not classified as consensus-invalid. Reference shadow_tx_result,
shadow_transactions_are_valid(), and ValidationError::ShadowTransactionInvalid
when updating the match arm.

---

Duplicate comments:
In `@crates/actors/src/block_validation.rs`:
- Around line 1480-1495: The check currently only rejects when latest.ledgers
contains the ledger and the offset is out-of-bounds, but if latest.ledgers has
no entry for ledger we should also reject as consensus-invalid instead of
falling through to get_block_bounds; modify the block starting with let index =
block_index_guard.read(); to explicitly treat the case where latest exists but
latest.ledgers.iter().find(|l| l.ledger == ledger) is None as
Err(PreValidationError::PoAChunkOffsetOutOfBlockBounds) (i.e. return that error
before calling index.get_block_bounds), keeping the existing early-return when
ledger_chunk_offset >= chain_max and preserving the existing
get_block_bounds(...) call for the remaining path. Use the symbols index,
latest, latest.ledgers.iter().find(...), ledger, ledger_chunk_offset, and
PreValidationError::PoAChunkOffsetOutOfBlockBounds to locate and change the
logic.

In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 384-400: The code currently re-reads the parent epoch snapshot
multiple times in validate_block() (the get_epoch_snapshot call that assigns
parent_epoch_snapshot) and then later before PoA and before shadow/cascade
activation, which can race if the parent is evicted; change validate_block() to
capture the parent_epoch_snapshot once (the value returned by
block_tree_guard.read().get_epoch_snapshot(&block.previous_block_hash)) and
reuse that same snapshot for all subsequent checks and for computing
cascade_active using
config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot);
also pass this captured parent_epoch_snapshot (and derived EMA/activation flag)
into data_txs_are_valid() instead of re-reading the snapshot; apply the same
single-capture-and-reuse fix to the other occurrences referenced (the similar
get_epoch_snapshot usages around the 477-499 and 633-641 regions).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: dbf07fde-a385-4341-859b-b2525f1dc16b

📥 Commits

Reviewing files that changed from the base of the PR and between d300e7c and 3125039.

📒 Files selected for processing (3)

crates/actors/src/block_tree_service.rs
crates/actors/src/block_validation.rs
crates/actors/src/validation_service/block_validation_task.rs

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/actors/src/validation_service/block_validation_task.rs (1)
684-688: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't collapse every shadow-tx failure into ShadowTransactionInvalid.

shadow_transactions_are_valid() also fails for local/runtime conditions such as missing execution payloads, DB/block-tree lookup failures, and parent-snapshot eviction. Converting every Err here into a consensus rejection defeats the new InternalFailure split and can still discard valid blocks.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/validation_service/block_validation_task.rs` around lines
684 - 688, The current match on shadow_tx_result collapses all failures into
ValidationError::ShadowTransactionInvalid; instead inspect the error returned by
shadow_transactions_are_valid (shadow_tx_result) and match its specific variants
so that transient/local/runtime failures (e.g. missing execution payloads,
DB/block-tree lookup failures, parent-snapshot eviction) are returned as an
InternalFailure variant while only true consensus-rejection errors map to
ValidationError::ShadowTransactionInvalid; update the match arm handling
Err(err) to distinguish error kinds (using the error enum/variants from
shadow_transactions_are_valid or helper methods on the error) and return
ValidationError::InternalFailure(...) for those local/runtime cases and
ValidationError::ShadowTransactionInvalid(err.to_string()) only for genuine
shadow-tx validation rejections.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 684-688: The current match on shadow_tx_result collapses all
failures into ValidationError::ShadowTransactionInvalid; instead inspect the
error returned by shadow_transactions_are_valid (shadow_tx_result) and match its
specific variants so that transient/local/runtime failures (e.g. missing
execution payloads, DB/block-tree lookup failures, parent-snapshot eviction) are
returned as an InternalFailure variant while only true consensus-rejection
errors map to ValidationError::ShadowTransactionInvalid; update the match arm
handling Err(err) to distinguish error kinds (using the error enum/variants from
shadow_transactions_are_valid or helper methods on the error) and return
ValidationError::InternalFailure(...) for those local/runtime cases and
ValidationError::ShadowTransactionInvalid(err.to_string()) only for genuine
shadow-tx validation rejections.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d715b1f8-4191-4ad7-9763-fbc8d2bef4f7

📥 Commits

Reviewing files that changed from the base of the PR and between 3125039 and 9c13980.

📒 Files selected for processing (3)

crates/actors/src/block_tree_service.rs
crates/actors/src/block_validation.rs
crates/actors/src/validation_service/block_validation_task.rs

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/actors/src/validation_service/block_validation_task.rs (1)
530-539: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don’t classify every shadow-tx error as consensus-invalid.

This stage still turns every eyre failure into ValidationResult::Invalid(ShadowTransactionInvalid). That includes non-consensus failures from get_commitment_snapshot(...), execution-payload provider calls, and other runtime errors, so transient/internal failures will still discard blocks and the "shadow_tx" metric is forced into the "invalid" bucket. Have the shadow-tx stage return a typed ValidationResult (or typed error) so only actual rule violations become Invalid; propagate infra/runtime failures as InternalFailure.

Also applies to: 562-572, 706-712
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/validation_service/block_validation_task.rs` around lines
530 - 539, The shadow-tx validation code currently maps every eyre failure into
ValidationResult::Invalid(ShadowTransactionInvalid) (e.g., the
block_tree_guard.read().get_commitment_snapshot(...) call and similar calls
later), which wrongly treats infra/runtime errors as consensus violations;
change the shadow-tx stage to return a typed Result<ValidationResult,
ShadowTxError> (or a ValidationResult enum that distinguishes Invalid(reason) vs
InternalFailure(error)), update callers of validate_shadow_transactions / the
shadow-tx stage to propagate InternalFailure for non-rule errors (including
errors from get_commitment_snapshot, execution-payload provider calls, and other
runtime failures) and only produce
ValidationResult::Invalid(ShadowTransactionInvalid) for explicit consensus rule
violations; ensure error construction preserves the original eyre error for
InternalFailure so logs/metrics can classify transient infra failures
separately.

♻️ Duplicate comments (1)

crates/actors/src/validation_service/block_validation_task.rs (1)
407-423: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Capture the parent epoch snapshot once and reuse it across validation.

validate_block() still re-reads the parent epoch snapshot from the live block tree for PoA and then again for shadow-tx/cascade. If that parent is pruned after scheduling, a valid block can still fail with ParentEpochSnapshotMissing. Capture the snapshot once before spawning the stage tasks and reuse that same Arc for PoA, shadow-tx validation, and cascade_active.

Based on learnings: ensure the validator uses the same activation rule as the producer by deriving cascade_active from the parent_epoch_snapshot via config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot).

Also applies to: 500-522
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/validation_service/block_validation_task.rs` around lines
407 - 423, Capture the parent epoch snapshot once at the start of
validate_block() by calling
block_tree_guard.read().get_epoch_snapshot(&block.previous_block_hash) and
storing the returned Arc as parent_epoch_snapshot (reuse the Some(snapshot)
branch value) instead of re-reading it later; then pass that same
parent_epoch_snapshot Arc into the PoA validation, shadow-tx/cascade validation
tasks and derive cascade_active from it via
config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot)
so all stages use the identical snapshot and activation rule (replace subsequent
calls that re-query the block tree with the stored parent_epoch_snapshot).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 530-539: The shadow-tx validation code currently maps every eyre
failure into ValidationResult::Invalid(ShadowTransactionInvalid) (e.g., the
block_tree_guard.read().get_commitment_snapshot(...) call and similar calls
later), which wrongly treats infra/runtime errors as consensus violations;
change the shadow-tx stage to return a typed Result<ValidationResult,
ShadowTxError> (or a ValidationResult enum that distinguishes Invalid(reason) vs
InternalFailure(error)), update callers of validate_shadow_transactions / the
shadow-tx stage to propagate InternalFailure for non-rule errors (including
errors from get_commitment_snapshot, execution-payload provider calls, and other
runtime failures) and only produce
ValidationResult::Invalid(ShadowTransactionInvalid) for explicit consensus rule
violations; ensure error construction preserves the original eyre error for
InternalFailure so logs/metrics can classify transient infra failures
separately.

---

Duplicate comments:
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 407-423: Capture the parent epoch snapshot once at the start of
validate_block() by calling
block_tree_guard.read().get_epoch_snapshot(&block.previous_block_hash) and
storing the returned Arc as parent_epoch_snapshot (reuse the Some(snapshot)
branch value) instead of re-reading it later; then pass that same
parent_epoch_snapshot Arc into the PoA validation, shadow-tx/cascade validation
tasks and derive cascade_active from it via
config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot)
so all stages use the identical snapshot and activation rule (replace subsequent
calls that re-query the block tree with the stored parent_epoch_snapshot).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 556dd5d7-e59a-4338-9fff-e6ba26158f23

📥 Commits

Reviewing files that changed from the base of the PR and between d52dad3 and d727f9f.

📒 Files selected for processing (1)

crates/actors/src/validation_service/block_validation_task.rs

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/actors/src/block_validation.rs`:
- Around line 1424-1442: Convert the table-driven tests to rstest parameterized
cases: change the test function validation_cancel_reason_classifier_dispatch to
use #[rstest] with a #[case] for each ValidationCancelReason variant
(HeightDifference, ParentMissing, ChannelClosed), accept a parameter named
reason, and run the two assertions inside the body
(assert!(!reason.is_internal()) and
assert!(!ValidationError::ValidationCancelled { reason
}.is_internal_failure())). Do the same refactor for the other table-driven test
referenced (lines 1448-1501) so each variant is a separate #[case], producing
per-case test failures and following the repo convention.
- Around line 1507-1510: Add a regression test that constructs a "latest index"
entry that omits the target PoA ledger and calls poa_is_valid() on that
constructed entry, asserting it returns
Err(PreValidationError::PoALedgerInactive { ledger_id: <expected> });
specifically, instead of only instantiating
PreValidationError::PoALedgerInactive directly, build the latest index entry
(using the same builders/helpers used elsewhere in this module) without adding
the target ledger so the validation path that previously produced
BlockBoundsLookupError is exercised, then call poa_is_valid() and assert it
yields the PoALedgerInactive variant.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6f9bf9bd-2063-4459-b850-c8d26a1510e4

📥 Commits

Reviewing files that changed from the base of the PR and between d52dad3 and 3e913fb.

📒 Files selected for processing (4)

crates/actors/src/block_tree_service.rs
crates/actors/src/block_validation.rs
crates/actors/src/validation_service.rs
crates/actors/src/validation_service/block_validation_task.rs

JesseTheRobot

@coderabbitai full review

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

crates/actors/src/block_validation.rs (1)
1487-1505: 🧹 Nitpick | 🔵 Trivial

Convert these manual classifier tables to rstest cases.

These tests are already table-driven; #[rstest] will give per-case failures and keep them aligned with the repo's test conventions. As per coding guidelines Use rstest for parameterized tests.

Also applies to: 1580-1608, 1614-1638, 1654-1704
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/block_validation.rs` around lines 1487 - 1505, Replace the
manual loop/table-style assertions in the test function
validation_cancel_reason_classifier_dispatch with an rstest parameterized test:
add #[rstest] and #[case(...)] entries for each ValidationCancelReason variant
(e.g., HeightDifference, ParentMissing, ChannelClosed), accept a reason:
ValidationCancelReason parameter, and run the two assertions against
reason.is_internal() and ValidationError::ValidationCancelled { reason
}.is_internal_failure(); do the same conversion for the other similar test
blocks that check classifier tables so each case shows per-case failures under
rstest.
crates/actors/src/block_tree_service.rs (1)
435-485: ⚠️ Potential issue | 🟠 Major

Soft InternalFailure still leaves the block permanently unschedulable.

This path keeps the cache entry and returns without clearing the validation-scheduled state or providing another retry path. Because on_block_prevalidated() already short-circuits when the block is still cached, a transient internal failure can strand the block — and any descendants waiting on it — until depth pruning eventually evicts it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/block_tree_service.rs` around lines 435 - 485, The
InternalFailure branch in on_block_validation_finished currently leaves the
cache entry as-is which preserves the "validation-scheduled" state and prevents
on_block_prevalidated from ever retrying; change this path to clear the
scheduled/validating flag and enqueue the block for retry (or at least mark it
as not-scheduled) instead of just returning. Specifically, in the
ValidationResult::InternalFailure arm in on_block_validation_finished, update
the cached block status (via cache.get_block_and_status / cache mutation
helpers) to remove the validation-scheduled marker (or set state to a retryable
state), update chain_sync_state.record_block_validation_error as needed, and
emit the BlockStateUpdated event with the new state so
service_senders.block_state_events consumers can retry; ensure this complements
on_block_prevalidated's short-circuit logic so transient internal failures are
retried rather than permanently stranded.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/actors/src/block_validation.rs`:
- Around line 699-725: The is_internal_failure() method currently treats all
unlisted ValidationError variants as consensus-invalid via the `_ => false`
fallback; change that fallback to conservative `_ => true` and/or add explicit
match arms returning true for the variants produced by runtime/local-dependent
helpers (the errors constructed by recall_recall_range_is_valid(),
shadow_transactions_are_valid(), submit_payload_to_reth(), etc.) so those
runtime/transport/VDF-related failure cases are classified as internal; edit the
match in is_internal_failure to return true for those specific variants (or flip
the default) rather than defaulting to false.

---

Duplicate comments:
In `@crates/actors/src/block_tree_service.rs`:
- Around line 435-485: The InternalFailure branch in
on_block_validation_finished currently leaves the cache entry as-is which
preserves the "validation-scheduled" state and prevents on_block_prevalidated
from ever retrying; change this path to clear the scheduled/validating flag and
enqueue the block for retry (or at least mark it as not-scheduled) instead of
just returning. Specifically, in the ValidationResult::InternalFailure arm in
on_block_validation_finished, update the cached block status (via
cache.get_block_and_status / cache mutation helpers) to remove the
validation-scheduled marker (or set state to a retryable state), update
chain_sync_state.record_block_validation_error as needed, and emit the
BlockStateUpdated event with the new state so service_senders.block_state_events
consumers can retry; ensure this complements on_block_prevalidated's
short-circuit logic so transient internal failures are retried rather than
permanently stranded.

In `@crates/actors/src/block_validation.rs`:
- Around line 1487-1505: Replace the manual loop/table-style assertions in the
test function validation_cancel_reason_classifier_dispatch with an rstest
parameterized test: add #[rstest] and #[case(...)] entries for each
ValidationCancelReason variant (e.g., HeightDifference, ParentMissing,
ChannelClosed), accept a reason: ValidationCancelReason parameter, and run the
two assertions against reason.is_internal() and
ValidationError::ValidationCancelled { reason }.is_internal_failure(); do the
same conversion for the other similar test blocks that check classifier tables
so each case shows per-case failures under rstest.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9e6a57e1-ab83-4181-b413-d2c607c47e4c

📥 Commits

Reviewing files that changed from the base of the PR and between d52dad3 and 8b669b4.

📒 Files selected for processing (6)

crates/actors/src/block_tree_service.rs
crates/actors/src/block_validation.rs
crates/actors/src/validation_service.rs
crates/actors/src/validation_service/block_validation_task.rs
crates/chain-tests/src/block_production/block_validation.rs
crates/p2p/src/block_pool.rs

JesseTheRobot · 2026-05-14T23:17:44Z

@coderabbitai full review

coderabbitai · 2026-05-14T23:17:49Z

✅ Actions performed

Full review triggered.

coderabbitai

♻️ Duplicate comments (4)

crates/actors/src/validation_service.rs (1)
436-462: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a cancellation result for aborted concurrent tasks.

Line 436 already distinguishes e.is_cancelled(), but Lines 455-462 still report both branches as ValidationError::TaskPanicked. That turns aborted tasks into InternalFailure(TaskPanicked), which skews both diagnostics and the new invalid/internal-failure split. Route cancellations through the existing cancellation variant instead of the panic variant.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/validation_service.rs` around lines 436 - 462, The code
treats both cancelled and panicked concurrent validation tasks as
ValidationError::TaskPanicked; change the branch so when e.is_cancelled() is
true you send the cancellation validation variant instead of TaskPanicked (i.e.,
use the existing cancellation enum variant — e.g. ValidationError::Cancelled or
ValidationError::TaskCancelled — with the same task/details fields) in the call
to self.send_validation_result, leaving the panic branch to continue sending
ValidationError::TaskPanicked; update the error message/logging if needed so the
sent result matches e.is_cancelled().
crates/actors/src/block_tree_service.rs (1)
415-485: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Soft InternalFailure still leaves the block unschedulable.

This branch keeps the block in cache and returns without clearing ValidationScheduled or re-queueing. Because on_block_prevalidated exits early for cached blocks on Lines 296-303, neither re-gossip nor orphan resolution can retry validation until depth pruning evicts the entry. A transient internal failure can therefore stall the block and its descendants indefinitely.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/block_tree_service.rs` around lines 415 - 485, The
InternalFailure branch in on_block_validation_finished leaves the cache entry in
ValidationScheduled, which prevents on_block_prevalidated from retrying the
block; update this branch to clear the ValidationScheduled marker (or change the
block's status to a non-scheduled state) and/or re-enqueue the block for
prevalidation so transient internal failures don't stall validation permanently:
locate the ValidationResult::InternalFailure handling in
on_block_validation_finished, after recording the error with
chain_sync_state.record_block_validation_error and before returning, mutate the
cache entry that matches block_hash to remove ValidationScheduled (or set state
to NotOnchain/Unknown) and/or push the block back onto the prevalidation queue
so on_block_prevalidated can attempt validation again; ensure the
BlockStateUpdated event (BlockStateUpdated { ... validation_result }) reflects
the new status and that service_senders.block_state_events still broadcasts the
update.
crates/actors/src/block_validation.rs (1)
703-739: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep mixed runtime wrappers out of the consensus-invalid fallback.

RecallRangeInvalid and ExecutionLayerFailed still fall through _ => false here, but both construction paths can carry local/runtime failures: recall_recall_range_is_valid() can fail on steps_guard.read().get_steps(...), and submit_payload_to_reth() still uses ? for engine RPC/transport errors. In those cases a local outage is still reported as Invalid, so a valid block gets discarded instead of retained for retry/escalation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/block_validation.rs` around lines 703 - 739, The
is_internal_failure() matcher currently treats RecallRangeInvalid and
ExecutionLayerFailed as consensus-invalid (false) even though their inner causes
can be local/runtime outages; update is_internal_failure() to inspect those
variant payloads and return true when their contained error indicates a
local/internal failure (e.g., errors produced by
recall_recall_range_is_valid()’s steps_guard read/get_steps or engine/transport
errors from submit_payload_to_reth()). Concretely, add arms for
Self::RecallRangeInvalid { source, .. } and Self::ExecutionLayerFailed { source,
.. } (or the exact field names in ValidationError) and call a helper on the
inner error (or match its concrete error kinds) to return true for
IO/transport/lock/read-related failures and false for genuine consensus-invalid
reasons so that local outages are classified as InternalFailure instead of
Invalid.
crates/actors/src/validation_service/block_validation_task.rs (1)
424-440: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reuse the first parent epoch snapshot instead of re-reading it.

This task fetches the parent epoch snapshot for PoA and then fetches the same snapshot again before spawning the shadow/data stages. If the parent is evicted between those two reads, a block that already had a valid captured snapshot now fails with ParentEpochSnapshotMissing. Clone the first Arc<EpochSnapshot> into the later tasks and only fetch the EMA snapshot in the second block.
Based on learnings: ensure the validator uses the same activation rule as the producer by deriving cascade_active from the parent_epoch_snapshot via config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot).

Also applies to: 516-553
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/validation_service/block_validation_task.rs` around lines
424 - 440, The code currently re-reads the parent epoch snapshot (via
block_tree_guard.read().get_epoch_snapshot(&block.previous_block_hash)) before
spawning shadow/data stages which can fail if the snapshot is evicted; instead,
clone the already obtained parent_epoch_snapshot Arc and pass that clone into
the later tasks, remove the second get_epoch_snapshot call so only the EMA
snapshot is fetched there, and compute cascade_active by calling
config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot)
(apply the same change to the analogous block around the 516-553 range). Ensure
all references to the parent snapshot in spawn/shadow/data-stage code use the
cloned parent_epoch_snapshot variable rather than re-reading from
block_tree_guard.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@crates/actors/src/block_tree_service.rs`:
- Around line 415-485: The InternalFailure branch in
on_block_validation_finished leaves the cache entry in ValidationScheduled,
which prevents on_block_prevalidated from retrying the block; update this branch
to clear the ValidationScheduled marker (or change the block's status to a
non-scheduled state) and/or re-enqueue the block for prevalidation so transient
internal failures don't stall validation permanently: locate the
ValidationResult::InternalFailure handling in on_block_validation_finished,
after recording the error with chain_sync_state.record_block_validation_error
and before returning, mutate the cache entry that matches block_hash to remove
ValidationScheduled (or set state to NotOnchain/Unknown) and/or push the block
back onto the prevalidation queue so on_block_prevalidated can attempt
validation again; ensure the BlockStateUpdated event (BlockStateUpdated { ...
validation_result }) reflects the new status and that
service_senders.block_state_events still broadcasts the update.

In `@crates/actors/src/block_validation.rs`:
- Around line 703-739: The is_internal_failure() matcher currently treats
RecallRangeInvalid and ExecutionLayerFailed as consensus-invalid (false) even
though their inner causes can be local/runtime outages; update
is_internal_failure() to inspect those variant payloads and return true when
their contained error indicates a local/internal failure (e.g., errors produced
by recall_recall_range_is_valid()’s steps_guard read/get_steps or
engine/transport errors from submit_payload_to_reth()). Concretely, add arms for
Self::RecallRangeInvalid { source, .. } and Self::ExecutionLayerFailed { source,
.. } (or the exact field names in ValidationError) and call a helper on the
inner error (or match its concrete error kinds) to return true for
IO/transport/lock/read-related failures and false for genuine consensus-invalid
reasons so that local outages are classified as InternalFailure instead of
Invalid.

In `@crates/actors/src/validation_service.rs`:
- Around line 436-462: The code treats both cancelled and panicked concurrent
validation tasks as ValidationError::TaskPanicked; change the branch so when
e.is_cancelled() is true you send the cancellation validation variant instead of
TaskPanicked (i.e., use the existing cancellation enum variant — e.g.
ValidationError::Cancelled or ValidationError::TaskCancelled — with the same
task/details fields) in the call to self.send_validation_result, leaving the
panic branch to continue sending ValidationError::TaskPanicked; update the error
message/logging if needed so the sent result matches e.is_cancelled().

In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 424-440: The code currently re-reads the parent epoch snapshot
(via block_tree_guard.read().get_epoch_snapshot(&block.previous_block_hash))
before spawning shadow/data stages which can fail if the snapshot is evicted;
instead, clone the already obtained parent_epoch_snapshot Arc and pass that
clone into the later tasks, remove the second get_epoch_snapshot call so only
the EMA snapshot is fetched there, and compute cascade_active by calling
config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot)
(apply the same change to the analogous block around the 516-553 range). Ensure
all references to the parent snapshot in spawn/shadow/data-stage code use the
cloned parent_epoch_snapshot variable rather than re-reading from
block_tree_guard.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0de83319-ef08-4782-b2a7-3ae735dc4776

📥 Commits

Reviewing files that changed from the base of the PR and between 47923f4 and 6e285b5.

📒 Files selected for processing (6)

crates/actors/src/block_tree_service.rs
crates/actors/src/block_validation.rs
crates/actors/src/validation_service.rs
crates/actors/src/validation_service/block_validation_task.rs
crates/chain-tests/src/block_production/block_validation.rs
crates/p2p/src/block_pool.rs

JesseTheRobot · 2026-05-19T18:33:37Z

@coderrabbitai full review

JesseTheRobot · 2026-05-19T18:36:45Z

@coderabbitai full review

coderabbitai · 2026-05-19T18:36:51Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/domain/src/models/block_index.rs (1)
253-293: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Sparse ledger entries still break anchored lookups.

This method assumes the anchor block itself contains ledger, and later treats missing probed ledgers as 0. That only works for “ledger not introduced yet”. In this file, push_block() writes entries only for ledgers present in block.data_ledgers, so if a later block omits an already-existing ledger, anchoring to that height will either fail early or binary-search with the wrong total. The anchored query needs the carried-forward total from the latest <= anchor_height block that has that ledger.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/domain/src/models/block_index.rs` around lines 253 - 293, The anchored
lookup wrongly assumes the anchor block contains the ledger and treats missing
ledgers during probes as total_chunks=0; instead, when computing anchor_max you
must walk backwards from anchor_height to find the most recent block <=
anchor_height that contains ledger and use its total_chunks (use
block_index_item_by_height in a loop until found or zero), and likewise when
probing in the binary search replace unwrap_or(0) behavior with a carry-forward
lookup that finds the latest prior block <= mid that has the ledger (not just
treating absent ledgers as 0); adjust uses of chunk_offset, anchor_height and
ledger accordingly so anchors and mid probes use carried-forward totals
consistent with how push_block writes only present ledgers.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/actors/src/shadow_tx_generator.rs`:
- Around line 237-255: The check that detects a promoted publish transaction
also scheduled for a perm_fee refund should return a peer-structure error, not a
snapshot invariant; change the error constructor used in the loop that iterates
publish_ledger.txs and ledger_expiry_balance_delta.user_perm_fee_refunds from
ShadowTxGenError::SnapshotInvariant(...) to ShadowTxGenError::Structural(...) so
the condition is classified as a malformed block shape (consistent with the
surrounding comment and generate_expected_shadow_transactions validation
behavior).

---

Outside diff comments:
In `@crates/domain/src/models/block_index.rs`:
- Around line 253-293: The anchored lookup wrongly assumes the anchor block
contains the ledger and treats missing ledgers during probes as total_chunks=0;
instead, when computing anchor_max you must walk backwards from anchor_height to
find the most recent block <= anchor_height that contains ledger and use its
total_chunks (use block_index_item_by_height in a loop until found or zero), and
likewise when probing in the binary search replace unwrap_or(0) behavior with a
carry-forward lookup that finds the latest prior block <= mid that has the
ledger (not just treating absent ledgers as 0); adjust uses of chunk_offset,
anchor_height and ledger accordingly so anchors and mid probes use
carried-forward totals consistent with how push_block writes only present
ledgers.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: de963991-3d42-46b3-961d-b9b80fd5121f

📥 Commits

Reviewing files that changed from the base of the PR and between 3293b75 and 425f23d.

📒 Files selected for processing (22)

MAINNET_BETA.md
crates/actors/src/block_producer.rs
crates/actors/src/block_tree_service.rs
crates/actors/src/block_validation.rs
crates/actors/src/commitment_refunds.rs
crates/actors/src/metrics.rs
crates/actors/src/shadow_tx_generator.rs
crates/actors/src/validation_service.rs
crates/actors/src/validation_service/active_validations.rs
crates/actors/src/validation_service/block_validation_task.rs
crates/actors/tests/block_validation_tests.rs
crates/chain-tests/src/block_production/block_validation.rs
crates/chain-tests/src/multi_node/vdf_validation_progress.rs
crates/chain-tests/src/utils.rs
crates/chain-tests/src/validation/data_tx_pricing.rs
crates/chain-tests/src/validation/poa_cases.rs
crates/domain/src/models/block_index.rs
crates/domain/src/models/execution_payload_cache.rs
crates/p2p/src/block_pool.rs
crates/p2p/src/block_status_provider.rs
crates/p2p/src/tests/block_pool/mod.rs
crates/p2p/src/tests/integration/mod.rs

JesseTheRobot · 2026-05-20T12:57:56Z

@coderabbitai full review

coderabbitai · 2026-05-20T12:58:03Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 7

♻️ Duplicate comments (1)

crates/actors/src/shadow_tx_generator.rs (1)

267-271: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Classify the promoted publish/refund overlap as Structural.

This guard is still rejecting malformed peer block shape, not corrupt local snapshot state. Returning SnapshotInvariant here routes the same input through the internal/node-fault path instead of consensus rejection for any caller that misses the earlier validation-site check.

Minimal fix

-                    return Err(ShadowTxGenError::SnapshotInvariant(format!(
+                    return Err(ShadowTxGenError::Structural(format!(
                         "Transaction {} is in publish ledger but also has a perm_fee refund scheduled. \
                         Promoted transactions should not receive refunds.",
                         tx.id
                     )));

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/shadow_tx_generator.rs` around lines 267 - 271, The error
returned when a transaction is both in the publish ledger and has a perm_fee
refund should be classified as structural, not a snapshot invariant: update the
return in shadow_tx_generator (where it currently returns
ShadowTxGenError::SnapshotInvariant(...)) to return
ShadowTxGenError::Structural(...) with the same descriptive message so malformed
peer block shape is routed as consensus/structural rejection rather than an
internal snapshot invariant failure.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 856-895: The cancel path currently uses tokio::join! (stages_join)
which only becomes Ready when all six stage futures finish, so a
partially-completed stage (e.g., a NodeFault) is missed when cancellation fires;
fix by replacing the tokio::join!-based composite (stages_join) with a drainable
collection (e.g., futures::stream::FuturesUnordered or individually polled
futures) that yields per-stage completions, wire that into the select with
exit_if_block_is_too_old and, in the cancel branch after aborting poa via
poa_abort_slot, repeatedly poll/drain the collection to collect any Ready stage
results and then call merge_stage_results_with_cancel (using the same
StageOutcome variants: Joined, JoinedAfterCancel, Cancelled) so
already-completed NodeFaults are observed and merged before returning.
- Around line 561-584: The Err branch in the match that maps `res` to
`ValidationResult` currently treats all `Err(err)` as
`ValidationError::TaskPanicked`; update that branch to first check
`err.is_cancelled()` (the `JoinError` from the PoA `JoinHandle`) and if true
return `ValidationError::ValidationCancelled` (preserving a helpful `details`
string), otherwise preserve the existing panic handling and return
`ValidationError::TaskPanicked` (as currently done); refer to the match around
`let result: ValidationResult = match res { ... }`, the `AbortHandle::abort()`
usage that can produce cancelled `JoinError`, and the
`ValidationError::TaskPanicked` / `ValidationError::ValidationCancelled`
variants to locate where to implement the `is_cancelled()` check and mapping.

In `@crates/chain-tests/src/validation/data_tx_pricing.rs`:
- Around line 149-155: The test currently treats any
ValidationError::ShadowTransactionInvalid(_) as acceptable which allows
unrelated failures to pass; update both match arms that now use
ValidationError::ShadowTransactionInvalid(_) so they only accept the
shadow-validation variant(s) that indicate a perm-fee problem (i.e., replace the
wildcard with the concrete inner variant(s) that represent perm-fee errors from
the ShadowTransactionInvalid payload—for example
InsufficientPermFee/MissingPermFee-style variants) and apply the same narrowed
pattern to the other identical assertion so only perm-fee-related shadow
failures satisfy the test.

In `@crates/domain/src/models/execution_payload_cache.rs`:
- Around line 330-359: Wrap both the network request and the receiver await
under the same tokio::time::timeout so the configured wait_timeout applies to
the full request_payload_from_the_network(...) + receiver wait path (start the
Instant before calling request_payload_from_the_network and use
timeout(self.wait_timeout, async { request...; receiver.await })). On the
timeout branch clear the in-flight marker in
payloads_currently_requested_from_the_network for the given evm_block_hash and
also remove the orphaned sender from payload_senders (call pop(evm_block_hash))
so is_waiting_for_payload() becomes false; preserve returning
ExecutionPayloadWaitError::WaitTimeout with elapsed_ms.
- Around line 356-359: The numeric narrowing of started.elapsed().as_millis()
(u128) to u64 should use checked conversion instead of an as cast; replace
occurrences like the elapsed_ms: started.elapsed().as_millis() as u64 in the
ExecutionPayloadWaitError::WaitTimeout construction with a checked conversion
using u64::try_from(started.elapsed().as_millis()) and handle the Result (for
example with unwrap_or(u64::MAX) or propagating an error) and apply the same
change to all similar instances in this file.

In `@crates/p2p/src/block_status_provider.rs`:
- Around line 504-526: The test block_status_returns_in_tree_pending_validation
is missing the ChainState::NotOnchain(BlockState::ValidationScheduled) case —
add a new #[case(ChainState::NotOnchain(BlockState::ValidationScheduled))] to
the rstest parameterization so the test covers all four ChainState variants that
should map to InTreePendingValidation (refer to the function
block_status_returns_in_tree_pending_validation and the BlockState/ChainState
enums when adding the case).

In `@crates/types/src/config/node.rs`:
- Around line 731-740: The tests need to assert the new serialized default for
the execution payload timeout so refactors don't silently change behavior:
update the config serialization/legacy/testing tests that exercise defaults
(where other defaults are asserted) to include an assertion for the public field
execution_payload_wait_timeout_millis and/or its provider
default_execution_payload_wait_timeout_millis (expected value 60_000 or test
override 5_000 depending on the test case), and add equivalent checks in the
legacy/roundtrip config tests so both wire and runtime defaults are pinned.

---

Duplicate comments:
In `@crates/actors/src/shadow_tx_generator.rs`:
- Around line 267-271: The error returned when a transaction is both in the
publish ledger and has a perm_fee refund should be classified as structural, not
a snapshot invariant: update the return in shadow_tx_generator (where it
currently returns ShadowTxGenError::SnapshotInvariant(...)) to return
ShadowTxGenError::Structural(...) with the same descriptive message so malformed
peer block shape is routed as consensus/structural rejection rather than an
internal snapshot invariant failure.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0908b3aa-de0b-4649-ac73-8162f22a0de3

📥 Commits

Reviewing files that changed from the base of the PR and between 3293b75 and e1e03a3.

📒 Files selected for processing (25)

MAINNET_BETA.md
crates/actors/src/block_producer.rs
crates/actors/src/block_tree_service.rs
crates/actors/src/block_validation.rs
crates/actors/src/commitment_refunds.rs
crates/actors/src/metrics.rs
crates/actors/src/shadow_tx_generator.rs
crates/actors/src/validation_service.rs
crates/actors/src/validation_service/active_validations.rs
crates/actors/src/validation_service/block_validation_task.rs
crates/actors/tests/block_validation_tests.rs
crates/chain-tests/src/block_production/block_validation.rs
crates/chain-tests/src/multi_node/vdf_validation_progress.rs
crates/chain-tests/src/utils.rs
crates/chain-tests/src/validation/data_tx_pricing.rs
crates/chain-tests/src/validation/poa_cases.rs
crates/chain/src/chain.rs
crates/domain/src/models/block_index.rs
crates/domain/src/models/execution_payload_cache.rs
crates/p2p/src/block_pool.rs
crates/p2p/src/block_status_provider.rs
crates/p2p/src/tests/block_pool/mod.rs
crates/p2p/src/tests/integration/mod.rs
crates/p2p/src/tests/util.rs
crates/types/src/config/node.rs

coderabbitai · 2026-05-20T13:10:12Z

        self.request_payload_from_the_network(*evm_block_hash, request_only_from_trusted_peers)
            .await;
-        receiver.await.ok()
+        // Wrap the wait in `tokio::time::timeout` so a peer that advertises
+        // a block header but never serves the EVM payload cannot stall the
+        // validation task until the `payload_senders` LRU happens to evict
+        // our slot (essentially unbounded under low load). The bare
+        // `receiver.await` errors only when the matching oneshot sender is
+        // dropped (LRU eviction or explicit cache removal); surface that as
+        // `ReceiverDisrupted`, and surface a `timeout`-arm trip as the
+        // distinct `WaitTimeout` so operators can root-cause from logs.
+        // Both map to `ValidationError::ExecutionPayloadCacheEvicted`
+        // downstream — see `ExecutionPayloadWaitError` doc for the caller
+        // contract.
+        let started = std::time::Instant::now();
+        match tokio::time::timeout(self.wait_timeout, receiver).await {
+            Ok(Ok(sealed_block)) => Ok(sealed_block),
+            Ok(Err(_)) => Err(ExecutionPayloadWaitError::ReceiverDisrupted {
+                evm_block_hash: *evm_block_hash,
+            }),
+            Err(_) => {
+                // Pop the now-orphaned sender entry so it cannot leak into
+                // the 1000-slot LRU and displace legitimate waiters. Same
+                // cleanup the cache-hit fast path performs above; receiver
+                // has been consumed by `timeout` so there's nothing else to
+                // drop here.
+                self.payload_senders.write().await.pop(evm_block_hash);
+                Err(ExecutionPayloadWaitError::WaitTimeout {
+                    evm_block_hash: *evm_block_hash,
+                    elapsed_ms: started.elapsed().as_millis() as u64,
+                })


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make the timeout cover the full request+wait path, and clear the in-flight marker on expiry.

Right now request_payload_from_the_network(...).await can burn the full 10×5s retry budget before tokio::time::timeout(...) even starts, so this path can still block for roughly 110s. The timeout arm also leaves payloads_currently_requested_from_the_network populated, so is_waiting_for_payload() can remain true after the request loop and waiter have both already ended.

Suggested fix

- self.request_payload_from_the_network(*evm_block_hash, request_only_from_trusted_peers) - .await; + let started = std::time::Instant::now(); + self.request_payload_from_the_network(*evm_block_hash, request_only_from_trusted_peers) + .await; + let remaining = self.wait_timeout.saturating_sub(started.elapsed()); @@ - let started = std::time::Instant::now(); - match tokio::time::timeout(self.wait_timeout, receiver).await { + match tokio::time::timeout(remaining, receiver).await { Ok(Ok(sealed_block)) => Ok(sealed_block), Ok(Err(_)) => Err(ExecutionPayloadWaitError::ReceiverDisrupted { evm_block_hash: *evm_block_hash, }), Err(_) => { + { + let mut cache = self.cache.write().await; + cache + .payloads_currently_requested_from_the_network + .pop(evm_block_hash); + } self.payload_senders.write().await.pop(evm_block_hash); Err(ExecutionPayloadWaitError::WaitTimeout { evm_block_hash: *evm_block_hash, elapsed_ms: started.elapsed().as_millis() as u64, })

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/domain/src/models/execution_payload_cache.rs` around lines 330 - 359, Wrap both the network request and the receiver await under the same tokio::time::timeout so the configured wait_timeout applies to the full request_payload_from_the_network(...) + receiver wait path (start the Instant before calling request_payload_from_the_network and use timeout(self.wait_timeout, async { request...; receiver.await })). On the timeout branch clear the in-flight marker in payloads_currently_requested_from_the_network for the given evm_block_hash and also remove the orphaned sender from payload_senders (call pop(evm_block_hash)) so is_waiting_for_payload() becomes false; preserve returning ExecutionPayloadWaitError::WaitTimeout with elapsed_ms.

JesseTheRobot · 2026-05-20T22:53:42Z

@coderabbitai full review

coderabbitai · 2026-05-20T22:53:46Z

✅ Actions performed

Full review triggered.

…lure Expand PreValidationError::is_internal_failure() beyond the bootstrap InternalTaskJoin entry to cover every variant whose construction sites are unambiguously local/runtime failures: - DatabaseError — every site wraps an MDBX db.tx()/view() failure - PreviousTxInclusionsFailed — block_tree channel/actor failure - AddBlockFailed, UpdateCacheForScheduledValidationError — local in-memory block-tree cache mutation - ValidationServiceUnreachable — local actor channel dead - SystemTimeError — OS clock failure - RewardCurveError — local arithmetic on height-derived inputs (peer-supplied reward is checked separately via RewardMismatch) - FeeCalculationFailed — local arithmetic on config inputs (per-tx comparisons use the Insufficient*Fee variants) - EmaSnapshotError — local snapshot computation, no real failure path These previously routed through block_pool as `CriticalBlockPoolError:: BlockError`, which is peer-attributed bogus data. With this change they route through OtherInternal (peer is innocent) and the cached block is retained for retry via the path added for InternalTaskJoin. Also delete two dead variants flagged during the audit (TransactionFetchFailed in PreValidationError and CommitmentTransactionFetchFailed in ValidationError) which have zero construction sites in the workspace. Variants still requiring source-side disambiguation before adding (handled in subsequent commits): BlockEmaSnapshotNotFound, ParentEpochSnapshotNotFound, BlockBoundsLookupError.

…nternal BlockEmaSnapshotNotFound and ParentEpochSnapshotNotFound only fire from one construction site each, inside data_txs_are_valid, when the parent block's snapshot has fallen out of the in-memory block-tree window. This is a race against eviction (parent existed at prevalidation time and was verified there) — the block's validity is unknown locally, not bad. The old names suggested a consensus failure ("not found"). Rename to make the locality explicit and add doc comments capturing the race semantics. Classify both as is_internal_failure so block_pool routes them through OtherInternal (peer innocent) instead of BlockError. Field renamed from `block_hash` to `parent_hash` so the variant body unambiguously identifies which block's snapshot is missing.

BlockBoundsLookupError had three construction sites, one of which was overloaded: the PoA validation call to block_index.get_block_bounds could fail either because the peer's offset is past the chain (a consensus failure) or because of empty index / DB I/O / ledger missing (internal). Pre-check at the call site: read the latest block_index item's total_chunks for the ledger; if the peer-supplied ledger_chunk_offset is past it, reject as PoAChunkOffsetOutOfBlockBounds (existing consensus variant). Only after that pre-check do we call get_block_bounds, so any error it surfaces from this site is now unambiguously local. The other two construction sites (in get_assigned_ingress_proofs) feed block hashes already present in our own DB cache, so failures there are already pure DB consistency issues. Classify BlockBoundsLookupError as is_internal_failure now that every construction site is unambiguously local. Add a classifier test.

…a From Previously, every Err returned from the concurrent validation tasks (data_txs_are_valid, poa_is_valid, etc.) was wrapped into ValidationResult::Invalid by hand at the call site. That meant a local race — e.g. block_tree evicting the parent's snapshot mid-validation, surfacing as PreValidationError::LocalEmaSnapshotMissing — would mark the block as consensus-invalid and BlockTreeService would discard it. Make ValidationResult variant selection a single From impl: - ValidationError gains is_internal_failure(), delegating to PreValidationError::is_internal_failure() for the PreValidation variant and additionally flagging TaskPanicked, ValidationCancelled, ParentCommitmentSnapshotMissing, ParentEpochSnapshotMissing, ParentBlockMissing. - New ValidationResult::InternalFailure(ValidationError) variant. - From<ValidationError> for ValidationResult dispatches on the classifier; From<PreValidationError> chains through. - ValidationResult::metric_label() returns "valid"/"invalid"/"internal_error". Every call site in block_validation_task.rs that built a ValidationResult::Invalid is refactored to use .into() so the dispatch is automatic. Concurrent-stage fan-in prefers an explicit Invalid over an InternalFailure (consensus rejection is the stronger signal). BlockTreeService::on_block_validation_finished handles InternalFailure explicitly: log clearly, keep the block in cache, emit BlockStateUpdated with discarded=false. The block can be re-validated on next gossip or when the underlying cause clears. Tests cover both From paths and the metric_label dispatch.

…payload ValidationResult::InternalFailure previously accepted any ValidationError, relying on a SAFETY-CRITICAL doc comment and the From impl to enforce the invariant that only locally-classified errors live inside it. Nothing prevented manually writing ValidationResult::InternalFailure(ValidationError::ShadowTransactionInvalid(_)) — which would tell BlockTreeService "this consensus failure is just an internal blip, keep the block in cache and retry" and is exactly the mis-classification the rest of this branch is designed to prevent. Replace the variant payload with a sealed wrapper `InternalFailureError` whose only field is private and whose only construction path is the existing `From<ValidationError> for ValidationResult` impl (which checks `is_internal_failure()` before wrapping). Direct construction outside the module is now a compile error. Consumers read the underlying error via `inner.err()` and the wrapper Displays through to the inner. Updates: 7 consumer sites in block_validation_task.rs (sub-variant matching for metric labels + fan-in clone), 1 in block_tree_service.rs (uses Display directly). New test confirming `err()` exposes the expected inner. Also clean up a doc-string imprecision flagged in review: ValidationError::is_internal_failure delegates to (not "mirrors") PreValidationError::is_internal_failure for the PreValidation arm.

…ed BlockBoundsLookupError

…ft was only produced by 3 tests; ShadowTxGenerationFailed had no other production producers; both deleted along with the classifier arm and test usages

… ShadowTxGenerationFailed

…U pop to confirmed-success boundaries Bug 1: the pre-write-lock cache-miss early-return suppressed `BlockStateUpdated` on a Valid arrival — symmetric with the discard-path bug fixed in `1f85702fc`. Fix: emit `BlockStateUpdated{discarded:false, height:0, state:NotOnchain(Unknown)}` before the early-return, replacing the LATENT comment that described the problem. Test: update `valid_result_with_cache_miss_preserves_lru` to also assert the broadcast fires (assertion b). Bug 2: the post-write-lock second cache check (concurrent-prune race) returned silently after `mark_block_as_valid` had already succeeded. Fix: drop the write lock and emit `BlockStateUpdated` before the early-return so downstream waiters don't hang. Deferred test comment explains why direct reproduction requires a test seam not yet available. Bug 3: the recovery LRU pop + metric tick + `info!` fired before the write lock, meaning a spurious Valid arrival (cache miss) or `mark_block_as_valid` failure would pop the LRU and increment the counter incorrectly. Fix: move all three side effects to the two confirmed-success boundaries (no-new-tip branch and final path). The Bug 2 concurrent-prune path deliberately does NOT pop the LRU. Also: add `RecallRangeStepsUnavailable` arm to `soft_internal_reason_tag` (new SoftInternal variant added by parallel agent on this branch); add matching test case and include it in the unique-per-variant coverage list.

…in no-new-tip branch; fix harness to seed ValidationScheduled state The previous commit set Bug 3 success boundary 2 (final path) but was missing boundary 1 (the no-new-tip early-return branch where mark_block_as_valid succeeded but get_earliest_not_onchain_in_longest_chain returned None). Add the recovery LRU pop + metric + info! log there. Also fix DiscardHarness::insert_block_in_cache to call mark_block_as_validation_scheduled after add_block so that on_block_validation_finished can reach mark_block_as_valid successfully in tests.

… checks

…ed_ingress_proofs call The parallel agent that added cache_sender to get_assigned_ingress_proofs missed the tx_selector call site. Block-production path does not have ServiceSenders so None is the correct argument; stale block_set hashes are validation-concern only.

…sing recovery path When get_ledger_range returns Ok(None) for a hash stored in CachedDataRoot.block_set (pruned side-fork or predecessor missing), the hash was never removed from the set. Every re-delivery of the block would re-encounter the same dead hash and produce AssignedProofBlockMissing, parking the block forever. Fix: add PruneBlockHashFromBlockSet to CacheServiceAction. Validation sends it on Ok(None); the cache service dispatches a background thread (spawn_prune_block_hash_task, mirroring spawn_prune_txids_task) to retain()-remove the hash from the DB record. No mutation from the validation path — the cache service owns all CachedDataRoots writes.

…e constant The const _: fn() = || { ... } pattern does not trigger the dead_code lint in Rust, so the #[expect] attribute was itself generating an unfulfilled_lint_expectations warning. Remove the attribute; the compile-time exhaustiveness check works regardless.

Reformat lines touched by the recent refactors: - Task C tripwire match arm + trailing comment - Task D nested map_err / ok_or_else chain - Task E removed-variant trailing blank lines and test rewrite

- cargo fmt fix in cache_service.rs test introduced by the PruneBlockHashFromBlockSet test (a8e6c16). - Commit packing_service proptest regression seed; the file itself documents that these should be checked in.

C1: fix comment in gossip_data_handler.rs — referenced handle_block (non-existent); correct name is handle_block_header. C2: fix MAX_CONCURRENT_CANCEL_RETRIES doc — "Past three" implied the cap was exclusive; code gives up AT three; say "At three". C3: replace unchecked as_millis() as u64 narrowing with u64::try_from(...).unwrap_or(u64::MAX) in two sites in execution_payload_cache.rs (wait_for_payload + test helper). C4: add test_config_uses_short_payload_wait_timeout to run_mode_tests so a silent revert of execution_payload_wait_timeout_millis back to the 60s prod default surfaces as a test failure rather than a multi-minute hang. C5: document in discard_and_broadcast that the LRU + metric are set only for the parent block_hash, not for children swept by the recursive remove_block — this is intentional. C6: remove two false claims about AbortHandle::abort() stopping a spawn_blocking thread / preserving the panic→TaskPanicked path in block_validation_task.rs; replace with accurate descriptions pointing at the poa_abort_slot.set(...) site for design rationale.

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

crates/actors/src/block_tree_service.rs (1)

1320-1365: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when remove_block fails.

If cache.remove_block(&block_hash) returns Err, this still falls through to a BlockStateUpdated { discarded: true, ... } broadcast below. That tells downstream consumers the block was discarded even though it may still be present in the cache, which can desynchronize retry/finalization logic.

Suggested fix

-        let cache_write_succeeded = if maybe_height.is_some() {
-            match cache.remove_block(&block_hash) {
-                Ok(()) => true,
-                Err(err) => {
-                    tracing::error!(block.hash = %block_hash, ?err, "Failed to remove block from cache");
-                    false
-                }
-            }
+        let cache_write_succeeded = if maybe_height.is_some() {
+            if let Err(err) = cache.remove_block(&block_hash) {
+                return Err(eyre::eyre!(
+                    "failed to remove discarded block {block_hash} from cache: {err}"
+                ));
+            }
+            true
         } else {
             tracing::debug!(block.hash = %block_hash, "Block already removed from cache (or never inserted; e.g. direct validation_service submission)");
             false
         };

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/actors/src/block_tree_service.rs` around lines 1320 - 1365, The code
currently logs errors from cache.remove_block(&block_hash) but continues to
broadcast BlockStateUpdated { discarded: true } even when the removal failed;
change the control flow in the function containing cache.remove_block so that on
Err(err) you fail fast: after logging the error (tracing::error!(...)), abort
processing for this block (e.g., return an Err or early-return/continue from the
caller) and do not construct or broadcast the BlockStateUpdated with discarded:
true; ensure this affects the cache_write_succeeded check and the subsequent
soft-internal LRU/metric logic so they only run when remove_block returned Ok.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/actors/src/cache_service.rs`:
- Around line 545-573: spawn_prune_block_hash_task currently spawns an unbounded
OS thread per prune request, leading to resource pressure; change it to enqueue
work on a serialized/bounded worker (reuse the existing
PruneTxidsFromCachedDataRoots queue/worker pattern) instead of calling
std::thread::spawn: create or use the component that processes a
single-threaded/prioritized queue, push a (data_root, block_hash) task into that
queue, and have the worker call the same db.update_eyre closure with
cached_data_root_by_data_root and db_tx.put::<CachedDataRoots> to perform the
prune and logging; ensure the queue is bounded/backpressured and that errors are
logged the same way as the current warn block.

In `@crates/actors/src/validation_service.rs`:
- Around line 424-429: The call to vdf_terminal_finalize_via(...) ignores its
bool return, so when the block-tree channel is closed the Invalid/ParentMissing
terminal verdicts get dropped without executing the existing fallback logic
(record_validation_finished / record_block_validation_error); update the
ParentMissing and other terminal arms (locations using vdf_terminal_finalize_via
in validation_service.rs, including the blocks around lines shown) to check the
returned bool and if false run the same fallback handling (call
record_validation_finished or record_block_validation_error as appropriate), or
refactor vdf_terminal_finalize_via to return/propagate its success and perform
the fallback internally so callers cannot ignore the failure.

---

Duplicate comments:
In `@crates/actors/src/block_tree_service.rs`:
- Around line 1320-1365: The code currently logs errors from
cache.remove_block(&block_hash) but continues to broadcast BlockStateUpdated {
discarded: true } even when the removal failed; change the control flow in the
function containing cache.remove_block so that on Err(err) you fail fast: after
logging the error (tracing::error!(...)), abort processing for this block (e.g.,
return an Err or early-return/continue from the caller) and do not construct or
broadcast the BlockStateUpdated with discarded: true; ensure this affects the
cache_write_succeeded check and the subsequent soft-internal LRU/metric logic so
they only run when remove_block returned Ok.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4ab39477-5fbd-4e69-b69d-d0c8b411f610

📥 Commits

Reviewing files that changed from the base of the PR and between d455fec and e53586a.

📒 Files selected for processing (13)

crates/actors/proptest-regressions/packing_service/mod.txt
crates/actors/src/block_tree_service.rs
crates/actors/src/block_validation.rs
crates/actors/src/cache_service.rs
crates/actors/src/shadow_tx_generator.rs
crates/actors/src/tx_selector/mod.rs
crates/actors/src/validation_service.rs
crates/actors/src/validation_service/active_validations.rs
crates/actors/src/validation_service/block_validation_task.rs
crates/domain/src/models/execution_payload_cache.rs
crates/p2p/src/block_pool.rs
crates/p2p/src/gossip_data_handler.rs
crates/types/src/config/node.rs

coderabbitai · 2026-05-21T12:11:49Z

+    fn spawn_prune_block_hash_task(&self, data_root: DataRoot, block_hash: H256) {
+        let clone = self.clone();
+        std::thread::spawn(move || {
+            let result = clone.db.update_eyre(|db_tx| {
+                let Some(mut cached) = cached_data_root_by_data_root(db_tx, data_root)? else {
+                    return Ok(());
+                };
+                let before_len = cached.block_set.len();
+                cached.block_set.retain(|h| h != &block_hash);
+                if cached.block_set.len() < before_len {
+                    debug!(
+                        data_root = %data_root,
+                        block_hash = %block_hash,
+                        "Pruned stale block hash from CachedDataRoot.block_set"
+                    );
+                    db_tx.put::<CachedDataRoots>(data_root, cached)?;
+                }
+                Ok(())
+            });
+            if let Err(e) = result {
+                warn!(
+                    data_root = %data_root,
+                    block_hash = %block_hash,
+                    "Failed to prune stale block hash from CachedDataRoot.block_set: {}",
+                    e
+                );
+            }
+        });
+    }


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Unbounded thread spawning in block-hash pruning path can cause resource pressure.

Line 1023 dispatches every prune request straight into spawn_prune_block_hash_task, and Line 547 creates a fresh OS thread per request. Unlike the pruning/epoch/txid paths, this has no queue/backpressure/completion gating, so bursts of stale-hash events can create unbounded concurrent threads.

Please route this action through a serialized queue (same pattern as PruneTxidsFromCachedDataRoots) or a bounded worker path.

Suggested direction

pub struct ChunkCacheService { ... + block_hash_prune_running: bool, + block_hash_prune_queue: VecDeque<(DataRoot, H256)>, } pub enum CacheServiceAction { ... + PruneBlockHashCompleted, } CacheServiceAction::PruneBlockHashFromBlockSet { data_root, block_hash } => { - self.cache_task.spawn_prune_block_hash_task(data_root, block_hash); + self.block_hash_prune_queue.push_back((data_root, block_hash)); + if !self.block_hash_prune_running + && let Some((dr, bh)) = self.block_hash_prune_queue.pop_front() + { + self.block_hash_prune_running = true; + self.cache_task.spawn_prune_block_hash_task(dr, bh); // send PruneBlockHashCompleted on finish + } }

Also applies to: 1023-1029

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/cache_service.rs` around lines 545 - 573, spawn_prune_block_hash_task currently spawns an unbounded OS thread per prune request, leading to resource pressure; change it to enqueue work on a serialized/bounded worker (reuse the existing PruneTxidsFromCachedDataRoots queue/worker pattern) instead of calling std::thread::spawn: create or use the component that processes a single-threaded/prioritized queue, push a (data_root, block_hash) task into that queue, and have the worker call the same db.update_eyre closure with cached_data_root_by_data_root and db_tx.put::<CachedDataRoots> to perform the prune and logging; ensure the queue is bounded/backpressured and that errors are logged the same way as the current warn block.

coderabbitai · 2026-05-21T12:11:49Z

+                                vdf_terminal_finalize_via(
+                                    &mut coordinator,
+                                    &self.inner.service_senders.block_tree,
+                                    hash,
+                                    ValidationError::VdfValidationFailed(vdf_error.to_string()),
+                                );


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle failed VDF terminal-result dispatches.

These terminal arms ignore the bool from vdf_terminal_finalize_via(). If the block-tree channel is closed, Invalid/ParentMissing verdicts are dropped without the fallback record_validation_finished / record_block_validation_error handling you already apply in the concurrent panic and retry-cap paths.

💡 Minimal follow-up

- vdf_terminal_finalize_via( + if !vdf_terminal_finalize_via( &mut coordinator, &self.inner.service_senders.block_tree, hash, ValidationError::VdfValidationFailed(vdf_error.to_string()), - ); + ) { + self.inner.chain_sync_state.record_validation_finished(&hash); + self.inner.chain_sync_state.record_block_validation_error( + format!("block={} error=vdf validation failed: {}", hash, vdf_error), + ); + }

Mirror the same fallback in the ParentMissing arm, or fold that fallback into vdf_terminal_finalize_via(...) so the helper cannot be ignored.

Also applies to: 455-461, 899-907

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/validation_service.rs` around lines 424 - 429, The call to vdf_terminal_finalize_via(...) ignores its bool return, so when the block-tree channel is closed the Invalid/ParentMissing terminal verdicts get dropped without executing the existing fallback logic (record_validation_finished / record_block_validation_error); update the ParentMissing and other terminal arms (locations using vdf_terminal_finalize_via in validation_service.rs, including the blocks around lines shown) to check the returned bool and if false run the same fallback handling (call record_validation_finished or record_block_validation_error as appropriate), or refactor vdf_terminal_finalize_via to return/propagate its success and perform the fallback internally so callers cannot ignore the failure.

…reason diagnostic clarity `classify_poa_join_error` previously returned `ValidationCancelReason::ChannelClosed` to mean "PoA spawn_blocking handle was externally aborted" (sibling stage produced a verdict first, runtime tear-down, etc.). Routing was correct — both are SoftInternal per `IS_INTERNAL = true` — but the shared variant conflated two distinct local-fault modes in operator log greps and the `validation_cancelled` metric label. Adds `ValidationCancelReason::PoAAborted` so logs and dashboards can tell apart "broadcast channel closed at shutdown" from "PoA blocking task aborted mid-flight". Updates the compile-time exhaustive-check tripwire, `Display`, the metric_label / is_node_fault loops, and both `rstest` enumerations. No routing/dispatch behaviour change. Addresses M1 in the branch review.

…e_fault mapping The prior `TODO(P2 follow-up): differentiate DB I/O from logic errors` at the `calculate_expired_ledger_fees` call site suggested splitting failures into typed DB-vs-logic sub-errors. On audit (2026-05-21) the function takes no peer-derived input — every parameter is locally derived (parent epoch snapshot, validator- computed block height, local block index, local mempool, local DB). The output is what the validator EXPECTS the peer's shadow txs to look like; peer comparison happens downstream. Every failure path is therefore a node-side fault: - MDBX I/O failure → NodeFault. - Local snapshot/index inconsistency → still NodeFault (two honest nodes cannot disagree on local snapshot state; same rationale as `SnapshotInvariant`). - Internal arithmetic over local data → NodeFault. No peer-attributable mode exists, so the blanket `node_fault` mapping is correct. Replaces the TODO with the audit rationale. Addresses M4 in the branch review (correction — finding was a false positive).

…re_waiters The test's name claims it pins the orphan invariant — that after `waiter_a` is dropped, the sender entry persists so `waiter_b` subscribes to the SAME notifier rather than racing a fresh one. The previous assertions only validated the delivery-time pop; if a future regression makes `Receiver::drop` evict the sender, the test would still pass (waiter_b subscribes to a freshly-created notifier, the test payload still wakes it). Adds an explicit `payload_senders.contains` assertion after `drop(waiter_a)` and before `waiter_b` subscribes, with a comment naming the regression class. Addresses M5 in the branch review.

The test helper does a single cache check before `block_receiver(...)`, then immediately enters `changed().await`. If a payload arrives between the cache check and the subscription, the wake-side has already popped + dropped the sender; the freshly-subscribed `Receiver` sees a fresh channel with no pending value and waits until timeout even though the payload is already sitting in `cache.payloads`. Mirrors the double cache check pattern that the production `wait_for_sealed_block` already uses (the inline comment there explicitly names the race), with a comment pointing back to that rationale.

`read_block_from_state` previously returned `ReceiverLagged` on any broadcast overflow. Under concurrent test load the channel can lag for reasons unrelated to `block_hash` (e.g. a different block producing a burst of events); returning eagerly turns successful validations into phantom `ReceiverLagged` outcomes in busy multi-node tests. On `Lagged(missed)`, re-poll `block_tree_guard.get_block_and_status(block_hash)` before surfacing the lag: * Terminal non-Scheduled state → return `StoredOnNode(chain_state)` (block survived; lag was unrelated). * Absent AND we'd previously seen it scheduled → return `NoDiscardEventObserved` (matches the outer-poll fall-through; we know it was discarded, just can't recover the error). * Still scheduled, or absent without prior scheduled sighting → fall through to `ReceiverLagged { missed }` (the discard event truly may have been among the missed messages).

Removes inline references to review-pass identifiers (H1-H6 / M1-M6 / L1-L8 / C1-C2 severity tags, R2/R7 review-round tags, "Bug 1/2/3" labels, "Post-M2" phrasing, dated "2026-05-20 audit" markers, and commit-hash citations like `c3fd8963a` / `1f85702fc` / `ca192a999` / `4a0d407cf` / `10bb8c222`). The information each tag encoded is preserved by describing the condition / bug class / invariant in plain terms. What stays (intentional cross-branch coordination): * `MERGE-BLOCKER(C1)`, `MERGE-PAIR(C1)`, `MERGE-NOTE(C2)` and the test sites that reference them. The merge agent uses the literal `C1` tag to grep for all sites tied to the off-tree PoA lookup fix gated on the `fix-cdr-block-set` merge. * `PHASE-B` operator-action label (drops the (H3) suffix). Also folds in two small follow-on tweaks that fit the topic: * Adds the `PoAAborted` rstest case to the `validation_cancelled_softinternal_skips_lru_and_counter` table and the `validation_cancelled_converts_per_reason` loop (closes the test-array coverage gap that the prior PoAAborted commit missed). * Re-words the `irys.block.soft_internal_discard_total` metric description to drop "Phase A of H3 instrumentation" in favour of a self-contained description of the recovery-gauge contract. No production-logic changes. All affected tests still pass.

…s peer-attributable The publish-ledger ingress-proof-reward underflow path previously upgraded to SnapshotTreasuryUnderflow (node fault → panic+restart) whenever prior expired-ledger payouts or commitment refunds had mutated treasury_balance. Since those snapshot-derived inputs are deterministic from canonical state, every honest validator with the same parent reaches the same running balance — so an underflow on a peer-supplied operand is uniformly peer-attributable, not snapshot-drift evidence. The conservative upgrade created a network-wide DoS surface: a single crafted epoch block could crash every validator simultaneously. Snapshot drift remains caught by the per-block `block.treasury != expected_treasury` comparison in block_validation.rs, so the loud-restart signal is not lost. Removes the now-unused `snapshot_tainted` field, inverts the existing regression test (renamed `..._is_snapshot_underflow` → `..._is_peer_attributable`), and updates the ShadowTxGenError variant docstrings to reflect that SnapshotTreasuryUnderflow is now constructed only for snapshot-derived operands (via deduct_from_treasury_for_payout).

The orphan-resolve branch in process_block sent AttemptReprocessingBlock raw whenever the parent was cached and not flagged is_processing, bypassing the per-block reprocessing cooldown that the gossip path enforces. A burst of orphan children for one parked parent could therefore enqueue N raw reprocess messages before the first flipped is_processing — defeating the rate guard added in dedup_and_maybe_emit_reprocess. Replace the is_block_processing-gated branch with a single call to the shared dedup/cooldown helper. KnownInFlight still suppresses the send, ParkedReadyForRetry still emits and stamps the cooldown, and ParkedCoolingDown now correctly suppresses the storm. Adds a regression test that drives the orphan-child path against a parent in the cooldown window and asserts no second emission.

…unter classify_poa_join_error's cancel branch already routes through the dedicated PoAAborted variant for log/diagnostic clarity, but the record_validation_cancellation counter — fired explicitly at the height-diff, parent-missing, and channel-closed sites — was never called on this path. Dashboards therefore could not attribute cancel volume to the PoA-abort lane. Add the counter call inside the cancel branch so the metric half of PoAAborted's split lands alongside the log half. Routing semantics are unchanged; same SoftInternal outcome.

The prior comment claimed snapshot drift would surface via the downstream block.treasury != expected_treasury comparison. That is only true when iteration completes far enough to reach the comparison — a snapshot-drifted balance below the peer-supplied operand underflows mid-iteration and mis-discards the block as Invalid on the drifted node alone, before the post-loop comparison runs. Rewrite the comment to acknowledge the gap while preserving the classification rationale: single-node self-DoS is strictly preferable to network-wide loud-restart on a crafted block, so the peer-attributable classification stays correct. No code change.

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread crates/actors/src/block_validation.rs Outdated

Comment thread crates/actors/src/validation_service/block_validation_task.rs Outdated

Comment thread crates/chain-tests/src/block_production/block_validation.rs Outdated

JesseTheRobot force-pushed the fix/validation-error-disamb branch from d300e7c to 3125039 Compare May 13, 2026 20:50

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread crates/actors/src/block_tree_service.rs Outdated

Comment thread crates/actors/src/validation_service/block_validation_task.rs

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

JesseTheRobot force-pushed the fix/validation-error-disamb branch from 9c13980 to d52dad3 Compare May 14, 2026 12:39

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

JesseTheRobot force-pushed the fix/validation-error-disamb branch from e508db6 to 3e913fb Compare May 14, 2026 14:40

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

Comment thread crates/actors/src/block_validation.rs Outdated

Comment thread crates/actors/src/block_validation.rs

Comment thread crates/actors/src/validation_service.rs Outdated

JesseTheRobot commented May 14, 2026

View reviewed changes

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

Comment thread crates/actors/src/block_validation.rs

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

JesseTheRobot force-pushed the fix/validation-error-disamb branch from 3ed807a to 425f23d Compare May 19, 2026 18:33

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread crates/actors/src/shadow_tx_generator.rs

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

JesseTheRobot added 5 commits May 20, 2026 22:54

JesseTheRobot added 13 commits May 21, 2026 10:47

refactor(validation): route parent-block DB lookup panics through typ…

690a396

…ed BlockBoundsLookupError

refactor(validation): remove dead ShadowTxGenError::Soft variant — So…

c63e8c9

…ft was only produced by 3 tests; ShadowTxGenerationFailed had no other production producers; both deleted along with the classifier arm and test usages

refactor(validation): clean up comment and test references to removed…

3852b24

… ShadowTxGenerationFailed

style(validation): fix clippy warnings introduced by Tasks B and C

48b27b0

style(validation): fix clippy assertions_on_constants for IS_INTERNAL…

fbefa6b

… checks

style(validation): cargo fmt fixes from Tasks C/D/E

85358b7

Reformat lines touched by the recent refactors: - Task C tripwire match arm + trailing comment - Task D nested map_err / ok_or_else chain - Task E removed-variant trailing blank lines and test rewrite

chore(validation): apply cargo fmt + check in proptest regression seed

3b1c61e

- cargo fmt fix in cache_service.rs test introduced by the PruneBlockHashFromBlockSet test (a8e6c16). - Commit packing_service proptest regression seed; the file itself documents that these should be checked in.

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

JesseTheRobot added 11 commits May 21, 2026 13:25

chore: fix typos flagged by typos lint

52b63b5

JesseTheRobot changed the base branch from master to fix/validation May 21, 2026 17:55

JesseTheRobot merged commit 2238fa3 into fix/validation May 21, 2026
18 checks passed

JesseTheRobot deleted the fix/validation-error-disamb branch May 21, 2026 19:34

JesseTheRobot mentioned this pull request May 21, 2026

fix: validation #1425

Merged

3 tasks

coderabbitai Bot mentioned this pull request May 26, 2026

feat: obs follow-ups and Reth metrics bind hardening #1426

Merged

3 tasks

Conversation

JesseTheRobot commented May 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JesseTheRobot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JesseTheRobot commented May 14, 2026

Uh oh!

coderabbitai Bot commented May 14, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

JesseTheRobot commented May 19, 2026

Uh oh!

JesseTheRobot commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JesseTheRobot commented May 20, 2026

Uh oh!

coderabbitai Bot commented May 20, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JesseTheRobot commented May 20, 2026

Uh oh!

coderabbitai Bot commented May 20, 2026

Uh oh!

JesseTheRobot commented May 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 13, 2026 •

edited

Loading