fix: validation error disambiguation#1416
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughSeparate internal/node-fault vs consensus validation outcomes; thread typed cancellations and payload-wait errors through validation tasks and services; centralize discard/broadcast with soft-internal LRU + recovery logging; switch payload-wait to watch-based notifiers; add P2P parked-block dedup/reprocess and many test/config updates. ChangesValidation core and API
Validation orchestration
Block tree behavior
Shadow tx, refunds, metrics
Domain & cache
P2P & block pool
Tests and call-site updates
Possibly related PRs
Suggested reviewers
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~45 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/actors/src/block_validation.rs`:
- Around line 1421-1441: The code currently falls through to get_block_bounds
when latest.ledgers has no entry for ledger, causing missing-ledger to be
treated as an internal BlockBoundsLookupError; change the early disambiguation
in the block_index_guard read (the block where latest.ledgers.iter().find(...)
is used) to explicitly handle the None case for the ledger: if the find(...)
returns None, return a consensus-invalid PreValidationError (either reuse
PreValidationError::PoAChunkOffsetOutOfBlockBounds or add a new variant like
PreValidationError::PoALedgerNotPresent) instead of continuing to
get_block_bounds; keep the existing check that returns
PoAChunkOffsetOutOfBlockBounds when ledger_chunk_offset >= chain_max for
Some(chain_max).
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 57-62: The validate_block path still re-reads the parent's epoch
snapshot from the block tree which reintroduces the eviction race; modify
validate_block to use the captured self.parent_epoch_snapshot (the
Arc<EpochSnapshot> field) everywhere it's needed (not only in
data_txs_are_valid) — specifically replace calls to
block_tree.get_epoch_snapshot()/similar lookups before PoA checks, shadow-tx
validation, and when computing cascade_active; derive cascade_active from the
captured snapshot via
config.consensus.hardforks.is_cascade_active_for_epoch(&self.parent_epoch_snapshot)
so the validator uses the same activation rule as the producer and avoids
ParentEpochSnapshotMissing races.
In `@crates/chain-tests/src/block_production/block_validation.rs`:
- Around line 436-450: The test currently computes cascade_active from
tree.canonical_epoch_snapshot(), causing divergence at epoch boundaries;
instead, derive cascade_active from the parent epoch snapshot you already
fetched: call
config_override.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch)
(or &parent_epoch_snapshot) after obtaining parent_epoch via
tree.get_epoch_snapshot(&parent_hash) and replace the use of
tree.canonical_epoch_snapshot() with this parent-based check so cascade_active
aligns with the parent_epoch_snapshot used elsewhere in the block validation
test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 39f6ce7f-60cf-46ee-8775-7fe15e27f992
📒 Files selected for processing (6)
crates/actors/src/block_tree_service.rscrates/actors/src/block_validation.rscrates/actors/src/validation_service.rscrates/actors/src/validation_service/block_validation_task.rscrates/chain-tests/src/block_production/block_validation.rscrates/chain-tests/src/validation/mod.rs
d300e7c to
3125039
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/actors/src/validation_service/block_validation_task.rs (1)
683-689:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDon't map every shadow-tx stage failure to
ShadowTransactionInvalid.
shadow_transactions_are_valid()returnseyre::Result, so this branch currently turns payload-cache misses, DB/view failures, and missing parent lookups intoValidationError::ShadowTransactionInvalid..into()then classifies all of them as consensus-invalid, which reintroduces peer-attribution for local failures.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/validation_service/block_validation_task.rs` around lines 683 - 689, The current match on shadow_tx_result treats any eyre::Result error as ShadowTransactionInvalid; instead inspect the underlying error returned by shadow_transactions_are_valid() (use err.downcast_ref::<...>() or pattern-match known error enums from the payload cache / DB / parent lookup) and only map genuine transaction-invalid errors to ValidationError::ShadowTransactionInvalid; map cache-miss / DB/view failures / missing-parent cases to a transient/internal validation result (e.g., ValidationError::InternalError or a transient variant you have) so local failures are not classified as consensus-invalid. Reference shadow_tx_result, shadow_transactions_are_valid(), and ValidationError::ShadowTransactionInvalid when updating the match arm.
♻️ Duplicate comments (2)
crates/actors/src/validation_service/block_validation_task.rs (1)
384-400:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftReuse one captured parent snapshot set across the whole validation pass.
validate_block()still re-reads the parent epoch snapshot before PoA and again before shadow/cascade activation, anddata_txs_are_valid()is still invoked without the parent snapshots/EMA. That means a valid block can still fail later withParentEpochSnapshotMissing/Local*SnapshotMissingif the parent is evicted after scheduling. This change reclassifies the race, but it does not eliminate it.Based on learnings: ensure the validator uses the same activation rule as the producer by deriving
cascade_activefrom theparent_epoch_snapshotviaconfig.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot).Also applies to: 477-499, 633-641
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/validation_service/block_validation_task.rs` around lines 384 - 400, The code currently re-reads the parent epoch snapshot multiple times in validate_block() (the get_epoch_snapshot call that assigns parent_epoch_snapshot) and then later before PoA and before shadow/cascade activation, which can race if the parent is evicted; change validate_block() to capture the parent_epoch_snapshot once (the value returned by block_tree_guard.read().get_epoch_snapshot(&block.previous_block_hash)) and reuse that same snapshot for all subsequent checks and for computing cascade_active using config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot); also pass this captured parent_epoch_snapshot (and derived EMA/activation flag) into data_txs_are_valid() instead of re-reading the snapshot; apply the same single-capture-and-reuse fix to the other occurrences referenced (the similar get_epoch_snapshot usages around the 477-499 and 633-641 regions).crates/actors/src/block_validation.rs (1)
1480-1495:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTreat an absent latest-ledger entry as consensus-invalid too.
This pre-check only rejects
Some(chain_max) if ledger_chunk_offset >= chain_max. Iflatest.ledgershas no entry forledger, the code still falls through toget_block_bounds(...), and that error is now classified asBlockBoundsLookupError/internal. A PoA targeting a ledger that is not present on this chain will therefore be retried instead of rejected.Suggested direction
let bb = { let index = block_index_guard.read(); - if let Some(latest) = index.get_latest_item() - && let Some(chain_max) = latest - .ledgers - .iter() - .find(|l| l.ledger == ledger) - .map(|l| l.total_chunks) - && ledger_chunk_offset >= chain_max - { - return Err(PreValidationError::PoAChunkOffsetOutOfBlockBounds); + if let Some(latest) = index.get_latest_item() { + match latest + .ledgers + .iter() + .find(|l| l.ledger == ledger) + .map(|l| l.total_chunks) + { + Some(chain_max) if ledger_chunk_offset >= chain_max => { + return Err(PreValidationError::PoAChunkOffsetOutOfBlockBounds); + } + None => { + return Err(PreValidationError::PoAChunkOffsetOutOfBlockBounds); + } + _ => {} + } } index .get_block_bounds(ledger, LedgerChunkOffset::from(ledger_chunk_offset)) .map_err(|e| PreValidationError::BlockBoundsLookupError(e.to_string()))? };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/block_validation.rs` around lines 1480 - 1495, The check currently only rejects when latest.ledgers contains the ledger and the offset is out-of-bounds, but if latest.ledgers has no entry for ledger we should also reject as consensus-invalid instead of falling through to get_block_bounds; modify the block starting with let index = block_index_guard.read(); to explicitly treat the case where latest exists but latest.ledgers.iter().find(|l| l.ledger == ledger) is None as Err(PreValidationError::PoAChunkOffsetOutOfBlockBounds) (i.e. return that error before calling index.get_block_bounds), keeping the existing early-return when ledger_chunk_offset >= chain_max and preserving the existing get_block_bounds(...) call for the remaining path. Use the symbols index, latest, latest.ledgers.iter().find(...), ledger, ledger_chunk_offset, and PreValidationError::PoAChunkOffsetOutOfBlockBounds to locate and change the logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/actors/src/block_tree_service.rs`:
- Around line 415-457: The InternalFailure branch in
on_block_validation_finished leaves the block marked as scheduled/validating and
never re-enqueues it, which can permanently stall the block and its descendants;
update this branch to clear the "validation scheduled" flag and re-schedule
validation for transient internal failures: after computing height/state (using
cache.get_block and cache.get_block_and_status as shown) call the routine that
clears the scheduled validation marker (e.g.,
clear_scheduled_validation(block_hash) or unset the scheduled flag in the cache
entry) and then enqueue the block for validation again (e.g.,
schedule_validation(block_hash) or push into the validation queue) so that
on_block_prevalidated will not short-circuit future retries; keep the existing
BlockStateUpdated event and logging, and ensure any cache writes use the same
locking/error handling pattern as elsewhere to avoid deadlocks.
---
Outside diff comments:
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 683-689: The current match on shadow_tx_result treats any
eyre::Result error as ShadowTransactionInvalid; instead inspect the underlying
error returned by shadow_transactions_are_valid() (use err.downcast_ref::<...>()
or pattern-match known error enums from the payload cache / DB / parent lookup)
and only map genuine transaction-invalid errors to
ValidationError::ShadowTransactionInvalid; map cache-miss / DB/view failures /
missing-parent cases to a transient/internal validation result (e.g.,
ValidationError::InternalError or a transient variant you have) so local
failures are not classified as consensus-invalid. Reference shadow_tx_result,
shadow_transactions_are_valid(), and ValidationError::ShadowTransactionInvalid
when updating the match arm.
---
Duplicate comments:
In `@crates/actors/src/block_validation.rs`:
- Around line 1480-1495: The check currently only rejects when latest.ledgers
contains the ledger and the offset is out-of-bounds, but if latest.ledgers has
no entry for ledger we should also reject as consensus-invalid instead of
falling through to get_block_bounds; modify the block starting with let index =
block_index_guard.read(); to explicitly treat the case where latest exists but
latest.ledgers.iter().find(|l| l.ledger == ledger) is None as
Err(PreValidationError::PoAChunkOffsetOutOfBlockBounds) (i.e. return that error
before calling index.get_block_bounds), keeping the existing early-return when
ledger_chunk_offset >= chain_max and preserving the existing
get_block_bounds(...) call for the remaining path. Use the symbols index,
latest, latest.ledgers.iter().find(...), ledger, ledger_chunk_offset, and
PreValidationError::PoAChunkOffsetOutOfBlockBounds to locate and change the
logic.
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 384-400: The code currently re-reads the parent epoch snapshot
multiple times in validate_block() (the get_epoch_snapshot call that assigns
parent_epoch_snapshot) and then later before PoA and before shadow/cascade
activation, which can race if the parent is evicted; change validate_block() to
capture the parent_epoch_snapshot once (the value returned by
block_tree_guard.read().get_epoch_snapshot(&block.previous_block_hash)) and
reuse that same snapshot for all subsequent checks and for computing
cascade_active using
config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot);
also pass this captured parent_epoch_snapshot (and derived EMA/activation flag)
into data_txs_are_valid() instead of re-reading the snapshot; apply the same
single-capture-and-reuse fix to the other occurrences referenced (the similar
get_epoch_snapshot usages around the 477-499 and 633-641 regions).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: dbf07fde-a385-4341-859b-b2525f1dc16b
📒 Files selected for processing (3)
crates/actors/src/block_tree_service.rscrates/actors/src/block_validation.rscrates/actors/src/validation_service/block_validation_task.rs
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/actors/src/validation_service/block_validation_task.rs (1)
684-688:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDon't collapse every shadow-tx failure into
ShadowTransactionInvalid.
shadow_transactions_are_valid()also fails for local/runtime conditions such as missing execution payloads, DB/block-tree lookup failures, and parent-snapshot eviction. Converting everyErrhere into a consensus rejection defeats the newInternalFailuresplit and can still discard valid blocks.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/validation_service/block_validation_task.rs` around lines 684 - 688, The current match on shadow_tx_result collapses all failures into ValidationError::ShadowTransactionInvalid; instead inspect the error returned by shadow_transactions_are_valid (shadow_tx_result) and match its specific variants so that transient/local/runtime failures (e.g. missing execution payloads, DB/block-tree lookup failures, parent-snapshot eviction) are returned as an InternalFailure variant while only true consensus-rejection errors map to ValidationError::ShadowTransactionInvalid; update the match arm handling Err(err) to distinguish error kinds (using the error enum/variants from shadow_transactions_are_valid or helper methods on the error) and return ValidationError::InternalFailure(...) for those local/runtime cases and ValidationError::ShadowTransactionInvalid(err.to_string()) only for genuine shadow-tx validation rejections.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 684-688: The current match on shadow_tx_result collapses all
failures into ValidationError::ShadowTransactionInvalid; instead inspect the
error returned by shadow_transactions_are_valid (shadow_tx_result) and match its
specific variants so that transient/local/runtime failures (e.g. missing
execution payloads, DB/block-tree lookup failures, parent-snapshot eviction) are
returned as an InternalFailure variant while only true consensus-rejection
errors map to ValidationError::ShadowTransactionInvalid; update the match arm
handling Err(err) to distinguish error kinds (using the error enum/variants from
shadow_transactions_are_valid or helper methods on the error) and return
ValidationError::InternalFailure(...) for those local/runtime cases and
ValidationError::ShadowTransactionInvalid(err.to_string()) only for genuine
shadow-tx validation rejections.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d715b1f8-4191-4ad7-9763-fbc8d2bef4f7
📒 Files selected for processing (3)
crates/actors/src/block_tree_service.rscrates/actors/src/block_validation.rscrates/actors/src/validation_service/block_validation_task.rs
9c13980 to
d52dad3
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/actors/src/validation_service/block_validation_task.rs (1)
530-539:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDon’t classify every shadow-tx error as consensus-invalid.
This stage still turns every
eyrefailure intoValidationResult::Invalid(ShadowTransactionInvalid). That includes non-consensus failures fromget_commitment_snapshot(...), execution-payload provider calls, and other runtime errors, so transient/internal failures will still discard blocks and the"shadow_tx"metric is forced into the"invalid"bucket. Have the shadow-tx stage return a typedValidationResult(or typed error) so only actual rule violations becomeInvalid; propagate infra/runtime failures asInternalFailure.Also applies to: 562-572, 706-712
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/validation_service/block_validation_task.rs` around lines 530 - 539, The shadow-tx validation code currently maps every eyre failure into ValidationResult::Invalid(ShadowTransactionInvalid) (e.g., the block_tree_guard.read().get_commitment_snapshot(...) call and similar calls later), which wrongly treats infra/runtime errors as consensus violations; change the shadow-tx stage to return a typed Result<ValidationResult, ShadowTxError> (or a ValidationResult enum that distinguishes Invalid(reason) vs InternalFailure(error)), update callers of validate_shadow_transactions / the shadow-tx stage to propagate InternalFailure for non-rule errors (including errors from get_commitment_snapshot, execution-payload provider calls, and other runtime failures) and only produce ValidationResult::Invalid(ShadowTransactionInvalid) for explicit consensus rule violations; ensure error construction preserves the original eyre error for InternalFailure so logs/metrics can classify transient infra failures separately.
♻️ Duplicate comments (1)
crates/actors/src/validation_service/block_validation_task.rs (1)
407-423:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCapture the parent epoch snapshot once and reuse it across validation.
validate_block()still re-reads the parent epoch snapshot from the live block tree for PoA and then again for shadow-tx/cascade. If that parent is pruned after scheduling, a valid block can still fail withParentEpochSnapshotMissing. Capture the snapshot once before spawning the stage tasks and reuse that sameArcfor PoA, shadow-tx validation, andcascade_active.Based on learnings: ensure the validator uses the same activation rule as the producer by deriving
cascade_activefrom theparent_epoch_snapshotviaconfig.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot).Also applies to: 500-522
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/validation_service/block_validation_task.rs` around lines 407 - 423, Capture the parent epoch snapshot once at the start of validate_block() by calling block_tree_guard.read().get_epoch_snapshot(&block.previous_block_hash) and storing the returned Arc as parent_epoch_snapshot (reuse the Some(snapshot) branch value) instead of re-reading it later; then pass that same parent_epoch_snapshot Arc into the PoA validation, shadow-tx/cascade validation tasks and derive cascade_active from it via config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot) so all stages use the identical snapshot and activation rule (replace subsequent calls that re-query the block tree with the stored parent_epoch_snapshot).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 530-539: The shadow-tx validation code currently maps every eyre
failure into ValidationResult::Invalid(ShadowTransactionInvalid) (e.g., the
block_tree_guard.read().get_commitment_snapshot(...) call and similar calls
later), which wrongly treats infra/runtime errors as consensus violations;
change the shadow-tx stage to return a typed Result<ValidationResult,
ShadowTxError> (or a ValidationResult enum that distinguishes Invalid(reason) vs
InternalFailure(error)), update callers of validate_shadow_transactions / the
shadow-tx stage to propagate InternalFailure for non-rule errors (including
errors from get_commitment_snapshot, execution-payload provider calls, and other
runtime failures) and only produce
ValidationResult::Invalid(ShadowTransactionInvalid) for explicit consensus rule
violations; ensure error construction preserves the original eyre error for
InternalFailure so logs/metrics can classify transient infra failures
separately.
---
Duplicate comments:
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 407-423: Capture the parent epoch snapshot once at the start of
validate_block() by calling
block_tree_guard.read().get_epoch_snapshot(&block.previous_block_hash) and
storing the returned Arc as parent_epoch_snapshot (reuse the Some(snapshot)
branch value) instead of re-reading it later; then pass that same
parent_epoch_snapshot Arc into the PoA validation, shadow-tx/cascade validation
tasks and derive cascade_active from it via
config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot)
so all stages use the identical snapshot and activation rule (replace subsequent
calls that re-query the block tree with the stored parent_epoch_snapshot).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 556dd5d7-e59a-4338-9fff-e6ba26158f23
📒 Files selected for processing (1)
crates/actors/src/validation_service/block_validation_task.rs
e508db6 to
3e913fb
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/actors/src/block_validation.rs`:
- Around line 1424-1442: Convert the table-driven tests to rstest parameterized
cases: change the test function validation_cancel_reason_classifier_dispatch to
use #[rstest] with a #[case] for each ValidationCancelReason variant
(HeightDifference, ParentMissing, ChannelClosed), accept a parameter named
reason, and run the two assertions inside the body
(assert!(!reason.is_internal()) and
assert!(!ValidationError::ValidationCancelled { reason
}.is_internal_failure())). Do the same refactor for the other table-driven test
referenced (lines 1448-1501) so each variant is a separate #[case], producing
per-case test failures and following the repo convention.
- Around line 1507-1510: Add a regression test that constructs a "latest index"
entry that omits the target PoA ledger and calls poa_is_valid() on that
constructed entry, asserting it returns
Err(PreValidationError::PoALedgerInactive { ledger_id: <expected> });
specifically, instead of only instantiating
PreValidationError::PoALedgerInactive directly, build the latest index entry
(using the same builders/helpers used elsewhere in this module) without adding
the target ledger so the validation path that previously produced
BlockBoundsLookupError is exercised, then call poa_is_valid() and assert it
yields the PoALedgerInactive variant.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6f9bf9bd-2063-4459-b850-c8d26a1510e4
📒 Files selected for processing (4)
crates/actors/src/block_tree_service.rscrates/actors/src/block_validation.rscrates/actors/src/validation_service.rscrates/actors/src/validation_service/block_validation_task.rs
JesseTheRobot
left a comment
There was a problem hiding this comment.
@coderabbitai full review
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
crates/actors/src/block_validation.rs (1)
1487-1505: 🧹 Nitpick | 🔵 TrivialConvert these manual classifier tables to
rstestcases.These tests are already table-driven;
#[rstest]will give per-case failures and keep them aligned with the repo's test conventions. As per coding guidelinesUse rstest for parameterized tests.Also applies to: 1580-1608, 1614-1638, 1654-1704
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/block_validation.rs` around lines 1487 - 1505, Replace the manual loop/table-style assertions in the test function validation_cancel_reason_classifier_dispatch with an rstest parameterized test: add #[rstest] and #[case(...)] entries for each ValidationCancelReason variant (e.g., HeightDifference, ParentMissing, ChannelClosed), accept a reason: ValidationCancelReason parameter, and run the two assertions against reason.is_internal() and ValidationError::ValidationCancelled { reason }.is_internal_failure(); do the same conversion for the other similar test blocks that check classifier tables so each case shows per-case failures under rstest.crates/actors/src/block_tree_service.rs (1)
435-485:⚠️ Potential issue | 🟠 MajorSoft
InternalFailurestill leaves the block permanently unschedulable.This path keeps the cache entry and returns without clearing the validation-scheduled state or providing another retry path. Because
on_block_prevalidated()already short-circuits when the block is still cached, a transient internal failure can strand the block — and any descendants waiting on it — until depth pruning eventually evicts it.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/block_tree_service.rs` around lines 435 - 485, The InternalFailure branch in on_block_validation_finished currently leaves the cache entry as-is which preserves the "validation-scheduled" state and prevents on_block_prevalidated from ever retrying; change this path to clear the scheduled/validating flag and enqueue the block for retry (or at least mark it as not-scheduled) instead of just returning. Specifically, in the ValidationResult::InternalFailure arm in on_block_validation_finished, update the cached block status (via cache.get_block_and_status / cache mutation helpers) to remove the validation-scheduled marker (or set state to a retryable state), update chain_sync_state.record_block_validation_error as needed, and emit the BlockStateUpdated event with the new state so service_senders.block_state_events consumers can retry; ensure this complements on_block_prevalidated's short-circuit logic so transient internal failures are retried rather than permanently stranded.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/actors/src/block_validation.rs`:
- Around line 699-725: The is_internal_failure() method currently treats all
unlisted ValidationError variants as consensus-invalid via the `_ => false`
fallback; change that fallback to conservative `_ => true` and/or add explicit
match arms returning true for the variants produced by runtime/local-dependent
helpers (the errors constructed by recall_recall_range_is_valid(),
shadow_transactions_are_valid(), submit_payload_to_reth(), etc.) so those
runtime/transport/VDF-related failure cases are classified as internal; edit the
match in is_internal_failure to return true for those specific variants (or flip
the default) rather than defaulting to false.
---
Duplicate comments:
In `@crates/actors/src/block_tree_service.rs`:
- Around line 435-485: The InternalFailure branch in
on_block_validation_finished currently leaves the cache entry as-is which
preserves the "validation-scheduled" state and prevents on_block_prevalidated
from ever retrying; change this path to clear the scheduled/validating flag and
enqueue the block for retry (or at least mark it as not-scheduled) instead of
just returning. Specifically, in the ValidationResult::InternalFailure arm in
on_block_validation_finished, update the cached block status (via
cache.get_block_and_status / cache mutation helpers) to remove the
validation-scheduled marker (or set state to a retryable state), update
chain_sync_state.record_block_validation_error as needed, and emit the
BlockStateUpdated event with the new state so service_senders.block_state_events
consumers can retry; ensure this complements on_block_prevalidated's
short-circuit logic so transient internal failures are retried rather than
permanently stranded.
In `@crates/actors/src/block_validation.rs`:
- Around line 1487-1505: Replace the manual loop/table-style assertions in the
test function validation_cancel_reason_classifier_dispatch with an rstest
parameterized test: add #[rstest] and #[case(...)] entries for each
ValidationCancelReason variant (e.g., HeightDifference, ParentMissing,
ChannelClosed), accept a reason: ValidationCancelReason parameter, and run the
two assertions against reason.is_internal() and
ValidationError::ValidationCancelled { reason }.is_internal_failure(); do the
same conversion for the other similar test blocks that check classifier tables
so each case shows per-case failures under rstest.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9e6a57e1-ab83-4181-b413-d2c607c47e4c
📒 Files selected for processing (6)
crates/actors/src/block_tree_service.rscrates/actors/src/block_validation.rscrates/actors/src/validation_service.rscrates/actors/src/validation_service/block_validation_task.rscrates/chain-tests/src/block_production/block_validation.rscrates/p2p/src/block_pool.rs
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
♻️ Duplicate comments (4)
crates/actors/src/validation_service.rs (1)
436-462:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse a cancellation result for aborted concurrent tasks.
Line 436 already distinguishes
e.is_cancelled(), but Lines 455-462 still report both branches asValidationError::TaskPanicked. That turns aborted tasks intoInternalFailure(TaskPanicked), which skews both diagnostics and the new invalid/internal-failure split. Route cancellations through the existing cancellation variant instead of the panic variant.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/validation_service.rs` around lines 436 - 462, The code treats both cancelled and panicked concurrent validation tasks as ValidationError::TaskPanicked; change the branch so when e.is_cancelled() is true you send the cancellation validation variant instead of TaskPanicked (i.e., use the existing cancellation enum variant — e.g. ValidationError::Cancelled or ValidationError::TaskCancelled — with the same task/details fields) in the call to self.send_validation_result, leaving the panic branch to continue sending ValidationError::TaskPanicked; update the error message/logging if needed so the sent result matches e.is_cancelled().crates/actors/src/block_tree_service.rs (1)
415-485:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftSoft
InternalFailurestill leaves the block unschedulable.This branch keeps the block in cache and returns without clearing
ValidationScheduledor re-queueing. Becauseon_block_prevalidatedexits early for cached blocks on Lines 296-303, neither re-gossip nor orphan resolution can retry validation until depth pruning evicts the entry. A transient internal failure can therefore stall the block and its descendants indefinitely.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/block_tree_service.rs` around lines 415 - 485, The InternalFailure branch in on_block_validation_finished leaves the cache entry in ValidationScheduled, which prevents on_block_prevalidated from retrying the block; update this branch to clear the ValidationScheduled marker (or change the block's status to a non-scheduled state) and/or re-enqueue the block for prevalidation so transient internal failures don't stall validation permanently: locate the ValidationResult::InternalFailure handling in on_block_validation_finished, after recording the error with chain_sync_state.record_block_validation_error and before returning, mutate the cache entry that matches block_hash to remove ValidationScheduled (or set state to NotOnchain/Unknown) and/or push the block back onto the prevalidation queue so on_block_prevalidated can attempt validation again; ensure the BlockStateUpdated event (BlockStateUpdated { ... validation_result }) reflects the new status and that service_senders.block_state_events still broadcasts the update.crates/actors/src/block_validation.rs (1)
703-739:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftKeep mixed runtime wrappers out of the consensus-invalid fallback.
RecallRangeInvalidandExecutionLayerFailedstill fall through_ => falsehere, but both construction paths can carry local/runtime failures:recall_recall_range_is_valid()can fail onsteps_guard.read().get_steps(...), andsubmit_payload_to_reth()still uses?for engine RPC/transport errors. In those cases a local outage is still reported asInvalid, so a valid block gets discarded instead of retained for retry/escalation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/block_validation.rs` around lines 703 - 739, The is_internal_failure() matcher currently treats RecallRangeInvalid and ExecutionLayerFailed as consensus-invalid (false) even though their inner causes can be local/runtime outages; update is_internal_failure() to inspect those variant payloads and return true when their contained error indicates a local/internal failure (e.g., errors produced by recall_recall_range_is_valid()’s steps_guard read/get_steps or engine/transport errors from submit_payload_to_reth()). Concretely, add arms for Self::RecallRangeInvalid { source, .. } and Self::ExecutionLayerFailed { source, .. } (or the exact field names in ValidationError) and call a helper on the inner error (or match its concrete error kinds) to return true for IO/transport/lock/read-related failures and false for genuine consensus-invalid reasons so that local outages are classified as InternalFailure instead of Invalid.crates/actors/src/validation_service/block_validation_task.rs (1)
424-440:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReuse the first parent epoch snapshot instead of re-reading it.
This task fetches the parent epoch snapshot for PoA and then fetches the same snapshot again before spawning the shadow/data stages. If the parent is evicted between those two reads, a block that already had a valid captured snapshot now fails with
ParentEpochSnapshotMissing. Clone the firstArc<EpochSnapshot>into the later tasks and only fetch the EMA snapshot in the second block.
Based on learnings: ensure the validator uses the same activation rule as the producer by derivingcascade_activefrom theparent_epoch_snapshotviaconfig.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot).Also applies to: 516-553
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/validation_service/block_validation_task.rs` around lines 424 - 440, The code currently re-reads the parent epoch snapshot (via block_tree_guard.read().get_epoch_snapshot(&block.previous_block_hash)) before spawning shadow/data stages which can fail if the snapshot is evicted; instead, clone the already obtained parent_epoch_snapshot Arc and pass that clone into the later tasks, remove the second get_epoch_snapshot call so only the EMA snapshot is fetched there, and compute cascade_active by calling config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot) (apply the same change to the analogous block around the 516-553 range). Ensure all references to the parent snapshot in spawn/shadow/data-stage code use the cloned parent_epoch_snapshot variable rather than re-reading from block_tree_guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@crates/actors/src/block_tree_service.rs`:
- Around line 415-485: The InternalFailure branch in
on_block_validation_finished leaves the cache entry in ValidationScheduled,
which prevents on_block_prevalidated from retrying the block; update this branch
to clear the ValidationScheduled marker (or change the block's status to a
non-scheduled state) and/or re-enqueue the block for prevalidation so transient
internal failures don't stall validation permanently: locate the
ValidationResult::InternalFailure handling in on_block_validation_finished,
after recording the error with chain_sync_state.record_block_validation_error
and before returning, mutate the cache entry that matches block_hash to remove
ValidationScheduled (or set state to NotOnchain/Unknown) and/or push the block
back onto the prevalidation queue so on_block_prevalidated can attempt
validation again; ensure the BlockStateUpdated event (BlockStateUpdated { ...
validation_result }) reflects the new status and that
service_senders.block_state_events still broadcasts the update.
In `@crates/actors/src/block_validation.rs`:
- Around line 703-739: The is_internal_failure() matcher currently treats
RecallRangeInvalid and ExecutionLayerFailed as consensus-invalid (false) even
though their inner causes can be local/runtime outages; update
is_internal_failure() to inspect those variant payloads and return true when
their contained error indicates a local/internal failure (e.g., errors produced
by recall_recall_range_is_valid()’s steps_guard read/get_steps or
engine/transport errors from submit_payload_to_reth()). Concretely, add arms for
Self::RecallRangeInvalid { source, .. } and Self::ExecutionLayerFailed { source,
.. } (or the exact field names in ValidationError) and call a helper on the
inner error (or match its concrete error kinds) to return true for
IO/transport/lock/read-related failures and false for genuine consensus-invalid
reasons so that local outages are classified as InternalFailure instead of
Invalid.
In `@crates/actors/src/validation_service.rs`:
- Around line 436-462: The code treats both cancelled and panicked concurrent
validation tasks as ValidationError::TaskPanicked; change the branch so when
e.is_cancelled() is true you send the cancellation validation variant instead of
TaskPanicked (i.e., use the existing cancellation enum variant — e.g.
ValidationError::Cancelled or ValidationError::TaskCancelled — with the same
task/details fields) in the call to self.send_validation_result, leaving the
panic branch to continue sending ValidationError::TaskPanicked; update the error
message/logging if needed so the sent result matches e.is_cancelled().
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 424-440: The code currently re-reads the parent epoch snapshot
(via block_tree_guard.read().get_epoch_snapshot(&block.previous_block_hash))
before spawning shadow/data stages which can fail if the snapshot is evicted;
instead, clone the already obtained parent_epoch_snapshot Arc and pass that
clone into the later tasks, remove the second get_epoch_snapshot call so only
the EMA snapshot is fetched there, and compute cascade_active by calling
config.consensus.hardforks.is_cascade_active_for_epoch(&parent_epoch_snapshot)
(apply the same change to the analogous block around the 516-553 range). Ensure
all references to the parent snapshot in spawn/shadow/data-stage code use the
cloned parent_epoch_snapshot variable rather than re-reading from
block_tree_guard.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0de83319-ef08-4782-b2a7-3ae735dc4776
📒 Files selected for processing (6)
crates/actors/src/block_tree_service.rscrates/actors/src/block_validation.rscrates/actors/src/validation_service.rscrates/actors/src/validation_service/block_validation_task.rscrates/chain-tests/src/block_production/block_validation.rscrates/p2p/src/block_pool.rs
3ed807a to
425f23d
Compare
|
@coderrabbitai full review |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/domain/src/models/block_index.rs (1)
253-293:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSparse ledger entries still break anchored lookups.
This method assumes the anchor block itself contains
ledger, and later treats missing probed ledgers as0. That only works for “ledger not introduced yet”. In this file,push_block()writes entries only for ledgers present inblock.data_ledgers, so if a later block omits an already-existing ledger, anchoring to that height will either fail early or binary-search with the wrong total. The anchored query needs the carried-forward total from the latest<= anchor_heightblock that has that ledger.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/domain/src/models/block_index.rs` around lines 253 - 293, The anchored lookup wrongly assumes the anchor block contains the ledger and treats missing ledgers during probes as total_chunks=0; instead, when computing anchor_max you must walk backwards from anchor_height to find the most recent block <= anchor_height that contains ledger and use its total_chunks (use block_index_item_by_height in a loop until found or zero), and likewise when probing in the binary search replace unwrap_or(0) behavior with a carry-forward lookup that finds the latest prior block <= mid that has the ledger (not just treating absent ledgers as 0); adjust uses of chunk_offset, anchor_height and ledger accordingly so anchors and mid probes use carried-forward totals consistent with how push_block writes only present ledgers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/actors/src/shadow_tx_generator.rs`:
- Around line 237-255: The check that detects a promoted publish transaction
also scheduled for a perm_fee refund should return a peer-structure error, not a
snapshot invariant; change the error constructor used in the loop that iterates
publish_ledger.txs and ledger_expiry_balance_delta.user_perm_fee_refunds from
ShadowTxGenError::SnapshotInvariant(...) to ShadowTxGenError::Structural(...) so
the condition is classified as a malformed block shape (consistent with the
surrounding comment and generate_expected_shadow_transactions validation
behavior).
---
Outside diff comments:
In `@crates/domain/src/models/block_index.rs`:
- Around line 253-293: The anchored lookup wrongly assumes the anchor block
contains the ledger and treats missing ledgers during probes as total_chunks=0;
instead, when computing anchor_max you must walk backwards from anchor_height to
find the most recent block <= anchor_height that contains ledger and use its
total_chunks (use block_index_item_by_height in a loop until found or zero), and
likewise when probing in the binary search replace unwrap_or(0) behavior with a
carry-forward lookup that finds the latest prior block <= mid that has the
ledger (not just treating absent ledgers as 0); adjust uses of chunk_offset,
anchor_height and ledger accordingly so anchors and mid probes use
carried-forward totals consistent with how push_block writes only present
ledgers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: de963991-3d42-46b3-961d-b9b80fd5121f
📒 Files selected for processing (22)
MAINNET_BETA.mdcrates/actors/src/block_producer.rscrates/actors/src/block_tree_service.rscrates/actors/src/block_validation.rscrates/actors/src/commitment_refunds.rscrates/actors/src/metrics.rscrates/actors/src/shadow_tx_generator.rscrates/actors/src/validation_service.rscrates/actors/src/validation_service/active_validations.rscrates/actors/src/validation_service/block_validation_task.rscrates/actors/tests/block_validation_tests.rscrates/chain-tests/src/block_production/block_validation.rscrates/chain-tests/src/multi_node/vdf_validation_progress.rscrates/chain-tests/src/utils.rscrates/chain-tests/src/validation/data_tx_pricing.rscrates/chain-tests/src/validation/poa_cases.rscrates/domain/src/models/block_index.rscrates/domain/src/models/execution_payload_cache.rscrates/p2p/src/block_pool.rscrates/p2p/src/block_status_provider.rscrates/p2p/src/tests/block_pool/mod.rscrates/p2p/src/tests/integration/mod.rs
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 7
♻️ Duplicate comments (1)
crates/actors/src/shadow_tx_generator.rs (1)
267-271:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClassify the promoted publish/refund overlap as
Structural.This guard is still rejecting malformed peer block shape, not corrupt local snapshot state. Returning
SnapshotInvarianthere routes the same input through the internal/node-fault path instead of consensus rejection for any caller that misses the earlier validation-site check.Minimal fix
- return Err(ShadowTxGenError::SnapshotInvariant(format!( + return Err(ShadowTxGenError::Structural(format!( "Transaction {} is in publish ledger but also has a perm_fee refund scheduled. \ Promoted transactions should not receive refunds.", tx.id )));🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/shadow_tx_generator.rs` around lines 267 - 271, The error returned when a transaction is both in the publish ledger and has a perm_fee refund should be classified as structural, not a snapshot invariant: update the return in shadow_tx_generator (where it currently returns ShadowTxGenError::SnapshotInvariant(...)) to return ShadowTxGenError::Structural(...) with the same descriptive message so malformed peer block shape is routed as consensus/structural rejection rather than an internal snapshot invariant failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/actors/src/validation_service/block_validation_task.rs`:
- Around line 856-895: The cancel path currently uses tokio::join! (stages_join)
which only becomes Ready when all six stage futures finish, so a
partially-completed stage (e.g., a NodeFault) is missed when cancellation fires;
fix by replacing the tokio::join!-based composite (stages_join) with a drainable
collection (e.g., futures::stream::FuturesUnordered or individually polled
futures) that yields per-stage completions, wire that into the select with
exit_if_block_is_too_old and, in the cancel branch after aborting poa via
poa_abort_slot, repeatedly poll/drain the collection to collect any Ready stage
results and then call merge_stage_results_with_cancel (using the same
StageOutcome variants: Joined, JoinedAfterCancel, Cancelled) so
already-completed NodeFaults are observed and merged before returning.
- Around line 561-584: The Err branch in the match that maps `res` to
`ValidationResult` currently treats all `Err(err)` as
`ValidationError::TaskPanicked`; update that branch to first check
`err.is_cancelled()` (the `JoinError` from the PoA `JoinHandle`) and if true
return `ValidationError::ValidationCancelled` (preserving a helpful `details`
string), otherwise preserve the existing panic handling and return
`ValidationError::TaskPanicked` (as currently done); refer to the match around
`let result: ValidationResult = match res { ... }`, the `AbortHandle::abort()`
usage that can produce cancelled `JoinError`, and the
`ValidationError::TaskPanicked` / `ValidationError::ValidationCancelled`
variants to locate where to implement the `is_cancelled()` check and mapping.
In `@crates/chain-tests/src/validation/data_tx_pricing.rs`:
- Around line 149-155: The test currently treats any
ValidationError::ShadowTransactionInvalid(_) as acceptable which allows
unrelated failures to pass; update both match arms that now use
ValidationError::ShadowTransactionInvalid(_) so they only accept the
shadow-validation variant(s) that indicate a perm-fee problem (i.e., replace the
wildcard with the concrete inner variant(s) that represent perm-fee errors from
the ShadowTransactionInvalid payload—for example
InsufficientPermFee/MissingPermFee-style variants) and apply the same narrowed
pattern to the other identical assertion so only perm-fee-related shadow
failures satisfy the test.
In `@crates/domain/src/models/execution_payload_cache.rs`:
- Around line 330-359: Wrap both the network request and the receiver await
under the same tokio::time::timeout so the configured wait_timeout applies to
the full request_payload_from_the_network(...) + receiver wait path (start the
Instant before calling request_payload_from_the_network and use
timeout(self.wait_timeout, async { request...; receiver.await })). On the
timeout branch clear the in-flight marker in
payloads_currently_requested_from_the_network for the given evm_block_hash and
also remove the orphaned sender from payload_senders (call pop(evm_block_hash))
so is_waiting_for_payload() becomes false; preserve returning
ExecutionPayloadWaitError::WaitTimeout with elapsed_ms.
- Around line 356-359: The numeric narrowing of started.elapsed().as_millis()
(u128) to u64 should use checked conversion instead of an as cast; replace
occurrences like the elapsed_ms: started.elapsed().as_millis() as u64 in the
ExecutionPayloadWaitError::WaitTimeout construction with a checked conversion
using u64::try_from(started.elapsed().as_millis()) and handle the Result (for
example with unwrap_or(u64::MAX) or propagating an error) and apply the same
change to all similar instances in this file.
In `@crates/p2p/src/block_status_provider.rs`:
- Around line 504-526: The test block_status_returns_in_tree_pending_validation
is missing the ChainState::NotOnchain(BlockState::ValidationScheduled) case —
add a new #[case(ChainState::NotOnchain(BlockState::ValidationScheduled))] to
the rstest parameterization so the test covers all four ChainState variants that
should map to InTreePendingValidation (refer to the function
block_status_returns_in_tree_pending_validation and the BlockState/ChainState
enums when adding the case).
In `@crates/types/src/config/node.rs`:
- Around line 731-740: The tests need to assert the new serialized default for
the execution payload timeout so refactors don't silently change behavior:
update the config serialization/legacy/testing tests that exercise defaults
(where other defaults are asserted) to include an assertion for the public field
execution_payload_wait_timeout_millis and/or its provider
default_execution_payload_wait_timeout_millis (expected value 60_000 or test
override 5_000 depending on the test case), and add equivalent checks in the
legacy/roundtrip config tests so both wire and runtime defaults are pinned.
---
Duplicate comments:
In `@crates/actors/src/shadow_tx_generator.rs`:
- Around line 267-271: The error returned when a transaction is both in the
publish ledger and has a perm_fee refund should be classified as structural, not
a snapshot invariant: update the return in shadow_tx_generator (where it
currently returns ShadowTxGenError::SnapshotInvariant(...)) to return
ShadowTxGenError::Structural(...) with the same descriptive message so malformed
peer block shape is routed as consensus/structural rejection rather than an
internal snapshot invariant failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0908b3aa-de0b-4649-ac73-8162f22a0de3
📒 Files selected for processing (25)
MAINNET_BETA.mdcrates/actors/src/block_producer.rscrates/actors/src/block_tree_service.rscrates/actors/src/block_validation.rscrates/actors/src/commitment_refunds.rscrates/actors/src/metrics.rscrates/actors/src/shadow_tx_generator.rscrates/actors/src/validation_service.rscrates/actors/src/validation_service/active_validations.rscrates/actors/src/validation_service/block_validation_task.rscrates/actors/tests/block_validation_tests.rscrates/chain-tests/src/block_production/block_validation.rscrates/chain-tests/src/multi_node/vdf_validation_progress.rscrates/chain-tests/src/utils.rscrates/chain-tests/src/validation/data_tx_pricing.rscrates/chain-tests/src/validation/poa_cases.rscrates/chain/src/chain.rscrates/domain/src/models/block_index.rscrates/domain/src/models/execution_payload_cache.rscrates/p2p/src/block_pool.rscrates/p2p/src/block_status_provider.rscrates/p2p/src/tests/block_pool/mod.rscrates/p2p/src/tests/integration/mod.rscrates/p2p/src/tests/util.rscrates/types/src/config/node.rs
| self.request_payload_from_the_network(*evm_block_hash, request_only_from_trusted_peers) | ||
| .await; | ||
| receiver.await.ok() | ||
| // Wrap the wait in `tokio::time::timeout` so a peer that advertises | ||
| // a block header but never serves the EVM payload cannot stall the | ||
| // validation task until the `payload_senders` LRU happens to evict | ||
| // our slot (essentially unbounded under low load). The bare | ||
| // `receiver.await` errors only when the matching oneshot sender is | ||
| // dropped (LRU eviction or explicit cache removal); surface that as | ||
| // `ReceiverDisrupted`, and surface a `timeout`-arm trip as the | ||
| // distinct `WaitTimeout` so operators can root-cause from logs. | ||
| // Both map to `ValidationError::ExecutionPayloadCacheEvicted` | ||
| // downstream — see `ExecutionPayloadWaitError` doc for the caller | ||
| // contract. | ||
| let started = std::time::Instant::now(); | ||
| match tokio::time::timeout(self.wait_timeout, receiver).await { | ||
| Ok(Ok(sealed_block)) => Ok(sealed_block), | ||
| Ok(Err(_)) => Err(ExecutionPayloadWaitError::ReceiverDisrupted { | ||
| evm_block_hash: *evm_block_hash, | ||
| }), | ||
| Err(_) => { | ||
| // Pop the now-orphaned sender entry so it cannot leak into | ||
| // the 1000-slot LRU and displace legitimate waiters. Same | ||
| // cleanup the cache-hit fast path performs above; receiver | ||
| // has been consumed by `timeout` so there's nothing else to | ||
| // drop here. | ||
| self.payload_senders.write().await.pop(evm_block_hash); | ||
| Err(ExecutionPayloadWaitError::WaitTimeout { | ||
| evm_block_hash: *evm_block_hash, | ||
| elapsed_ms: started.elapsed().as_millis() as u64, | ||
| }) |
There was a problem hiding this comment.
Make the timeout cover the full request+wait path, and clear the in-flight marker on expiry.
Right now request_payload_from_the_network(...).await can burn the full 10×5s retry budget before tokio::time::timeout(...) even starts, so this path can still block for roughly 110s. The timeout arm also leaves payloads_currently_requested_from_the_network populated, so is_waiting_for_payload() can remain true after the request loop and waiter have both already ended.
Suggested fix
- self.request_payload_from_the_network(*evm_block_hash, request_only_from_trusted_peers)
- .await;
+ let started = std::time::Instant::now();
+ self.request_payload_from_the_network(*evm_block_hash, request_only_from_trusted_peers)
+ .await;
+ let remaining = self.wait_timeout.saturating_sub(started.elapsed());
@@
- let started = std::time::Instant::now();
- match tokio::time::timeout(self.wait_timeout, receiver).await {
+ match tokio::time::timeout(remaining, receiver).await {
Ok(Ok(sealed_block)) => Ok(sealed_block),
Ok(Err(_)) => Err(ExecutionPayloadWaitError::ReceiverDisrupted {
evm_block_hash: *evm_block_hash,
}),
Err(_) => {
+ {
+ let mut cache = self.cache.write().await;
+ cache
+ .payloads_currently_requested_from_the_network
+ .pop(evm_block_hash);
+ }
self.payload_senders.write().await.pop(evm_block_hash);
Err(ExecutionPayloadWaitError::WaitTimeout {
evm_block_hash: *evm_block_hash,
elapsed_ms: started.elapsed().as_millis() as u64,
})🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/domain/src/models/execution_payload_cache.rs` around lines 330 - 359,
Wrap both the network request and the receiver await under the same
tokio::time::timeout so the configured wait_timeout applies to the full
request_payload_from_the_network(...) + receiver wait path (start the Instant
before calling request_payload_from_the_network and use
timeout(self.wait_timeout, async { request...; receiver.await })). On the
timeout branch clear the in-flight marker in
payloads_currently_requested_from_the_network for the given evm_block_hash and
also remove the orphaned sender from payload_senders (call pop(evm_block_hash))
so is_waiting_for_payload() becomes false; preserve returning
ExecutionPayloadWaitError::WaitTimeout with elapsed_ms.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
…lure Expand PreValidationError::is_internal_failure() beyond the bootstrap InternalTaskJoin entry to cover every variant whose construction sites are unambiguously local/runtime failures: - DatabaseError — every site wraps an MDBX db.tx()/view() failure - PreviousTxInclusionsFailed — block_tree channel/actor failure - AddBlockFailed, UpdateCacheForScheduledValidationError — local in-memory block-tree cache mutation - ValidationServiceUnreachable — local actor channel dead - SystemTimeError — OS clock failure - RewardCurveError — local arithmetic on height-derived inputs (peer-supplied reward is checked separately via RewardMismatch) - FeeCalculationFailed — local arithmetic on config inputs (per-tx comparisons use the Insufficient*Fee variants) - EmaSnapshotError — local snapshot computation, no real failure path These previously routed through block_pool as `CriticalBlockPoolError:: BlockError`, which is peer-attributed bogus data. With this change they route through OtherInternal (peer is innocent) and the cached block is retained for retry via the path added for InternalTaskJoin. Also delete two dead variants flagged during the audit (TransactionFetchFailed in PreValidationError and CommitmentTransactionFetchFailed in ValidationError) which have zero construction sites in the workspace. Variants still requiring source-side disambiguation before adding (handled in subsequent commits): BlockEmaSnapshotNotFound, ParentEpochSnapshotNotFound, BlockBoundsLookupError.
…nternal
BlockEmaSnapshotNotFound and ParentEpochSnapshotNotFound only fire from
one construction site each, inside data_txs_are_valid, when the parent
block's snapshot has fallen out of the in-memory block-tree window. This
is a race against eviction (parent existed at prevalidation time and was
verified there) — the block's validity is unknown locally, not bad.
The old names suggested a consensus failure ("not found"). Rename to
make the locality explicit and add doc comments capturing the race
semantics. Classify both as is_internal_failure so block_pool routes
them through OtherInternal (peer innocent) instead of BlockError.
Field renamed from `block_hash` to `parent_hash` so the variant body
unambiguously identifies which block's snapshot is missing.
BlockBoundsLookupError had three construction sites, one of which was overloaded: the PoA validation call to block_index.get_block_bounds could fail either because the peer's offset is past the chain (a consensus failure) or because of empty index / DB I/O / ledger missing (internal). Pre-check at the call site: read the latest block_index item's total_chunks for the ledger; if the peer-supplied ledger_chunk_offset is past it, reject as PoAChunkOffsetOutOfBlockBounds (existing consensus variant). Only after that pre-check do we call get_block_bounds, so any error it surfaces from this site is now unambiguously local. The other two construction sites (in get_assigned_ingress_proofs) feed block hashes already present in our own DB cache, so failures there are already pure DB consistency issues. Classify BlockBoundsLookupError as is_internal_failure now that every construction site is unambiguously local. Add a classifier test.
…a From Previously, every Err returned from the concurrent validation tasks (data_txs_are_valid, poa_is_valid, etc.) was wrapped into ValidationResult::Invalid by hand at the call site. That meant a local race — e.g. block_tree evicting the parent's snapshot mid-validation, surfacing as PreValidationError::LocalEmaSnapshotMissing — would mark the block as consensus-invalid and BlockTreeService would discard it. Make ValidationResult variant selection a single From impl: - ValidationError gains is_internal_failure(), delegating to PreValidationError::is_internal_failure() for the PreValidation variant and additionally flagging TaskPanicked, ValidationCancelled, ParentCommitmentSnapshotMissing, ParentEpochSnapshotMissing, ParentBlockMissing. - New ValidationResult::InternalFailure(ValidationError) variant. - From<ValidationError> for ValidationResult dispatches on the classifier; From<PreValidationError> chains through. - ValidationResult::metric_label() returns "valid"/"invalid"/"internal_error". Every call site in block_validation_task.rs that built a ValidationResult::Invalid is refactored to use .into() so the dispatch is automatic. Concurrent-stage fan-in prefers an explicit Invalid over an InternalFailure (consensus rejection is the stronger signal). BlockTreeService::on_block_validation_finished handles InternalFailure explicitly: log clearly, keep the block in cache, emit BlockStateUpdated with discarded=false. The block can be re-validated on next gossip or when the underlying cause clears. Tests cover both From paths and the metric_label dispatch.
…payload ValidationResult::InternalFailure previously accepted any ValidationError, relying on a SAFETY-CRITICAL doc comment and the From impl to enforce the invariant that only locally-classified errors live inside it. Nothing prevented manually writing ValidationResult::InternalFailure(ValidationError::ShadowTransactionInvalid(_)) — which would tell BlockTreeService "this consensus failure is just an internal blip, keep the block in cache and retry" and is exactly the mis-classification the rest of this branch is designed to prevent. Replace the variant payload with a sealed wrapper `InternalFailureError` whose only field is private and whose only construction path is the existing `From<ValidationError> for ValidationResult` impl (which checks `is_internal_failure()` before wrapping). Direct construction outside the module is now a compile error. Consumers read the underlying error via `inner.err()` and the wrapper Displays through to the inner. Updates: 7 consumer sites in block_validation_task.rs (sub-variant matching for metric labels + fan-in clone), 1 in block_tree_service.rs (uses Display directly). New test confirming `err()` exposes the expected inner. Also clean up a doc-string imprecision flagged in review: ValidationError::is_internal_failure delegates to (not "mirrors") PreValidationError::is_internal_failure for the PreValidation arm.
…ed BlockBoundsLookupError
…ft was only produced by 3 tests; ShadowTxGenerationFailed had no other production producers; both deleted along with the classifier arm and test usages
… ShadowTxGenerationFailed
…U pop to confirmed-success boundaries
Bug 1: the pre-write-lock cache-miss early-return suppressed `BlockStateUpdated`
on a Valid arrival — symmetric with the discard-path bug fixed in `1f85702fc`.
Fix: emit `BlockStateUpdated{discarded:false, height:0, state:NotOnchain(Unknown)}`
before the early-return, replacing the LATENT comment that described the problem.
Test: update `valid_result_with_cache_miss_preserves_lru` to also assert the
broadcast fires (assertion b).
Bug 2: the post-write-lock second cache check (concurrent-prune race) returned
silently after `mark_block_as_valid` had already succeeded. Fix: drop the write
lock and emit `BlockStateUpdated` before the early-return so downstream waiters
don't hang. Deferred test comment explains why direct reproduction requires a
test seam not yet available.
Bug 3: the recovery LRU pop + metric tick + `info!` fired before the write lock,
meaning a spurious Valid arrival (cache miss) or `mark_block_as_valid` failure
would pop the LRU and increment the counter incorrectly. Fix: move all three side
effects to the two confirmed-success boundaries (no-new-tip branch and final
path). The Bug 2 concurrent-prune path deliberately does NOT pop the LRU.
Also: add `RecallRangeStepsUnavailable` arm to `soft_internal_reason_tag` (new
SoftInternal variant added by parallel agent on this branch); add matching test
case and include it in the unique-per-variant coverage list.
…in no-new-tip branch; fix harness to seed ValidationScheduled state The previous commit set Bug 3 success boundary 2 (final path) but was missing boundary 1 (the no-new-tip early-return branch where mark_block_as_valid succeeded but get_earliest_not_onchain_in_longest_chain returned None). Add the recovery LRU pop + metric + info! log there. Also fix DiscardHarness::insert_block_in_cache to call mark_block_as_validation_scheduled after add_block so that on_block_validation_finished can reach mark_block_as_valid successfully in tests.
…ed_ingress_proofs call The parallel agent that added cache_sender to get_assigned_ingress_proofs missed the tx_selector call site. Block-production path does not have ServiceSenders so None is the correct argument; stale block_set hashes are validation-concern only.
…sing recovery path When get_ledger_range returns Ok(None) for a hash stored in CachedDataRoot.block_set (pruned side-fork or predecessor missing), the hash was never removed from the set. Every re-delivery of the block would re-encounter the same dead hash and produce AssignedProofBlockMissing, parking the block forever. Fix: add PruneBlockHashFromBlockSet to CacheServiceAction. Validation sends it on Ok(None); the cache service dispatches a background thread (spawn_prune_block_hash_task, mirroring spawn_prune_txids_task) to retain()-remove the hash from the DB record. No mutation from the validation path — the cache service owns all CachedDataRoots writes.
…e constant
The const _: fn() = || { ... } pattern does not trigger the dead_code
lint in Rust, so the #[expect] attribute was itself generating an
unfulfilled_lint_expectations warning. Remove the attribute; the
compile-time exhaustiveness check works regardless.
Reformat lines touched by the recent refactors: - Task C tripwire match arm + trailing comment - Task D nested map_err / ok_or_else chain - Task E removed-variant trailing blank lines and test rewrite
- cargo fmt fix in cache_service.rs test introduced by the PruneBlockHashFromBlockSet test (a8e6c16). - Commit packing_service proptest regression seed; the file itself documents that these should be checked in.
C1: fix comment in gossip_data_handler.rs — referenced handle_block
(non-existent); correct name is handle_block_header.
C2: fix MAX_CONCURRENT_CANCEL_RETRIES doc — "Past three" implied the
cap was exclusive; code gives up AT three; say "At three".
C3: replace unchecked as_millis() as u64 narrowing with
u64::try_from(...).unwrap_or(u64::MAX) in two sites in
execution_payload_cache.rs (wait_for_payload + test helper).
C4: add test_config_uses_short_payload_wait_timeout to run_mode_tests
so a silent revert of execution_payload_wait_timeout_millis back
to the 60s prod default surfaces as a test failure rather than a
multi-minute hang.
C5: document in discard_and_broadcast that the LRU + metric are set
only for the parent block_hash, not for children swept by the
recursive remove_block — this is intentional.
C6: remove two false claims about AbortHandle::abort() stopping a
spawn_blocking thread / preserving the panic→TaskPanicked path in
block_validation_task.rs; replace with accurate descriptions
pointing at the poa_abort_slot.set(...) site for design rationale.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
crates/actors/src/block_tree_service.rs (1)
1320-1365:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail fast when
remove_blockfails.If
cache.remove_block(&block_hash)returnsErr, this still falls through to aBlockStateUpdated { discarded: true, ... }broadcast below. That tells downstream consumers the block was discarded even though it may still be present in the cache, which can desynchronize retry/finalization logic.Suggested fix
- let cache_write_succeeded = if maybe_height.is_some() { - match cache.remove_block(&block_hash) { - Ok(()) => true, - Err(err) => { - tracing::error!(block.hash = %block_hash, ?err, "Failed to remove block from cache"); - false - } - } + let cache_write_succeeded = if maybe_height.is_some() { + if let Err(err) = cache.remove_block(&block_hash) { + return Err(eyre::eyre!( + "failed to remove discarded block {block_hash} from cache: {err}" + )); + } + true } else { tracing::debug!(block.hash = %block_hash, "Block already removed from cache (or never inserted; e.g. direct validation_service submission)"); false };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/actors/src/block_tree_service.rs` around lines 1320 - 1365, The code currently logs errors from cache.remove_block(&block_hash) but continues to broadcast BlockStateUpdated { discarded: true } even when the removal failed; change the control flow in the function containing cache.remove_block so that on Err(err) you fail fast: after logging the error (tracing::error!(...)), abort processing for this block (e.g., return an Err or early-return/continue from the caller) and do not construct or broadcast the BlockStateUpdated with discarded: true; ensure this affects the cache_write_succeeded check and the subsequent soft-internal LRU/metric logic so they only run when remove_block returned Ok.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/actors/src/cache_service.rs`:
- Around line 545-573: spawn_prune_block_hash_task currently spawns an unbounded
OS thread per prune request, leading to resource pressure; change it to enqueue
work on a serialized/bounded worker (reuse the existing
PruneTxidsFromCachedDataRoots queue/worker pattern) instead of calling
std::thread::spawn: create or use the component that processes a
single-threaded/prioritized queue, push a (data_root, block_hash) task into that
queue, and have the worker call the same db.update_eyre closure with
cached_data_root_by_data_root and db_tx.put::<CachedDataRoots> to perform the
prune and logging; ensure the queue is bounded/backpressured and that errors are
logged the same way as the current warn block.
In `@crates/actors/src/validation_service.rs`:
- Around line 424-429: The call to vdf_terminal_finalize_via(...) ignores its
bool return, so when the block-tree channel is closed the Invalid/ParentMissing
terminal verdicts get dropped without executing the existing fallback logic
(record_validation_finished / record_block_validation_error); update the
ParentMissing and other terminal arms (locations using vdf_terminal_finalize_via
in validation_service.rs, including the blocks around lines shown) to check the
returned bool and if false run the same fallback handling (call
record_validation_finished or record_block_validation_error as appropriate), or
refactor vdf_terminal_finalize_via to return/propagate its success and perform
the fallback internally so callers cannot ignore the failure.
---
Duplicate comments:
In `@crates/actors/src/block_tree_service.rs`:
- Around line 1320-1365: The code currently logs errors from
cache.remove_block(&block_hash) but continues to broadcast BlockStateUpdated {
discarded: true } even when the removal failed; change the control flow in the
function containing cache.remove_block so that on Err(err) you fail fast: after
logging the error (tracing::error!(...)), abort processing for this block (e.g.,
return an Err or early-return/continue from the caller) and do not construct or
broadcast the BlockStateUpdated with discarded: true; ensure this affects the
cache_write_succeeded check and the subsequent soft-internal LRU/metric logic so
they only run when remove_block returned Ok.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4ab39477-5fbd-4e69-b69d-d0c8b411f610
📒 Files selected for processing (13)
crates/actors/proptest-regressions/packing_service/mod.txtcrates/actors/src/block_tree_service.rscrates/actors/src/block_validation.rscrates/actors/src/cache_service.rscrates/actors/src/shadow_tx_generator.rscrates/actors/src/tx_selector/mod.rscrates/actors/src/validation_service.rscrates/actors/src/validation_service/active_validations.rscrates/actors/src/validation_service/block_validation_task.rscrates/domain/src/models/execution_payload_cache.rscrates/p2p/src/block_pool.rscrates/p2p/src/gossip_data_handler.rscrates/types/src/config/node.rs
| fn spawn_prune_block_hash_task(&self, data_root: DataRoot, block_hash: H256) { | ||
| let clone = self.clone(); | ||
| std::thread::spawn(move || { | ||
| let result = clone.db.update_eyre(|db_tx| { | ||
| let Some(mut cached) = cached_data_root_by_data_root(db_tx, data_root)? else { | ||
| return Ok(()); | ||
| }; | ||
| let before_len = cached.block_set.len(); | ||
| cached.block_set.retain(|h| h != &block_hash); | ||
| if cached.block_set.len() < before_len { | ||
| debug!( | ||
| data_root = %data_root, | ||
| block_hash = %block_hash, | ||
| "Pruned stale block hash from CachedDataRoot.block_set" | ||
| ); | ||
| db_tx.put::<CachedDataRoots>(data_root, cached)?; | ||
| } | ||
| Ok(()) | ||
| }); | ||
| if let Err(e) = result { | ||
| warn!( | ||
| data_root = %data_root, | ||
| block_hash = %block_hash, | ||
| "Failed to prune stale block hash from CachedDataRoot.block_set: {}", | ||
| e | ||
| ); | ||
| } | ||
| }); | ||
| } |
There was a problem hiding this comment.
Unbounded thread spawning in block-hash pruning path can cause resource pressure.
Line 1023 dispatches every prune request straight into spawn_prune_block_hash_task, and Line 547 creates a fresh OS thread per request. Unlike the pruning/epoch/txid paths, this has no queue/backpressure/completion gating, so bursts of stale-hash events can create unbounded concurrent threads.
Please route this action through a serialized queue (same pattern as PruneTxidsFromCachedDataRoots) or a bounded worker path.
Suggested direction
pub struct ChunkCacheService {
...
+ block_hash_prune_running: bool,
+ block_hash_prune_queue: VecDeque<(DataRoot, H256)>,
}
pub enum CacheServiceAction {
...
+ PruneBlockHashCompleted,
}
CacheServiceAction::PruneBlockHashFromBlockSet { data_root, block_hash } => {
- self.cache_task.spawn_prune_block_hash_task(data_root, block_hash);
+ self.block_hash_prune_queue.push_back((data_root, block_hash));
+ if !self.block_hash_prune_running
+ && let Some((dr, bh)) = self.block_hash_prune_queue.pop_front()
+ {
+ self.block_hash_prune_running = true;
+ self.cache_task.spawn_prune_block_hash_task(dr, bh); // send PruneBlockHashCompleted on finish
+ }
}Also applies to: 1023-1029
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/actors/src/cache_service.rs` around lines 545 - 573,
spawn_prune_block_hash_task currently spawns an unbounded OS thread per prune
request, leading to resource pressure; change it to enqueue work on a
serialized/bounded worker (reuse the existing PruneTxidsFromCachedDataRoots
queue/worker pattern) instead of calling std::thread::spawn: create or use the
component that processes a single-threaded/prioritized queue, push a (data_root,
block_hash) task into that queue, and have the worker call the same
db.update_eyre closure with cached_data_root_by_data_root and
db_tx.put::<CachedDataRoots> to perform the prune and logging; ensure the queue
is bounded/backpressured and that errors are logged the same way as the current
warn block.
| vdf_terminal_finalize_via( | ||
| &mut coordinator, | ||
| &self.inner.service_senders.block_tree, | ||
| hash, | ||
| ValidationError::VdfValidationFailed(vdf_error.to_string()), | ||
| ); |
There was a problem hiding this comment.
Handle failed VDF terminal-result dispatches.
These terminal arms ignore the bool from vdf_terminal_finalize_via(). If the block-tree channel is closed, Invalid/ParentMissing verdicts are dropped without the fallback record_validation_finished / record_block_validation_error handling you already apply in the concurrent panic and retry-cap paths.
💡 Minimal follow-up
- vdf_terminal_finalize_via(
+ if !vdf_terminal_finalize_via(
&mut coordinator,
&self.inner.service_senders.block_tree,
hash,
ValidationError::VdfValidationFailed(vdf_error.to_string()),
- );
+ ) {
+ self.inner.chain_sync_state.record_validation_finished(&hash);
+ self.inner.chain_sync_state.record_block_validation_error(
+ format!("block={} error=vdf validation failed: {}", hash, vdf_error),
+ );
+ }Mirror the same fallback in the ParentMissing arm, or fold that fallback into vdf_terminal_finalize_via(...) so the helper cannot be ignored.
Also applies to: 455-461, 899-907
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/actors/src/validation_service.rs` around lines 424 - 429, The call to
vdf_terminal_finalize_via(...) ignores its bool return, so when the block-tree
channel is closed the Invalid/ParentMissing terminal verdicts get dropped
without executing the existing fallback logic (record_validation_finished /
record_block_validation_error); update the ParentMissing and other terminal arms
(locations using vdf_terminal_finalize_via in validation_service.rs, including
the blocks around lines shown) to check the returned bool and if false run the
same fallback handling (call record_validation_finished or
record_block_validation_error as appropriate), or refactor
vdf_terminal_finalize_via to return/propagate its success and perform the
fallback internally so callers cannot ignore the failure.
…reason diagnostic clarity `classify_poa_join_error` previously returned `ValidationCancelReason::ChannelClosed` to mean "PoA spawn_blocking handle was externally aborted" (sibling stage produced a verdict first, runtime tear-down, etc.). Routing was correct — both are SoftInternal per `IS_INTERNAL = true` — but the shared variant conflated two distinct local-fault modes in operator log greps and the `validation_cancelled` metric label. Adds `ValidationCancelReason::PoAAborted` so logs and dashboards can tell apart "broadcast channel closed at shutdown" from "PoA blocking task aborted mid-flight". Updates the compile-time exhaustive-check tripwire, `Display`, the metric_label / is_node_fault loops, and both `rstest` enumerations. No routing/dispatch behaviour change. Addresses M1 in the branch review.
…e_fault mapping
The prior `TODO(P2 follow-up): differentiate DB I/O from logic errors` at the
`calculate_expired_ledger_fees` call site suggested splitting failures into typed
DB-vs-logic sub-errors. On audit (2026-05-21) the function takes no peer-derived
input — every parameter is locally derived (parent epoch snapshot, validator-
computed block height, local block index, local mempool, local DB). The output is
what the validator EXPECTS the peer's shadow txs to look like; peer comparison
happens downstream.
Every failure path is therefore a node-side fault:
- MDBX I/O failure → NodeFault.
- Local snapshot/index inconsistency → still NodeFault (two honest nodes cannot
disagree on local snapshot state; same rationale as `SnapshotInvariant`).
- Internal arithmetic over local data → NodeFault.
No peer-attributable mode exists, so the blanket `node_fault` mapping is correct.
Replaces the TODO with the audit rationale.
Addresses M4 in the branch review (correction — finding was a false positive).
…re_waiters The test's name claims it pins the orphan invariant — that after `waiter_a` is dropped, the sender entry persists so `waiter_b` subscribes to the SAME notifier rather than racing a fresh one. The previous assertions only validated the delivery-time pop; if a future regression makes `Receiver::drop` evict the sender, the test would still pass (waiter_b subscribes to a freshly-created notifier, the test payload still wakes it). Adds an explicit `payload_senders.contains` assertion after `drop(waiter_a)` and before `waiter_b` subscribes, with a comment naming the regression class. Addresses M5 in the branch review.
The test helper does a single cache check before `block_receiver(...)`, then immediately enters `changed().await`. If a payload arrives between the cache check and the subscription, the wake-side has already popped + dropped the sender; the freshly-subscribed `Receiver` sees a fresh channel with no pending value and waits until timeout even though the payload is already sitting in `cache.payloads`. Mirrors the double cache check pattern that the production `wait_for_sealed_block` already uses (the inline comment there explicitly names the race), with a comment pointing back to that rationale.
`read_block_from_state` previously returned `ReceiverLagged` on any broadcast
overflow. Under concurrent test load the channel can lag for reasons unrelated
to `block_hash` (e.g. a different block producing a burst of events); returning
eagerly turns successful validations into phantom `ReceiverLagged` outcomes in
busy multi-node tests.
On `Lagged(missed)`, re-poll `block_tree_guard.get_block_and_status(block_hash)`
before surfacing the lag:
* Terminal non-Scheduled state → return `StoredOnNode(chain_state)` (block
survived; lag was unrelated).
* Absent AND we'd previously seen it scheduled → return
`NoDiscardEventObserved` (matches the outer-poll fall-through; we know it
was discarded, just can't recover the error).
* Still scheduled, or absent without prior scheduled sighting → fall through
to `ReceiverLagged { missed }` (the discard event truly may have been
among the missed messages).
Removes inline references to review-pass identifiers (H1-H6 / M1-M6 / L1-L8 /
C1-C2 severity tags, R2/R7 review-round tags, "Bug 1/2/3" labels, "Post-M2"
phrasing, dated "2026-05-20 audit" markers, and commit-hash citations like
`c3fd8963a` / `1f85702fc` / `ca192a999` / `4a0d407cf` / `10bb8c222`). The
information each tag encoded is preserved by describing the condition / bug
class / invariant in plain terms.
What stays (intentional cross-branch coordination):
* `MERGE-BLOCKER(C1)`, `MERGE-PAIR(C1)`, `MERGE-NOTE(C2)` and the test sites
that reference them. The merge agent uses the literal `C1` tag to grep
for all sites tied to the off-tree PoA lookup fix gated on the
`fix-cdr-block-set` merge.
* `PHASE-B` operator-action label (drops the (H3) suffix).
Also folds in two small follow-on tweaks that fit the topic:
* Adds the `PoAAborted` rstest case to the
`validation_cancelled_softinternal_skips_lru_and_counter` table and the
`validation_cancelled_converts_per_reason` loop (closes the test-array
coverage gap that the prior PoAAborted commit missed).
* Re-words the `irys.block.soft_internal_discard_total` metric description
to drop "Phase A of H3 instrumentation" in favour of a self-contained
description of the recovery-gauge contract.
No production-logic changes. All affected tests still pass.
…s peer-attributable The publish-ledger ingress-proof-reward underflow path previously upgraded to SnapshotTreasuryUnderflow (node fault → panic+restart) whenever prior expired-ledger payouts or commitment refunds had mutated treasury_balance. Since those snapshot-derived inputs are deterministic from canonical state, every honest validator with the same parent reaches the same running balance — so an underflow on a peer-supplied operand is uniformly peer-attributable, not snapshot-drift evidence. The conservative upgrade created a network-wide DoS surface: a single crafted epoch block could crash every validator simultaneously. Snapshot drift remains caught by the per-block `block.treasury != expected_treasury` comparison in block_validation.rs, so the loud-restart signal is not lost. Removes the now-unused `snapshot_tainted` field, inverts the existing regression test (renamed `..._is_snapshot_underflow` → `..._is_peer_attributable`), and updates the ShadowTxGenError variant docstrings to reflect that SnapshotTreasuryUnderflow is now constructed only for snapshot-derived operands (via deduct_from_treasury_for_payout).
The orphan-resolve branch in process_block sent AttemptReprocessingBlock raw whenever the parent was cached and not flagged is_processing, bypassing the per-block reprocessing cooldown that the gossip path enforces. A burst of orphan children for one parked parent could therefore enqueue N raw reprocess messages before the first flipped is_processing — defeating the rate guard added in dedup_and_maybe_emit_reprocess. Replace the is_block_processing-gated branch with a single call to the shared dedup/cooldown helper. KnownInFlight still suppresses the send, ParkedReadyForRetry still emits and stamps the cooldown, and ParkedCoolingDown now correctly suppresses the storm. Adds a regression test that drives the orphan-child path against a parent in the cooldown window and asserts no second emission.
…unter classify_poa_join_error's cancel branch already routes through the dedicated PoAAborted variant for log/diagnostic clarity, but the record_validation_cancellation counter — fired explicitly at the height-diff, parent-missing, and channel-closed sites — was never called on this path. Dashboards therefore could not attribute cancel volume to the PoA-abort lane. Add the counter call inside the cancel branch so the metric half of PoAAborted's split lands alongside the log half. Routing semantics are unchanged; same SoftInternal outcome.
The prior comment claimed snapshot drift would surface via the downstream block.treasury != expected_treasury comparison. That is only true when iteration completes far enough to reach the comparison — a snapshot-drifted balance below the peer-supplied operand underflows mid-iteration and mis-discards the block as Invalid on the drifted node alone, before the post-loop comparison runs. Rewrite the comment to acknowledge the gap while preserving the classification rationale: single-node self-DoS is strictly preferable to network-wide loud-restart on a crafted block, so the peer-attributable classification stays correct. No code change.
Describe the changes
A clear and concise description of what the bug fix, feature, or improvement is.
Related Issue(s)
Please link to the issue(s) that will be closed with this PR.
Checklist
Additional Context
Add any other context about the pull request here.
Summary by CodeRabbit
Bug Fixes
New Checks
Metrics
New Features
Tests
Documentation