Skip to content

test: stability#1142

Merged
roberts-pumpurs merged 93 commits into
masterfrom
rob/test-stability
Feb 25, 2026
Merged

test: stability#1142
roberts-pumpurs merged 93 commits into
masterfrom
rob/test-stability

Conversation

@roberts-pumpurs

@roberts-pumpurs roberts-pumpurs commented Feb 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Overhauls node shutdown, startup, and internal synchronization to eliminate hangs and race conditions — both in production and tests.

Shutdown

  • Async-first: replaced blocking JoinHandle joins with oneshot::Receiver channels for reth, VDF, and actor threads
  • Bounded timeouts on every path: API 5s, Gossip 5s, VDF 10s, Actors 25s, Reth 60s, watchdog 45s (was 30s)
  • Orphan cleanup: stop_and_join_with_timeout() on TokioServiceHandle aborts tasks that exceed their budget
  • DB lock draining: waits up to 5s for DatabaseProvider Arc refs to drop before returning from stop(), preventing MDBX lock contention on restart
  • Top-level CancellationToken propagates to VDF, backfill, and actors for coordinated cancellation

Startup

  • Two-phase init: bind_listeners() then new_with_listeners() — TCP ports are known before DB/config creation
  • Event-driven sync: block state events subscribed before initial_sync(); auto-stake uses broadcast quiescence detection instead of sleep(2s)

P2P

  • Gossip broadcast task detects early exit and skips redundant stop — prevents error spam during shutdown

Test robustness

  • Updated wait helpers
  • send_block_and_read_state() subscribes before sending to prevent validation races
  • When we wait for sync / reorg / etc we enable the vdf on the node so that block validatoin can progress.
  • Fork recovery uses cumulative difficulty comparison instead of block-count assumptions
  • New #[diag_slow] proc macro emits periodic warnings for hung async operations in CI
  • Dead code cleanup across test files

Summary by CodeRabbit

  • Documentation

    • Added Codex second‑opinion workflow, flaky‑test debugging guide, and refreshed integration/test authoring guidance.
  • Improvements

    • Event‑driven, more deterministic test waits and unified validation path.
    • Token‑driven asynchronous shutdown with longer shutdown budgets and per‑service timeouts.
    • New diagnostic attribute that emits periodic runtime warnings for long‑running operations.
  • Tests

    • Wide test renames, new wait/utility helpers, and richer diagnostics to reduce flakiness.
  • Chores

    • Added a new workspace crate and disabled automatic test retries.

@coderabbitai

coderabbitai Bot commented Feb 16, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Refactors node shutdown to a token-driven async model with per-subsystem completion receivers; adds a proc-macro diag_slow; migrates many tests to event-driven wait helpers and unified validation helpers; expands test utilities and diagnostics; adds workspace proc-macro crate and small manifest/runtime tweaks. (43 words)

Changes

Cohort / File(s) Summary
Shutdown & Core Runtime
crates/chain/src/chain.rs, crates/chain/src/main.rs, crates/vdf/src/vdf.rs, crates/types/src/arbiter_handle.rs, crates/utils/utils/src/shutdown.rs
Replace JoinHandle joins with CancellationToken + oneshot completion receivers; thread shutdown through token; add per-subsystem completion receivers and timeouts; remove CloneableJoinHandle; add stop_with_timeout; increase GRACEFUL_SHUTDOWN_TIMEOUT (30→45s).
Diagnostics Macro Crate
crates/macros/diag-slow/Cargo.toml, crates/macros/diag-slow/src/lib.rs
Add new proc-macro crate irys-macros-diag-slow and #[diag_slow(...)] attribute that emits periodic diagnostics for long-running async functions.
Test utilities & event-driven waits (huge)
crates/chain/tests/utils.rs, crates/chain/tests/*
Introduce many wait helpers (wait_for_block_event/at_height/in_index/containing_tx, wait_for_evm_block, wait_for_tx_included, wait_for_sm_hashes, wait_for_reorg_subscribed, future_or_mine_on_timeout); change read_block_from_state to accept external event receiver; add diag_slow instrumentation; expose new public helpers used across tests.
Validation helpers & tests
crates/chain/tests/validation/mod.rs, various crates/chain/tests/validation/*
Add send_block_and_read_state and replace prior send/read two-step patterns in validation tests; update imports and assertions to use unified event-driven validation helper.
Block-production, index & gossip
crates/chain/tests/block_production/*, crates/chain/tests/multi_node/*, crates/p2p/src/gossip_service.rs, crates/actors/src/block_tree_service.rs
Adopt event-driven waits and subscribe-before-sync patterns; add gossip propagation/wait steps; capture early-exit gossip broadcast results; ignore BlockValidationFinished during shutdown drain in block-tree service.
Test rename & synchronization sweep
many crates/chain/tests/*, crates/p2p/src/tests/integration/mod.rs
Mass renames of tests (heavy/slow prefixes), switch test attributes to test_log::test where applied, and update call sites to new wait APIs (wait_for_block_at_height, wait_for_block_event, wait_for_block_in_index, wait_for_evm_block).
EVM/provider → wait_for_evm_block migration
multiple crates/chain/tests/*, crates/chain/tests/external/*
Replace direct provider block lookups with wait_for_evm_block / index-wait helpers to avoid races when reading EVM transactions/state.
Promotion / data-sync diagnostics & helpers
crates/chain/tests/promotion/*, crates/chain/tests/data_sync/*
Add slot-based sync expectations, diagnostics, mine_until_commitment_included, assert_shadow_tx_in_block, and wrappers like future_or_mine_on_timeout to make sync checks deterministic and more debuggable.
Workspace & manifests
Cargo.toml, crates/chain/Cargo.toml, crates/vdf/Cargo.toml, crates/macros/diag-slow/Cargo.toml, .config/nextest.toml
Add crates/macros/diag-slow workspace member and irys-macros-diag-slow workspace dependency; add quote and tokio-util; set nextest profile retries 1→0.
Misc runtime & p2p tweaks
crates/p2p/src/gossip_service.rs, crates/actors/src/services.rs, crates/actors/src/block_tree_service.rs
Add wait_until_broadcast_idle helper; capture gossip broadcast early-exit result; adjust block-tree shutdown drain behavior and error mapping for channel closes.

Sequence Diagram(s)

sequenceDiagram
    participant Main as Main (startup/stop)
    participant InitServices as init_services_thread<br/>oneshot rx
    participant InitReth as init_reth_thread<br/>oneshot rx
    participant InitVDF as init_vdf_thread<br/>oneshot rx
    participant Subsys as Subsystems (Actor / Reth / VDF)
    participant Stop as stop()/Shutdown

    Main->>Main: create CancellationToken (shutdown_token)
    Main->>InitServices: start(token)
    Main->>InitReth: start(token)
    Main->>InitVDF: start(token)

    InitServices-->>Main: return actor_done_rx (oneshot)
    InitReth-->>Main: return reth_done_rx (oneshot)
    InitVDF-->>Main: return vdf_done_rx (oneshot)

    Note over Main,Subsys: runtime operates

    Main->>Stop: call stop()
    Stop->>Main: token.cancel()
    Main->>Subsys: subsystems observe token cancelled
    Subsys-->>Main: signal completion via respective oneshot receivers

    Main->>Main: await reth_done_rx (with timeout)
    Main->>Main: await vdf_done_rx (with timeout)
    Main->>Main: await actor_done_rx (with timeout)
    Main->>Main: finalize resource release
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • DanMacDonald
  • antouhou
  • JesseTheRobot
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'test: stability' is vague and generic, using non-descriptive terms that don't meaningfully convey the scope or intent of this substantial changeset. Consider a more specific title that highlights the main architectural change, e.g., 'refactor: async token-driven shutdown and test synchronization improvements' or 'chore: eliminate shutdown hangs and test flakiness via cancellation tokens and event-driven waits'.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch rob/test-stability

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@roberts-pumpurs roberts-pumpurs self-assigned this Feb 16, 2026
roberts-pumpurs and others added 27 commits February 16, 2026 16:50
Previously, when the actor main thread received a `None` shutdown reason
(e.g. if the reth thread panicked before sending), the VDF shutdown
signal was never sent but `vdf_thread.join()` was still called, causing
an infinite hang during `stop()`. Also handle channel disconnection in
the VDF loop as a safety net so the thread exits even if the sender is
dropped without sending.
- Add `wait_for_block_in_index` helper that polls both index entry AND
  DB header, fixing the race where the index height updates before the
  block header is persisted to the database.
- Use the new helper in `heavy3_mine_ten_blocks_with_migration_depth_two`
  and `heavy3_reth_restarts_use_block_index_before_sync`.
- Rename `solution_reused_when_parent_changes_but_valid` to `heavy_` prefix
  (2-node test) and relax VDF step assertion from `>` to `>=`.
- Rename `evm_payload_with_versioned_hashes_is_rejected` and
  `evm_payload_with_withdrawals_is_rejected` to `heavy_` prefix so they
  get adequate thread resources in CI.
- Add retry logic to `find_tx_in_blocks` helper (up to 10 retries with
  1s delay) so transient block unavailability doesn't cause false
  negatives in hardfork tests.
- Add `wait_for_mempool_commitment_txs` before mining in
  `heavy_pending_pledges_test` to ensure async pending-pledge
  reprocessing completes before block production.
- Rename remaining blobs_rejected tests (`evm_payload_with_blob_gas_used`
  and `evm_payload_with_excess_blob_gas`) to `heavy_` prefix for
  adequate CI thread allocation.
- Rename `test_auto_stake_pledge` to `heavy_` prefix (runs 2 full nodes).
- Replace arbitrary 2-second sleep in auto stake & pledge task with
  `sync_state.wait_for_sync()` so the task waits for block sync to
  complete rather than racing against a fixed delay.
Add block_quiescence helper to IrysNodeTest that subscribes to
BlockStateUpdated events and waits until no events arrive for an idle
period (subscribe-before-act pattern to avoid missing events).

In chain.rs, subscribe before initial_sync and await quiescence in the
spawned auto stake task, replacing the fragile 2-second sleep.

Document the pattern in the integration test skill.
Move from flat file to skill-name/SKILL.md structure for proper
discovery. Improve description (CSO-compliant triggers only), add
overview, input contract, decision flowchart, common mistakes, and
"when not to use" section.
…tion

read_block_from_state() subscribed to the broadcast channel after blocks
were already sent for validation, causing a race where fast validation
could complete and discard the block before the subscription was created.
This made tests flaky with "Timeout waiting for block validation" errors.

Move the subscription before the send at all 24 call sites, and add a
send_block_and_read_state() helper for the common pattern.
Reduced from 269 to 107 lines by removing exhaustive API tables
(agent can read utils.rs directly) and adding an example tests table
pointing to real tests for each common pattern.
Replace wait_until_height (1s polling) with wait_for_block_at_height
(event-driven) in mine_blocks() and mine_block() to eliminate CPU
starvation flakes in CI. Bump slow_heavy3_mine_ten_blocks to heavy4
for autonomous mining thread headroom.
Replace implicit timing dependencies with explicit state synchronization:
- heavy_test_mine_tx: poll for mempool included_height after block confirmation
- heavy_test_multi_node_epoch_replay: wait for peer to sync epoch blocks via gossip
- heavy_unpledge_all_partitions_refund_flow: wait for genesis to receive peer's block
- heavy_solution_discarded_vdf_too_old: wait for node1 to sync node2's blocks before resume
…ndition

Two independent fixes for heavy_test_aurora_hardfork_recovery_peer_sync flake:

1. BlockTreeService shutdown resilience (block_tree_service.rs):
   - Make chunk_migration channel send non-fatal on close (warn instead of error)
   - Make shutdown drain loop tolerant of downstream channel closure

2. Hardfork test race condition (hardfork_tests.rs):
   - Replace wait_until_block_index_height with wait_for_block_in_index
   - Root cause: in-memory block index updates synchronously but MDBX
     persistence happens asynchronously via broadcast event in mempool
     service. The old wait returned before DB writes completed.
Replace wait_until_block_index_height (checks only in-memory index) with
wait_for_block_in_index (checks index + MDBX) before calling
get_block_by_height_from_index which reads from the database.
Replace polling wait_until_height_confirmed (1s sleep loop) with
event-driven wait_for_block_at_height + block_quiescence in the
ema_forks test. The test intentionally gossips only the tip block
to exercise the background block sync path. The old polling approach
wasted timeout budget under CI load causing flaky timeouts.

Also revert nextest retries to 0 (masks flakiness instead of fixing it).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/chain/tests/multi_node/mempool_tests.rs (1)

340-344: ⚠️ Potential issue | 🟡 Minor

Use #[test_log::test(tokio::test)] for this tracing-heavy async test.

This test explicitly sets RUST_LOG and calls initialize_tracing(), but is annotated with #[tokio::test]. Per the coding guidelines it should use #[test_log::test(tokio::test)].

🔧 Proposed fix
-#[tokio::test]
+#[test_log::test(tokio::test)]
 async fn heavy_pending_pledges_test() -> eyre::Result<()> {
-    std::env::set_var("RUST_LOG", "debug");
-    initialize_tracing();

As per coding guidelines: "Use #[test_log::test(tokio::test)] for tests that require tracing output".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/multi_node/mempool_tests.rs` around lines 340 - 344, The
test heavy_pending_pledges_test is currently annotated with #[tokio::test] but
configures tracing (sets RUST_LOG and calls initialize_tracing()), so change the
test attribute to use the test_log wrapper: replace #[tokio::test] with
#[test_log::test(tokio::test)] on the heavy_pending_pledges_test function to
ensure tracing output is captured per project guidelines.
crates/chain/tests/data_sync/sync_partition_data_tests.rs (1)

197-245: 🧹 Nitpick | 🔵 Trivial

Collapse repeated per-node slot evaluation into a helper.

The genesis/peer1/peer2 expectation checks are structurally identical. Extracting one helper would reduce copy/paste drift and keep future expectation changes consistent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/data_sync/sync_partition_data_tests.rs` around lines 197 -
245, The three repeated blocks that call check_storage_module_chunks and
slot_matches_expectation for genesis, peer1, and peer2 should be collapsed into
a single helper function (e.g., evaluate_node or check_node_slots) that takes
the node handle, a node label string, and the expectations (slot0_expectation,
slot1_expectation) and returns a (synced: bool, diag: String) tuple; inside the
helper call check_storage_module_chunks for DataLedger::Publish and
DataLedger::Submit for slots 0 and 1 and reuse slot_matches_expectation to
compute booleans and diagnostics, then replace the
genesis_synced/peer1_synced/peer2_synced assignments and
last_genesis_diag/last_peer1_diag/last_peer2_diag formatting with calls to this
helper to eliminate duplication and ensure consistent behavior.
♻️ Duplicate comments (11)
crates/chain/tests/block_production/unstake_refund.rs (1)

1022-1022: heavy3_ prefix is still unexplained — prior review concern unresolved.

The numeric suffix introduced by renaming heavy_unpledge_and_unstake_concurrent_success_flowheavy3_unpledge_and_unstake_concurrent_success_flow is still undocumented. All other tests in this file continue to use plain heavy_ prefixes. If heavy3_ denotes a diagnostic/timeout tier associated with #[diag_slow], a brief doc-comment above the function would clarify the convention for future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/block_production/unstake_refund.rs` at line 1022, The test
function heavy3_unpledge_and_unstake_concurrent_success_flow has an unexplained
numeric prefix "heavy3_"—add a concise doc-comment immediately above the
function that documents the naming convention (e.g., that "heavy"/"heavyN"
denotes test weight/diagnostic/timeout tier and that "3" maps to the
#[diag_slow] or specific timeout/diagnostic level used) so future maintainers
understand the suffix; reference the function name
heavy3_unpledge_and_unstake_concurrent_success_flow and, if relevant, mention
the association with #[diag_slow] and any expected timeout/behavior in that
single-line doc-comment.
crates/chain/tests/multi_node/mempool_tests.rs (1)

1924-1925: ⚠️ Potential issue | 🟡 Minor

slow_heavy3_test_evm_gossip still uses #[tokio::test] despite setting RUST_LOG and calling initialize_tracing().

This is unchanged from the previously flagged concern — the attribute should be #[test_log::test(tokio::test)].

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/multi_node/mempool_tests.rs` around lines 1924 - 1925, The
test function slow_heavy3_test_evm_gossip is still annotated with #[tokio::test]
even though you set RUST_LOG and call initialize_tracing(); replace the
attribute with #[test_log::test(tokio::test)] on the slow_heavy3_test_evm_gossip
function so the test harness initializes logging/tracing properly when running
under Tokio.
crates/actors/src/services.rs (1)

228-237: 🧹 Nitpick | 🔵 Trivial

() return type still collapses three distinct exit causes.

All three exit paths — idle gap elapsed, deadline exceeded, channel closed — return (), making it impossible for call sites to distinguish a healthy idle from an unexpected channel close or a deadline overrun. This was raised in a previous review on the predecessor function.

♻️ Proposed fix — discriminated return type
+#[derive(Debug, PartialEq, Eq)]
+pub enum BroadcastIdleResult {
+    /// No events arrived for the full `idle` window.
+    Idle,
+    /// The outer deadline was exceeded before the idle window could elapse.
+    Deadline,
+    /// The channel was closed.
+    Closed,
+}

 pub async fn wait_until_broadcast_idle<T: Clone>(
     rx: &mut tokio::sync::broadcast::Receiver<T>,
     idle: std::time::Duration,
     deadline: tokio::time::Instant,
-) {
+) -> BroadcastIdleResult {
     loop {
         match tokio::time::timeout_at(deadline, tokio::time::timeout(idle, rx.recv())).await {
             Ok(Ok(Ok(_))) => continue,
             Ok(Ok(Err(tokio::sync::broadcast::error::RecvError::Lagged(_)))) => continue,
-            Ok(Ok(Err(tokio::sync::broadcast::error::RecvError::Closed))) => break,
-            Ok(Err(_)) => break,
-            Err(_) => break,
+            Ok(Ok(Err(tokio::sync::broadcast::error::RecvError::Closed))) => {
+                return BroadcastIdleResult::Closed;
+            }
+            Ok(Err(_)) => return BroadcastIdleResult::Idle,
+            Err(_) => return BroadcastIdleResult::Deadline,
         }
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/actors/src/services.rs` around lines 228 - 237, The loop handling
tokio::time::timeout_at(deadline, tokio::time::timeout(idle, rx.recv()))
currently collapses three distinct exit causes into unit () — change the
function to return a discriminated enum (e.g., Idle, DeadlineExceeded,
ChannelClosed or similar) instead of () and have each match arm return the
appropriate variant (map Ok(Ok(Ok(_))) => Idle, Err(_) from timeout_at =>
DeadlineExceeded, Ok(Ok(Err(Closed))) => ChannelClosed, and preserve Lagged as
Idle/continue or a distinct variant if desired); update callers to handle the
new enum. Ensure to define the enum near the function and update match arms that
now use tokio::time::timeout_at, tokio::time::timeout, rx.recv(), and other
referenced symbols so exit causes are distinguishable.
crates/vdf/src/vdf.rs (1)

84-86: ⚠️ Potential issue | 🟠 Major

Cancellation can still be starved during fast-forward drain.

The while let Ok(...) loop at line 90 drains all pending fast-forward messages without re-checking shutdown_token.is_cancelled(). Under sustained fast-forward traffic, shutdown could be delayed indefinitely.

🛠️ Proposed fix
-    loop {
+    'vdf_loop: loop {
         if shutdown_token.is_cancelled() {
             tracing::info!("VDF loop: shutdown token cancelled, exiting");
             break;
         }

         // check for VDF fast forward step
         while let Ok(traced_ff_step) = fast_forward_receiver.try_recv() {
+            if shutdown_token.is_cancelled() {
+                tracing::info!("VDF loop: shutdown token cancelled during fast-forward drain, exiting");
+                break 'vdf_loop;
+            }
             let (proposed_ff_step, _entered) = traced_ff_step.into_inner();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vdf/src/vdf.rs` around lines 84 - 86, The fast-forward drain loop (the
while let Ok(...) loop in vdf.rs that consumes fast-forward messages) can starve
shutdown because it does not re-check shutdown_token.is_cancelled() while
draining; modify the loop so it checks shutdown_token.is_cancelled() between
iterations (or incorporate the token into the loop condition) and break
immediately if cancelled, ensuring the VDF loop exits promptly even under
sustained fast-forward traffic.
crates/chain/tests/block_production/unpledge_refund.rs (2)

697-709: ⚠️ Potential issue | 🟡 Minor

Inconsistent error propagation: .await? vs .await.expect("…").

Line 702 uses .await? while other equivalent wait_for_block_at_height sites (lines 88, 212, 799) use .await.expect("…") with descriptive messages. The .expect form provides better CI diagnostics.

♻️ Proposed fix
     genesis_node
         .wait_for_block_at_height(inclusion_block.height, seconds_to_wait)
-        .await?;
+        .await
+        .expect("genesis should sync peer-mined inclusion block");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/block_production/unpledge_refund.rs` around lines 697 -
709, Replace the fallible `.await?` calls with `.await.expect("...")` to match
the rest of the test and improve CI diagnostics: after calling
`peer_node.mine_block().await?` use
`genesis_node.wait_for_block_at_height(...).await.expect("timed out waiting for
genesis to receive mined block via gossip")` and likewise change the
`genesis_node.wait_for_reth_marker(...).await?` to `.await.expect("timed out
waiting for genesis Reth marker for mined block")`; locate these changes around
the calls to `peer_node.mine_block`, `genesis_node.wait_for_block_at_height`,
and `genesis_node.wait_for_reth_marker` and use descriptive messages consistent
with other occurrences.

85-89: ⚠️ Potential issue | 🟡 Minor

Misleading variable name: inclusion_block_peer is mined by genesis, not the peer.

genesis_node.mine_block() is called on line 85, but the variable name and .expect message suggest the block comes from a peer. This was flagged in a previous review and remains unaddressed.

🔤 Proposed rename
-    let inclusion_block_peer = genesis_node.mine_block().await?;
+    let inclusion_block_mined = genesis_node.mine_block().await?;
     genesis_node
-        .wait_for_block_at_height(inclusion_block_peer.height, seconds_to_wait)
+        .wait_for_block_at_height(inclusion_block_mined.height, seconds_to_wait)
         .await
-        .expect("genesis should sync peer-mined inclusion block");
+        .expect("genesis should finish processing inclusion block");
     let inclusion_block = genesis_node
-        .get_block_by_height(inclusion_block_peer.height)
+        .get_block_by_height(inclusion_block_mined.height)
         .await?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/block_production/unpledge_refund.rs` around lines 85 - 89,
The variable name inclusion_block_peer is misleading because the block is
produced by genesis_node.mine_block(); rename it to something like
inclusion_block or genesis_mined_block and update the assertion message in the
wait_for_block_at_height().expect(...) call (currently "genesis should sync
peer-mined inclusion block") to reflect that the block was mined by genesis
(e.g., "genesis should sync its mined inclusion block"); change references to
inclusion_block_peer to the new name and ensure tests continue to compile and
run.
crates/actors/src/block_tree_service.rs (1)

183-190: ⚠️ Potential issue | 🟠 Major

Drain loop still aborts early for non-BlockValidationFinished messages during shutdown.

Skipping BlockValidationFinished is a good step, but BlockPreValidated messages can also fail during shutdown (e.g., ValidationServiceUnreachable when the validation service channel is already closed at line 381). The ? on line 189 will abort the drain and bubble an error out of start().

Consider logging and continuing for all message types during the drain, not just BlockValidationFinished:

🛠️ Proposed fix
         while let Ok(traced) = self.msg_rx.try_recv() {
             let (msg, parent_span) = traced.into_parts();
             match msg {
                 // Skip: talks to downstream services (reth FCU, migration) that
                 // may already be stopped during shutdown.
                 BlockTreeServiceMessage::BlockValidationFinished { .. } => {
                     debug!("Skipping BlockValidationFinished during shutdown drain");
                 }
-                msg => self.inner.handle_message(msg, parent_span).await?,
+                msg => {
+                    if let Err(e) = self.inner.handle_message(msg, parent_span).await {
+                        debug!("Error handling message during shutdown drain: {:?}", e);
+                    }
+                }
             }
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/actors/src/block_tree_service.rs` around lines 183 - 190, The drain
loop in start() currently only skips BlockValidationFinished and awaits
self.inner.handle_message(msg, parent_span).await? which will propagate any
error (e.g., ValidationServiceUnreachable for BlockPreValidated) and abort the
drain; change the logic so that inside the match over BlockTreeServiceMessage
you call self.inner.handle_message(msg, parent_span).await but do not use the ?
— instead match on the Result, log the error with context (including the message
variant) and continue the loop; ensure BlockValidationFinished remains skipped
and that errors from handle_message (for example when handling
BlockPreValidated) are logged and ignored during shutdown rather than returned.
.claude/skills/debugging-flaky-tests/SKILL.md (1)

185-185: ⚠️ Potential issue | 🟡 Minor

Parameter name chunk is misleading — actual argument is a boolean.

The past review noted this: actual call sites pass a bool (e.g., wait_for_block_in_index(migration_height, false, seconds_to_wait)). The parameter should reflect this.

📝 Proposed fix
-| `wait_for_block_in_index(h, chunk, secs)` | Polling DB (1s) | Need block in persistent index |
+| `wait_for_block_in_index(h, include_txs, secs)` | Polling DB (1s) | Need block in persistent index |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/debugging-flaky-tests/SKILL.md at line 185, The parameter
name chunk in function wait_for_block_in_index(h, chunk, secs) is misleading
because callers pass a boolean; rename the parameter to something like
expect_persistent (or is_persistent) and update the function signature/type
hints, its docstring, all internal logic that treats it as a boolean, and every
call site that passes a boolean (e.g., wait_for_block_in_index(migration_height,
false, seconds_to_wait)) to use the new name; ensure tests and any comments are
updated accordingly to reflect the boolean semantic.
crates/macros/diag-slow/src/lib.rs (1)

101-112: 🧹 Nitpick | 🔵 Trivial

Comment on the async move branch still mentions "state diagnostics" — misleading.

The async move branch is taken when state_expr is absent, so the comment "ownership transfer doesn't conflict with state diagnostics borrowing captured variables" references a feature that does not apply in this branch. The correct explanation is that move is safe here precisely because there is no state_expr needing to borrow outer captures.

📝 Proposed fix
     } else {
         quote! {
-            // `move` is safe here: ownership transfer doesn't conflict with
-            // state diagnostics borrowing captured variables.
+            // No `state_expr`: no diagnostic closure needs to borrow outer
+            // captures, so `async move` safely transfers ownership into the
+            // future and avoids lifetime issues.
             let __diag_fut = async move `#body`;
         }
     };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/macros/diag-slow/src/lib.rs` around lines 101 - 112, The comment on
the `async move` branch is misleading because that branch is executed when
`state_expr` is None; update the comment around `fut_decl` for the `else` branch
(the `let __diag_fut = async move `#body`;` case) to state that `move` is safe
because there is no `state_expr` that would need to borrow captured variables
(i.e., ownership transfer is fine when `state_expr` is absent), and remove or
replace the reference to "state diagnostics" borrowing captured variables to
accurately reflect the absence of `state_expr`.
crates/macros/diag-slow/Cargo.toml (1)

1-19: LGTM — quote is now a workspace dependency as previously recommended.

The Cargo.toml is correctly configured: proc-macro = true, syn = { workspace = true, features = ["full"] }, and quote = { workspace = true } now that the root workspace declares quote = "1.0".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/macros/diag-slow/Cargo.toml` around lines 1 - 19, No change required:
the Cargo.toml for the irys-macros-diag-slow crate is correctly
configured—ensure proc-macro = true remains in the [lib] section and the
dependencies keep syn = { workspace = true, features = ["full"] } and quote = {
workspace = true } so the crate uses the workspace-declared quote = "1.0".
crates/chain/tests/utils.rs (1)

1447-1461: ⚠️ Potential issue | 🟠 Major

Handle RecvError::Lagged as recoverable in reorg waits.

Line 1458 returns an error on lag, which can fail tests even though the receiver can continue. This should continue waiting until timeout/closed instead.

💡 Suggested fix
-        let timeout_duration = Duration::from_secs(seconds_to_wait as u64);
-        match tokio::time::timeout(timeout_duration, reorg_rx.recv()).await {
-            Ok(Ok(reorg_event)) => {
-                info!(
-                    "Reorg detected: {} blocks in old fork, {} in new fork, fork at height {}, new tip: {}",
-                    reorg_event.old_fork.len(),
-                    reorg_event.new_fork.len(),
-                    reorg_event.fork_parent.height,
-                    reorg_event.new_tip
-                );
-                Ok(reorg_event)
-            }
-            Ok(Err(tokio::sync::broadcast::error::RecvError::Lagged(skipped))) => Err(eyre::eyre!(
-                "Reorg broadcast receiver lagged and skipped {} events",
-                skipped
-            )),
-            Ok(Err(tokio::sync::broadcast::error::RecvError::Closed)) => {
-                Err(eyre::eyre!("Reorg broadcast channel closed"))
-            }
-            Err(_) => {
-                let current_tip = self.get_max_difficulty_block().block_hash;
-                let state = self.diag_wait_state().await;
-                Err(eyre::eyre!(
-                    "Timeout: No reorg event received within {} seconds (start_tip={}, current_tip={}). State: {}",
-                    seconds_to_wait,
-                    start_tip,
-                    current_tip,
-                    state
-                ))
-            }
-        }
+        let deadline = Instant::now() + Duration::from_secs(seconds_to_wait as u64);
+        loop {
+            let remaining = deadline.saturating_duration_since(Instant::now());
+            if remaining.is_zero() {
+                let current_tip = self.get_max_difficulty_block().block_hash;
+                let state = self.diag_wait_state().await;
+                return Err(eyre::eyre!(
+                    "Timeout: No reorg event received within {} seconds (start_tip={}, current_tip={}). State: {}",
+                    seconds_to_wait,
+                    start_tip,
+                    current_tip,
+                    state
+                ));
+            }
+            match tokio::time::timeout(remaining, reorg_rx.recv()).await {
+                Ok(Ok(reorg_event)) => {
+                    info!(
+                        "Reorg detected: {} blocks in old fork, {} in new fork, fork at height {}, new tip: {}",
+                        reorg_event.old_fork.len(),
+                        reorg_event.new_fork.len(),
+                        reorg_event.fork_parent.height,
+                        reorg_event.new_tip
+                    );
+                    return Ok(reorg_event);
+                }
+                Ok(Err(tokio::sync::broadcast::error::RecvError::Lagged(skipped))) => {
+                    warn!("Reorg receiver lagged; skipped {skipped} events, continuing");
+                    continue;
+                }
+                Ok(Err(tokio::sync::broadcast::error::RecvError::Closed)) => {
+                    return Err(eyre::eyre!("Reorg broadcast channel closed"));
+                }
+                Err(_) => continue,
+            }
+        }
tokio::sync::broadcast::Receiver::recv RecvError::Lagged semantics: is Lagged recoverable and should receiver continue waiting?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/utils.rs` around lines 1447 - 1461, The test currently
treats tokio::sync::broadcast::error::RecvError::Lagged as a terminal error when
awaiting reorg_rx.recv(), causing spurious failures; change the handling in the
match around tokio::time::timeout(timeout_duration, reorg_rx.recv()).await so
that when you get RecvError::Lagged(skipped) you log or warn (include skipped)
and then continue waiting (i.e., do not return Err, loop until a real event,
timeout, or closed), leaving the Ok(Ok(reorg_event)) and other error branches
intact; look for the match that calls reorg_rx.recv() and adjust the Lagged arm
to be recoverable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/writing-integration-tests/SKILL.md:
- Line 107: The phrase "very large epochs" in the sentence starting with
"Epoch-aware tests need `testing_with_epochs(n)`" is vague—replace it with a
more precise descriptor such as "epochs sized for production, not tests" or
"epochs orders-of-magnitude larger than typical test runs" (or "extremely large"
if you prefer a shorter change) so readers know why `testing_with_epochs(n)` is
needed; update the line in SKILL.md that contains "Epoch-aware tests need
`testing_with_epochs(n)`" accordingly.

In `@crates/chain/src/chain.rs`:
- Around line 208-211: The shutdown send failure is an expected outcome when the
reth receiver has already been dropped, so replace the error-level log in the
self.reth_shutdown_sender.send(reason).await error branch with a non-error log
(e.g., debug! or info!) that notes the reth shutdown channel was already closed
and include the send error for context; update the logging call near the send
invocation to use debug!/info! and a concise message like "reth shutdown channel
closed" plus the error details.

In `@crates/chain/tests/api/hardfork_tests.rs`:
- Line 763: The test function
heavy3_test_borealis_accepts_update_reward_address_post_activation uses an
inconsistent "heavy3_test_" prefix; rename the function to
heavy_test_borealis_accepts_update_reward_address_post_activation to match the
existing heavy_test_ convention, update any references/usages (test attributes,
mod imports or calls) that refer to
heavy3_test_borealis_accepts_update_reward_address_post_activation, and run the
test suite to ensure no other code relies on the old name.
- Around line 164-170: The tracing::debug call is duplicating values because
scanned and skipped are passed as structured fields and also interpolated in the
format string; update the tracing::debug invocation (the call using
tracing::debug! with ?tx_id, scanned, skipped, max_height) to remove the
{scanned} and {skipped} interpolations from the message and leave them only as
structured fields (e.g., change the message to something like "tx not found in
scanned blocks (some heights failed to load)" or simply "tx not found" while
keeping ?tx_id, scanned, skipped, max_height as fields).

In `@crates/chain/tests/block_production/block_production.rs`:
- Around line 1173-1176: The wait predicate currently accepts any state except
NotOnchain(Unknown), which can include transient states and cause a race; update
the predicate passed to wait_for_block_event to require ev.block_hash ==
new_block_hash && ev.state == ChainState::Onchain so the test waits explicitly
for ChainState::Onchain before proceeding (referencing wait_for_block_event,
ChainState::Onchain, new_block_hash and new_block_state).

In `@crates/chain/tests/block_production/difficulty_adjustment.rs`:
- Around line 146-152: The test uses a hardcoded 60s timeout when calling
wait_for_block_event(&mut block_state_rx, 60, ...) which contradicts the earlier
max_seconds value (max_seconds) and a prior review asking for ≤10s; either
switch this call to use the existing max_seconds variable or introduce a clearly
named constant like validation_timeout = 60 with a brief doc comment explaining
why it must exceed max_seconds, then pass that constant into
wait_for_block_event to remove the magic number and make the intent explicit.

In `@crates/chain/tests/data_sync/sync_partition_data_tests.rs`:
- Around line 175-186: The test duplicates numeric literals for slot
expectations (e.g., expected_total = 60, max_data = 50) which can drift from the
setup; replace these hard-coded values in SlotSyncExpectation constructions
(slot0_expectation and slot1_expectation) by deriving them from the shared setup
constants used to build the test data (e.g., total_items, data_items_per_slot or
whatever constants configure packed vs data counts) so expected_total, max_data
and min_packed are computed from those shared variables instead of repeating
60/50/10 literals.
- Around line 246-250: The test captures periodic sync snapshots into
last_genesis_state / last_peer1_state / last_peer2_state but assertions may use
stale values; before any assert that reports node sync state, refresh those
variables by calling genesis_node.sync_state_snapshot().await,
peer1_node.sync_state_snapshot().await, and
peer2_node.sync_state_snapshot().await and reassign to last_genesis_state /
last_peer1_state / last_peer2_state immediately prior to the failure asserts (do
this in each assertion block where snapshot freshness matters, e.g., the blocks
currently setting those variables periodically).

In `@crates/chain/tests/validation/mod.rs`:
- Around line 68-70: Handle RecvError::Lagged explicitly in
read_block_from_state: locate the loop in read_block_from_state that polls
event_rx.try_recv() and add a match arm for RecvError::Lagged which logs a
warning (similar to wait_for_height_update and wait_for_block_event) and then
continues the outer polling loop instead of silently dropping events; ensure the
log message mentions that the broadcast receiver lagged and include relevant
identifiers (e.g., block hash or event type) for observability.

In `@crates/p2p/src/gossip_service.rs`:
- Around line 412-415: The current code logs a warn! unconditionally for the
broadcast_task_handle.wait_for_exit() result; change the handling to match on
broadcast_res (the Result returned by broadcast_task_handle.wait_for_exit()) and
log at info or debug when it is Ok(()) (or Ok value), and only emit warn! when
it is Err(e). Keep the assignment early_exit_result = Some(broadcast_res) but
perform the conditional logging before/after that assignment so normal/expected
exits are logged at a lower level and only errors use warn!.

---

Outside diff comments:
In `@crates/chain/tests/data_sync/sync_partition_data_tests.rs`:
- Around line 197-245: The three repeated blocks that call
check_storage_module_chunks and slot_matches_expectation for genesis, peer1, and
peer2 should be collapsed into a single helper function (e.g., evaluate_node or
check_node_slots) that takes the node handle, a node label string, and the
expectations (slot0_expectation, slot1_expectation) and returns a (synced: bool,
diag: String) tuple; inside the helper call check_storage_module_chunks for
DataLedger::Publish and DataLedger::Submit for slots 0 and 1 and reuse
slot_matches_expectation to compute booleans and diagnostics, then replace the
genesis_synced/peer1_synced/peer2_synced assignments and
last_genesis_diag/last_peer1_diag/last_peer2_diag formatting with calls to this
helper to eliminate duplication and ensure consistent behavior.

In `@crates/chain/tests/multi_node/mempool_tests.rs`:
- Around line 340-344: The test heavy_pending_pledges_test is currently
annotated with #[tokio::test] but configures tracing (sets RUST_LOG and calls
initialize_tracing()), so change the test attribute to use the test_log wrapper:
replace #[tokio::test] with #[test_log::test(tokio::test)] on the
heavy_pending_pledges_test function to ensure tracing output is captured per
project guidelines.

---

Duplicate comments:
In @.claude/skills/debugging-flaky-tests/SKILL.md:
- Line 185: The parameter name chunk in function wait_for_block_in_index(h,
chunk, secs) is misleading because callers pass a boolean; rename the parameter
to something like expect_persistent (or is_persistent) and update the function
signature/type hints, its docstring, all internal logic that treats it as a
boolean, and every call site that passes a boolean (e.g.,
wait_for_block_in_index(migration_height, false, seconds_to_wait)) to use the
new name; ensure tests and any comments are updated accordingly to reflect the
boolean semantic.

In `@crates/actors/src/block_tree_service.rs`:
- Around line 183-190: The drain loop in start() currently only skips
BlockValidationFinished and awaits self.inner.handle_message(msg,
parent_span).await? which will propagate any error (e.g.,
ValidationServiceUnreachable for BlockPreValidated) and abort the drain; change
the logic so that inside the match over BlockTreeServiceMessage you call
self.inner.handle_message(msg, parent_span).await but do not use the ? — instead
match on the Result, log the error with context (including the message variant)
and continue the loop; ensure BlockValidationFinished remains skipped and that
errors from handle_message (for example when handling BlockPreValidated) are
logged and ignored during shutdown rather than returned.

In `@crates/actors/src/services.rs`:
- Around line 228-237: The loop handling tokio::time::timeout_at(deadline,
tokio::time::timeout(idle, rx.recv())) currently collapses three distinct exit
causes into unit () — change the function to return a discriminated enum (e.g.,
Idle, DeadlineExceeded, ChannelClosed or similar) instead of () and have each
match arm return the appropriate variant (map Ok(Ok(Ok(_))) => Idle, Err(_) from
timeout_at => DeadlineExceeded, Ok(Ok(Err(Closed))) => ChannelClosed, and
preserve Lagged as Idle/continue or a distinct variant if desired); update
callers to handle the new enum. Ensure to define the enum near the function and
update match arms that now use tokio::time::timeout_at, tokio::time::timeout,
rx.recv(), and other referenced symbols so exit causes are distinguishable.

In `@crates/chain/tests/block_production/unpledge_refund.rs`:
- Around line 697-709: Replace the fallible `.await?` calls with
`.await.expect("...")` to match the rest of the test and improve CI diagnostics:
after calling `peer_node.mine_block().await?` use
`genesis_node.wait_for_block_at_height(...).await.expect("timed out waiting for
genesis to receive mined block via gossip")` and likewise change the
`genesis_node.wait_for_reth_marker(...).await?` to `.await.expect("timed out
waiting for genesis Reth marker for mined block")`; locate these changes around
the calls to `peer_node.mine_block`, `genesis_node.wait_for_block_at_height`,
and `genesis_node.wait_for_reth_marker` and use descriptive messages consistent
with other occurrences.
- Around line 85-89: The variable name inclusion_block_peer is misleading
because the block is produced by genesis_node.mine_block(); rename it to
something like inclusion_block or genesis_mined_block and update the assertion
message in the wait_for_block_at_height().expect(...) call (currently "genesis
should sync peer-mined inclusion block") to reflect that the block was mined by
genesis (e.g., "genesis should sync its mined inclusion block"); change
references to inclusion_block_peer to the new name and ensure tests continue to
compile and run.

In `@crates/chain/tests/block_production/unstake_refund.rs`:
- Line 1022: The test function
heavy3_unpledge_and_unstake_concurrent_success_flow has an unexplained numeric
prefix "heavy3_"—add a concise doc-comment immediately above the function that
documents the naming convention (e.g., that "heavy"/"heavyN" denotes test
weight/diagnostic/timeout tier and that "3" maps to the #[diag_slow] or specific
timeout/diagnostic level used) so future maintainers understand the suffix;
reference the function name heavy3_unpledge_and_unstake_concurrent_success_flow
and, if relevant, mention the association with #[diag_slow] and any expected
timeout/behavior in that single-line doc-comment.

In `@crates/chain/tests/multi_node/mempool_tests.rs`:
- Around line 1924-1925: The test function slow_heavy3_test_evm_gossip is still
annotated with #[tokio::test] even though you set RUST_LOG and call
initialize_tracing(); replace the attribute with #[test_log::test(tokio::test)]
on the slow_heavy3_test_evm_gossip function so the test harness initializes
logging/tracing properly when running under Tokio.

In `@crates/chain/tests/utils.rs`:
- Around line 1447-1461: The test currently treats
tokio::sync::broadcast::error::RecvError::Lagged as a terminal error when
awaiting reorg_rx.recv(), causing spurious failures; change the handling in the
match around tokio::time::timeout(timeout_duration, reorg_rx.recv()).await so
that when you get RecvError::Lagged(skipped) you log or warn (include skipped)
and then continue waiting (i.e., do not return Err, loop until a real event,
timeout, or closed), leaving the Ok(Ok(reorg_event)) and other error branches
intact; look for the match that calls reorg_rx.recv() and adjust the Lagged arm
to be recoverable.

In `@crates/macros/diag-slow/Cargo.toml`:
- Around line 1-19: No change required: the Cargo.toml for the
irys-macros-diag-slow crate is correctly configured—ensure proc-macro = true
remains in the [lib] section and the dependencies keep syn = { workspace = true,
features = ["full"] } and quote = { workspace = true } so the crate uses the
workspace-declared quote = "1.0".

In `@crates/macros/diag-slow/src/lib.rs`:
- Around line 101-112: The comment on the `async move` branch is misleading
because that branch is executed when `state_expr` is None; update the comment
around `fut_decl` for the `else` branch (the `let __diag_fut = async move
`#body`;` case) to state that `move` is safe because there is no `state_expr` that
would need to borrow captured variables (i.e., ownership transfer is fine when
`state_expr` is absent), and remove or replace the reference to "state
diagnostics" borrowing captured variables to accurately reflect the absence of
`state_expr`.

In `@crates/vdf/src/vdf.rs`:
- Around line 84-86: The fast-forward drain loop (the while let Ok(...) loop in
vdf.rs that consumes fast-forward messages) can starve shutdown because it does
not re-check shutdown_token.is_cancelled() while draining; modify the loop so it
checks shutdown_token.is_cancelled() between iterations (or incorporate the
token into the loop condition) and break immediately if cancelled, ensuring the
VDF loop exits promptly even under sustained fast-forward traffic.

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b766c34 and 1bf4292.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (33)
  • .claude/skills/debugging-flaky-tests/SKILL.md
  • .claude/skills/writing-integration-tests/SKILL.md
  • Cargo.toml
  • crates/actors/src/block_tree_service.rs
  • crates/actors/src/services.rs
  • crates/chain/src/chain.rs
  • crates/chain/src/main.rs
  • crates/chain/tests/api/hardfork_tests.rs
  • crates/chain/tests/block_production/block_production.rs
  • crates/chain/tests/block_production/difficulty_adjustment.rs
  • crates/chain/tests/block_production/unpledge_refund.rs
  • crates/chain/tests/block_production/unstake_refund.rs
  • crates/chain/tests/data_sync/sync_partition_data_tests.rs
  • crates/chain/tests/external/block_production.rs
  • crates/chain/tests/integration/cache_service.rs
  • crates/chain/tests/multi_node/ema_forks.rs
  • crates/chain/tests/multi_node/fork_recovery.rs
  • crates/chain/tests/multi_node/mempool_tests.rs
  • crates/chain/tests/multi_node/sync_chain_state.rs
  • crates/chain/tests/multi_node/validation.rs
  • crates/chain/tests/promotion/data_promotion_basic.rs
  • crates/chain/tests/startup/startup.rs
  • crates/chain/tests/term_ledger_expiry/mod.rs
  • crates/chain/tests/utils.rs
  • crates/chain/tests/validation/data_tx_pricing.rs
  • crates/chain/tests/validation/invalid_perm_fee_refund.rs
  • crates/chain/tests/validation/mod.rs
  • crates/chain/tests/validation/unstake_edge_cases.rs
  • crates/macros/diag-slow/Cargo.toml
  • crates/macros/diag-slow/src/lib.rs
  • crates/p2p/src/gossip_service.rs
  • crates/p2p/src/tests/integration/mod.rs
  • crates/vdf/src/vdf.rs

- **`consensus.get_mut()` panics on non-Custom variants** - only works after `NodeConfig::testing()` which creates a `Custom` variant
- **Fund signers before starting nodes** genesis accounts are baked into genesis block
- **`wait_for_mempool` before mining on genesis** when txs originate from a peer gossip takes time
- **Epoch-aware tests need `testing_with_epochs(n)`** default config has very large epochs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Nit: "very large epochs" — consider a more precise alternative.

The static analysis tooling flagged "very" as an overused intensifier. Replacing it with a concrete descriptor (e.g., "extremely large" or simply "epochs sized for production, not tests") would be more informative.

🧰 Tools
🪛 LanguageTool

[style] ~107-~107: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ..._with_epochs(n)`** — default config has very large epochs - Block migration depth — bl...

(EN_WEAK_ADJECTIVE)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/writing-integration-tests/SKILL.md at line 107, The phrase
"very large epochs" in the sentence starting with "Epoch-aware tests need
`testing_with_epochs(n)`" is vague—replace it with a more precise descriptor
such as "epochs sized for production, not tests" or "epochs orders-of-magnitude
larger than typical test runs" (or "extremely large" if you prefer a shorter
change) so readers know why `testing_with_epochs(n)` is needed; update the line
in SKILL.md that contains "Epoch-aware tests need `testing_with_epochs(n)`"
accordingly.

Comment thread crates/chain/src/chain.rs
Comment on lines 164 to 170
tracing::debug!(
"Transaction {} not found in blocks 1..={}",
tx_id,
max_height
?tx_id,
scanned,
skipped,
max_height,
"tx not found in {scanned} blocks ({skipped} heights failed to load)"
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Structured fields scanned and skipped are double-recorded in the log event.

In tracing macros, bare-identifier fields (scanned,, skipped,) are captured as separate key-value pairs and the format-string {scanned} / {skipped} also interpolates those locals into the message field. The result is that both values appear twice in every structured log event.

Either keep the values only as structured fields (clean, machine-parseable) or embed them only in the message string — not both.

♻️ Proposed fix – remove interpolation from the message to avoid duplication
 tracing::debug!(
     ?tx_id,
     scanned,
     skipped,
     max_height,
-    "tx not found in {scanned} blocks ({skipped} heights failed to load)"
+    "tx not found in blocks"
 );
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
tracing::debug!(
"Transaction {} not found in blocks 1..={}",
tx_id,
max_height
?tx_id,
scanned,
skipped,
max_height,
"tx not found in {scanned} blocks ({skipped} heights failed to load)"
);
tracing::debug!(
?tx_id,
scanned,
skipped,
max_height,
"tx not found in blocks"
);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/api/hardfork_tests.rs` around lines 164 - 170, The
tracing::debug call is duplicating values because scanned and skipped are passed
as structured fields and also interpolated in the format string; update the
tracing::debug invocation (the call using tracing::debug! with ?tx_id, scanned,
skipped, max_height) to remove the {scanned} and {skipped} interpolations from
the message and leave them only as structured fields (e.g., change the message
to something like "tx not found in scanned blocks (some heights failed to load)"
or simply "tx not found" while keeping ?tx_id, scanned, skipped, max_height as
fields).

Comment thread crates/chain/tests/api/hardfork_tests.rs Outdated
Comment thread crates/chain/tests/block_production/block_production.rs
Comment on lines +146 to +152

// 60s timeout (not max_seconds) because CI validation can be slower than mining
wait_for_block_event(&mut block_state_rx, 60, |ev| {
ev.block_hash == block.block_hash
&& matches!(ev.validation_result, ValidationResult::Valid)
})
.await?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

60s timeout may still be excessive — reviewer previously requested ≤ 10s.

The comment on line 147 explains the rationale ("CI validation can be slower than mining"), but a prior review round explicitly asked for ≤ 10s. max_seconds is already defined as 10 at line 76. If 60s is intentional, consider aligning by using a named constant (e.g., let validation_timeout = 60;) with a doc comment, rather than an inline magic number that contradicts the test's own max_seconds.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/block_production/difficulty_adjustment.rs` around lines
146 - 152, The test uses a hardcoded 60s timeout when calling
wait_for_block_event(&mut block_state_rx, 60, ...) which contradicts the earlier
max_seconds value (max_seconds) and a prior review asking for ≤10s; either
switch this call to use the existing max_seconds variable or introduce a clearly
named constant like validation_timeout = 60 with a brief doc comment explaining
why it must exceed max_seconds, then pass that constant into
wait_for_block_event to remove the magic number and make the intent explicit.

Comment on lines +175 to +186
// Slot(0) can be either mixed (data + packed) or fully packed.
// Slot(1) is expected to be fully packed in this test scenario.
let slot0_expectation = SlotSyncExpectation {
expected_total: 60,
max_data: 50,
min_packed: 10,
};
let slot1_expectation = SlotSyncExpectation {
expected_total: 60,
max_data: 0,
min_packed: 60,
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Derive slot expectations from single sources to avoid drift.

60 and 50 are duplicated across setup and expectations. Keeping these derived from shared values makes this test less brittle during future config tweaks.

♻️ Proposed refactor
-    let slot0_expectation = SlotSyncExpectation {
-        expected_total: 60,
-        max_data: 50,
-        min_packed: 10,
-    };
-    let slot1_expectation = SlotSyncExpectation {
-        expected_total: 60,
-        max_data: 0,
-        min_packed: 60,
-    };
+    let slot_total_chunks = 60usize;
+    let slot0_expectation = SlotSyncExpectation {
+        expected_total: slot_total_chunks,
+        max_data: num_chunks,
+        min_packed: slot_total_chunks.saturating_sub(num_chunks),
+    };
+    let slot1_expectation = SlotSyncExpectation {
+        expected_total: slot_total_chunks,
+        max_data: 0,
+        min_packed: slot_total_chunks,
+    };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/data_sync/sync_partition_data_tests.rs` around lines 175 -
186, The test duplicates numeric literals for slot expectations (e.g.,
expected_total = 60, max_data = 50) which can drift from the setup; replace
these hard-coded values in SlotSyncExpectation constructions (slot0_expectation
and slot1_expectation) by deriving them from the shared setup constants used to
build the test data (e.g., total_items, data_items_per_slot or whatever
constants configure packed vs data counts) so expected_total, max_data and
min_packed are computed from those shared variables instead of repeating
60/50/10 literals.

Comment on lines +246 to +250
if attempt % 5 == 0 || attempt == 79 || (genesis_synced && peer1_synced && peer2_synced) {
last_genesis_state = genesis_node.sync_state_snapshot().await;
last_peer1_state = peer1_node.sync_state_snapshot().await;
last_peer2_state = peer2_node.sync_state_snapshot().await;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Refresh sync-state snapshots immediately before failure asserts.

last_*_state is updated periodically, so assertion messages can report stale node state on failure. Capture one final snapshot right before asserting.

🛠️ Proposed fix
+    if !(genesis_synced && peer1_synced && peer2_synced) {
+        last_genesis_state = genesis_node.sync_state_snapshot().await;
+        last_peer1_state = peer1_node.sync_state_snapshot().await;
+        last_peer2_state = peer2_node.sync_state_snapshot().await;
+    }
+
     assert!(
         genesis_synced,
         "Genesis node failed to sync data. chunks={} state={}",
         last_genesis_diag, last_genesis_state
     );

Also applies to: 272-276, 284-298

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/data_sync/sync_partition_data_tests.rs` around lines 246 -
250, The test captures periodic sync snapshots into last_genesis_state /
last_peer1_state / last_peer2_state but assertions may use stale values; before
any assert that reports node sync state, refresh those variables by calling
genesis_node.sync_state_snapshot().await,
peer1_node.sync_state_snapshot().await, and
peer2_node.sync_state_snapshot().await and reassign to last_genesis_state /
last_peer1_state / last_peer2_state immediately prior to the failure asserts (do
this in each assertion block where snapshot freshness matters, e.g., the blocks
currently setting those variables periodically).

Comment on lines +68 to +70
let event_rx = node_ctx.service_senders.subscribe_block_state_updates();
send_block_to_block_tree(node_ctx, block.clone(), skip_vdf_validation).await?;
Ok(read_block_from_state(node_ctx, &block.header().block_hash, event_rx).await)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how read_block_from_state handles RecvError::Lagged
rg -n "Lagged\|RecvError\|recv_error\|broadcast.*recv" --type rust crates/chain/tests/utils.rs | head -40

Repository: Irys-xyz/irys

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Read the read_block_from_state function
sed -n '3641,3696p' crates/chain/tests/utils.rs

Repository: Irys-xyz/irys

Length of output: 2416


🏁 Script executed:

#!/bin/bash
# Search for broadcast receiver patterns and other wait helpers
rg -n "broadcast\|\.recv\(\)\|try_recv\|event_receiver" --type rust crates/chain/tests/utils.rs | head -50

Repository: Irys-xyz/irys

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Search more broadly for broadcast receiver patterns and Lagged handling
rg -n "\.try_recv\(\)|\.recv\(\)|Lagged" --type rust crates/chain/tests/ | head -60

Repository: Irys-xyz/irys

Length of output: 843


🏁 Script executed:

#!/bin/bash
# Examine how Lagged is handled in different wait helpers
echo "=== Handler at line 1062 ===" 
sed -n '1050,1070p' crates/chain/tests/utils.rs

echo -e "\n=== Handler at line 1458 ===" 
sed -n '1445,1465p' crates/chain/tests/utils.rs

echo -e "\n=== Handler at line 3715 ===" 
sed -n '3710,3730p' crates/chain/tests/utils.rs

Repository: Irys-xyz/irys

Length of output: 3110


🏁 Script executed:

#!/bin/bash
# Search for try_recv documentation or other usages in the codebase
rg -n "try_recv" --type rust crates/chain/tests/ -A 3 | head -80

Repository: Irys-xyz/irys

Length of output: 482


🏁 Script executed:

#!/bin/bash
# Verify tokio broadcast::Receiver::try_recv behavior with Tokio documentation
web_search "tokio broadcast Receiver try_recv Lagged behavior"

Repository: Irys-xyz/irys

Length of output: 107


🌐 Web query:

tokio sync broadcast Receiver try_recv documentation TryRecvError

💡 Result:

tokio::sync::broadcast::Receiver::try_recv(&mut self) -> Result<T, TryRecvError> attempts to receive without awaiting and can fail with three TryRecvError variants: (docs.rs)

  • TryRecvError::Empty: no message is currently available but at least one Sender still exists (more messages may arrive). (docs.rs)
  • TryRecvError::Closed: all Sender handles have been dropped, so no further messages will ever be sent. (docs.rs)
  • TryRecvError::Lagged(u64): this receiver fell behind far enough that older messages were overwritten; the u64 is the number of skipped messages. After this error, the receiver’s cursor is moved forward so the next receive returns the oldest message still retained (if any). (docs.rs)

Citations:


read_block_from_state silently loses broadcast events on RecvError::Lagged — add explicit handling for consistency.

The function correctly continues polling via the outer loop when try_recv() encounters Lagged, so test failures are not caused by treating Lagged as fatal. However, validation events are silently discarded when the broadcast receiver falls behind, relying entirely on the block-tree fallback polling to detect state changes.

This pattern is inconsistent with other wait helpers in the same file (wait_for_height_update, wait_for_block_event) that explicitly match RecvError::Lagged and log warnings before continuing. Add an explicit Lagged case with logging to improve observability and align with the existing error-handling convention.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/validation/mod.rs` around lines 68 - 70, Handle
RecvError::Lagged explicitly in read_block_from_state: locate the loop in
read_block_from_state that polls event_rx.try_recv() and add a match arm for
RecvError::Lagged which logs a warning (similar to wait_for_height_update and
wait_for_block_event) and then continues the outer polling loop instead of
silently dropping events; ensure the log message mentions that the broadcast
receiver lagged and include relevant identifiers (e.g., block hash or event
type) for observability.

Comment on lines 412 to 415
broadcast_res = broadcast_task_handle.wait_for_exit() => {
warn!("Gossip broadcast exited because: {:?}", broadcast_res);
early_exit_result = Some(broadcast_res);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Conditionally log early broadcast exit severity

On Line 413, warn! is emitted even when wait_for_exit() returns Ok(()). That can reintroduce shutdown warning noise for expected exits. Please log success at info/debug and reserve warn for Err(_).

Proposed patch
-                        broadcast_res = broadcast_task_handle.wait_for_exit() => {
-                            warn!("Gossip broadcast exited because: {:?}", broadcast_res);
-                            early_exit_result = Some(broadcast_res);
-                        }
+                        broadcast_res = broadcast_task_handle.wait_for_exit() => {
+                            match &broadcast_res {
+                                Ok(()) => info!("Gossip broadcast exited before watcher shutdown"),
+                                Err(err) => warn!("Gossip broadcast exited with error: {err:?}"),
+                            }
+                            early_exit_result = Some(broadcast_res);
+                        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
broadcast_res = broadcast_task_handle.wait_for_exit() => {
warn!("Gossip broadcast exited because: {:?}", broadcast_res);
early_exit_result = Some(broadcast_res);
}
broadcast_res = broadcast_task_handle.wait_for_exit() => {
match &broadcast_res {
Ok(()) => info!("Gossip broadcast exited before watcher shutdown"),
Err(err) => warn!("Gossip broadcast exited with error: {err:?}"),
}
early_exit_result = Some(broadcast_res);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/p2p/src/gossip_service.rs` around lines 412 - 415, The current code
logs a warn! unconditionally for the broadcast_task_handle.wait_for_exit()
result; change the handling to match on broadcast_res (the Result returned by
broadcast_task_handle.wait_for_exit()) and log at info or debug when it is
Ok(()) (or Ok value), and only emit warn! when it is Err(e). Keep the assignment
early_exit_result = Some(broadcast_res) but perform the conditional logging
before/after that assignment so normal/expected exits are logged at a lower
level and only errors use warn!.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
crates/chain/tests/block_production/unpledge_refund.rs (2)

87-89: ⚠️ Potential issue | 🟡 Minor

Rename misleading inclusion block wording.

This block is mined via genesis_node.mine_block(), so the inclusion_block_peer naming and "peer-mined" expectation text are inaccurate and can mislead failure triage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/block_production/unpledge_refund.rs` around lines 87 - 89,
The test's variable and failure message are misleading: change the variable name
inclusion_block_peer to reflect it was mined by the genesis node (e.g.,
inclusion_block_genesis) and update the expect() message passed to
wait_for_block_at_height to say "genesis-mined inclusion block" (or similar)
instead of "peer-mined inclusion block"; update all references to
inclusion_block_peer in this test (including the .height access and any other
uses) to the new name to keep names consistent with genesis_node.mine_block().

699-702: ⚠️ Potential issue | 🟡 Minor

Add contextual failure message for this wait call.

This wait_for_block_at_height(...).await? path is still inconsistent with nearby call sites that use .expect("...") context, which makes timeout diagnosis harder in CI logs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/block_production/unpledge_refund.rs` around lines 699 -
702, Replace the fallible .await? on genesis_node.wait_for_block_at_height with
an explicit panic message so timeouts show context; call
wait_for_block_at_height(inclusion_block.height,
seconds_to_wait).await.expect(...) and include the height
(inclusion_block.height) and timeout (seconds_to_wait) in the message to make CI
failures clear (e.g. "genesis node did not receive block at height {height}
within {secs}s").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/chain/tests/block_production/unpledge_refund.rs`:
- Around line 87-89: The test's variable and failure message are misleading:
change the variable name inclusion_block_peer to reflect it was mined by the
genesis node (e.g., inclusion_block_genesis) and update the expect() message
passed to wait_for_block_at_height to say "genesis-mined inclusion block" (or
similar) instead of "peer-mined inclusion block"; update all references to
inclusion_block_peer in this test (including the .height access and any other
uses) to the new name to keep names consistent with genesis_node.mine_block().
- Around line 699-702: Replace the fallible .await? on
genesis_node.wait_for_block_at_height with an explicit panic message so timeouts
show context; call wait_for_block_at_height(inclusion_block.height,
seconds_to_wait).await.expect(...) and include the height
(inclusion_block.height) and timeout (seconds_to_wait) in the message to make CI
failures clear (e.g. "genesis node did not receive block at height {height}
within {secs}s").

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1bf4292 and 6cbe46c.

📒 Files selected for processing (1)
  • crates/chain/tests/block_production/unpledge_refund.rs

…ed log

- Wait for ChainState::Onchain explicitly instead of any non-Unknown state
- Rename heavy3_test_ → heavy_test_ for single-node borealis test
- Downgrade reth shutdown channel closed log from error to debug

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/chain/tests/block_production/block_production.rs (1)

1572-1572: ⚠️ Potential issue | 🟡 Minor

#[tokio::test] should be #[test_log::test(tokio::test)] for consistency.

Every other async test in this file uses #[test_log::test(tokio::test)]. This test also sets up tracing (via initialize_tracing() at line 1583), making it a candidate for the standard attribute. Using the plain macro leaves the test outside the unified tracing/log-capture harness used across the suite.

🛠️ Proposed fix
-#[tokio::test]
+#[test_log::test(tokio::test)]
 /// test that config option max_commitment_txs_per_block is enforced

If initialize_tracing() is intentionally kept to configure a specific subscriber, ensure it doesn't conflict with test_log's subscriber registration (e.g., guard against "a global default trace dispatcher has already been set" panics).

As per coding guidelines: "Use #[test_log::test(tokio::test)] for tests that require tracing output".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/block_production/block_production.rs` at line 1572,
Replace the plain #[tokio::test] attribute on the test with
#[test_log::test(tokio::test)] to ensure it participates in the suite-wide
tracing/log-capture harness; update the test declaration that currently uses
#[tokio::test] to #[test_log::test(tokio::test)] and, in the test body where
initialize_tracing() is called, ensure you guard against re-registering a global
subscriber (e.g., make initialize_tracing() idempotent or skip setting a default
if one is already set) so it won’t panic when test_log has already registered a
subscriber.
♻️ Duplicate comments (2)
crates/chain/src/chain.rs (1)

216-222: ⚠️ Potential issue | 🟠 Major

Align the reth stop timeout with the ServiceSet shutdown budget.

Line [1272] can take ~10s × service_count, but Line [216] is capped by RETH_THREAD_STOP_TIMEOUT (Line [405], 60s). With the current service list size, normal shutdown can exceed 60s and produce spurious timeout errors.

Suggested fix
+/// Timeout for ServiceSet graceful shutdown aggregation.
+const SERVICE_SET_TIMEOUT: Duration = Duration::from_secs(120);
 /// Timeout for the reth thread to complete its full shutdown sequence.
-/// Budget: send(5s) + actor(25s) + service_set(~10 services × 10s) + task_manager(10s).
-/// Using 60s to cover typical case with margin; worst-case depends on service count.
-const RETH_THREAD_STOP_TIMEOUT: Duration = Duration::from_secs(60);
+/// Budget: send(5s) + actor(25s) + service_set(120s) + task_manager(10s) + margin.
+const RETH_THREAD_STOP_TIMEOUT: Duration = Duration::from_secs(170);

-                    service_set.graceful_shutdown().await;
+                    match tokio::time::timeout(SERVICE_SET_TIMEOUT, service_set.graceful_shutdown()).await {
+                        Ok(()) => debug!("ServiceSet graceful shutdown complete"),
+                        Err(_) => warn!("ServiceSet graceful shutdown timed out after {SERVICE_SET_TIMEOUT:?}"),
+                    }

Also applies to: 390-405, 1270-1272

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/src/chain.rs` around lines 216 - 222, The fixed
RETH_THREAD_STOP_TIMEOUT is causing spurious timeouts because shutdown can take
~10s × service_count; replace uses of RETH_THREAD_STOP_TIMEOUT in the
tokio::time::timeout calls (the rx match in chain.rs and the other occurrences
around 390-405 and 1270-1272) with the ServiceSet shutdown budget calculation
(e.g., obtain the ServiceSet shutdown_budget or compute per_service_timeout *
service_count — the same value used for ServiceSet shutdown) so the reth stop
wait uses the same total timeout as the ServiceSet shutdown budget.
crates/chain/tests/api/hardfork_tests.rs (1)

164-170: ⚠️ Potential issue | 🟡 Minor

Remove duplicated scanned/skipped values from the debug event.

scanned and skipped are already structured fields; interpolating them again in the message duplicates the same data.

Suggested fix
     tracing::debug!(
         ?tx_id,
         scanned,
         skipped,
         max_height,
-        "tx not found in {scanned} blocks ({skipped} heights failed to load)"
+        "tx not found in scanned blocks"
     );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/api/hardfork_tests.rs` around lines 164 - 170, The
tracing::debug call currently interpolates scanned and skipped both as
structured fields and again inside the message string; remove that duplication
by leaving scanned and skipped as structured fields (they are already passed as
?scanned and ?skipped) and change the message to not include {scanned} or
{skipped} (e.g. "tx not found in scanned blocks (some heights failed to load)")
in the tracing::debug invocation that includes ?tx_id, ?scanned, ?skipped,
?max_height.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@crates/chain/tests/block_production/block_production.rs`:
- Line 1572: Replace the plain #[tokio::test] attribute on the test with
#[test_log::test(tokio::test)] to ensure it participates in the suite-wide
tracing/log-capture harness; update the test declaration that currently uses
#[tokio::test] to #[test_log::test(tokio::test)] and, in the test body where
initialize_tracing() is called, ensure you guard against re-registering a global
subscriber (e.g., make initialize_tracing() idempotent or skip setting a default
if one is already set) so it won’t panic when test_log has already registered a
subscriber.

---

Duplicate comments:
In `@crates/chain/src/chain.rs`:
- Around line 216-222: The fixed RETH_THREAD_STOP_TIMEOUT is causing spurious
timeouts because shutdown can take ~10s × service_count; replace uses of
RETH_THREAD_STOP_TIMEOUT in the tokio::time::timeout calls (the rx match in
chain.rs and the other occurrences around 390-405 and 1270-1272) with the
ServiceSet shutdown budget calculation (e.g., obtain the ServiceSet
shutdown_budget or compute per_service_timeout * service_count — the same value
used for ServiceSet shutdown) so the reth stop wait uses the same total timeout
as the ServiceSet shutdown budget.

In `@crates/chain/tests/api/hardfork_tests.rs`:
- Around line 164-170: The tracing::debug call currently interpolates scanned
and skipped both as structured fields and again inside the message string;
remove that duplication by leaving scanned and skipped as structured fields
(they are already passed as ?scanned and ?skipped) and change the message to not
include {scanned} or {skipped} (e.g. "tx not found in scanned blocks (some
heights failed to load)") in the tracing::debug invocation that includes ?tx_id,
?scanned, ?skipped, ?max_height.

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cbe46c and ad42895.

📒 Files selected for processing (3)
  • crates/chain/src/chain.rs
  • crates/chain/tests/api/hardfork_tests.rs
  • crates/chain/tests/block_production/block_production.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
crates/chain/tests/block_production/unstake_refund.rs (1)

1100-1100: 🧹 Nitpick | 🔵 Trivial

heavy3_ prefix is still unexplained.

The rename from heavy_unpledge_and_unstake_concurrent_success_flow to heavy3_unpledge_and_unstake_concurrent_success_flow introduces a numeric prefix that differs from all other heavy_ tests in this file, and there is no inline comment or doc note explaining the convention (e.g., test tier, #[diag_slow] level). Either rename back to the established pattern or document what heavy3_ signifies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/block_production/unstake_refund.rs` at line 1100, The test
function heavy3_unpledge_and_unstake_concurrent_success_flow uses an unexplained
numeric prefix; either revert the name to the established
heavy_unpledge_and_unstake_concurrent_success_flow or add a one-line comment or
docstring above heavy3_unpledge_and_unstake_concurrent_success_flow explaining
the meaning of the "3" (e.g., test tier or diag_slow level) so naming is
consistent and discoverable; update any references to the test name if you
rename the function and ensure any test attributes like #[diag_slow] are
mentioned in the comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/chain/tests/block_production/unstake_refund.rs`:
- Around line 872-911: The function get_block_receipts double-counts the timeout
by waiting up to seconds_to_wait in wait_for_reth_marker and then again using
max_retries for the receipts polling loop; change the polling loop to use a
small fixed retry window (e.g., 3–5 retries or a fixed few-second total like 5s)
instead of reusing seconds_to_wait so you don't wait a second full timeout after
wait_for_reth_marker returns; update the variables around poll_interval and
max_retries (used in the for retry loop and the final error message) accordingly
and keep the wait_for_reth_marker call and the receipts_by_block check logic in
get_block_receipts unchanged except for the reduced retry budget.
- Around line 872-878: Remove the redundant reth_ctx parameter from the async
function get_block_receipts and all its call sites; instead access the reth
adapter via node.node_ctx.reth_node_adapter inside get_block_receipts. Update
the function signature to drop the reth_ctx parameter, replace any uses of
reth_ctx inside get_block_receipts with node.node_ctx.reth_node_adapter, and
update every caller (e.g., where genesis_node and
genesis_node.node_ctx.reth_node_adapter.clone() were both passed) to pass only
node; ensure types match (clone if needed) when using
node.node_ctx.reth_node_adapter at call sites.

---

Duplicate comments:
In `@crates/chain/tests/block_production/unstake_refund.rs`:
- Line 1100: The test function
heavy3_unpledge_and_unstake_concurrent_success_flow uses an unexplained numeric
prefix; either revert the name to the established
heavy_unpledge_and_unstake_concurrent_success_flow or add a one-line comment or
docstring above heavy3_unpledge_and_unstake_concurrent_success_flow explaining
the meaning of the "3" (e.g., test tier or diag_slow level) so naming is
consistent and discoverable; update any references to the test name if you
rename the function and ensure any test attributes like #[diag_slow] are
mentioned in the comment.

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ad42895 and 3599e4e.

📒 Files selected for processing (1)
  • crates/chain/tests/block_production/unstake_refund.rs

Comment on lines +872 to 911
async fn get_block_receipts(
node: &IrysNodeTest<irys_chain::IrysNodeCtx>,
reth_ctx: &irys_reth_node_bridge::IrysRethNodeAdapter,
block_height: u64,
block_hash: FixedBytes<32>,
seconds_to_wait: usize,
) -> eyre::Result<Vec<reth::primitives::Receipt>> {
reth_ctx
.inner
.provider
.receipts_by_block(HashOrNumber::Hash(block_hash))?
.ok_or_else(|| eyre::eyre!("Receipts not found for block {:?}", block_hash))
node.wait_for_reth_marker(
BlockNumberOrTag::Number(block_height),
block_hash,
seconds_to_wait as u64,
)
.await?;

let poll_interval = Duration::from_millis(200);
let max_retries = ((seconds_to_wait * 1000) / poll_interval.as_millis() as usize).max(1);

for retry in 0..=max_retries {
if let Some(receipts) = reth_ctx
.inner
.provider
.receipts_by_block(HashOrNumber::Hash(block_hash))?
{
return Ok(receipts);
}

if retry == max_retries {
break;
}
sleep(poll_interval).await;
}

Err(eyre::eyre!(
"Receipts not found for block {:?} at height {} after {} retries ({}s)",
block_hash,
block_height,
max_retries,
seconds_to_wait
))
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

get_block_receipts double-counts the timeout: effective max wait is 2 × seconds_to_wait.

wait_for_reth_marker already burns up to seconds_to_wait seconds waiting for the block to land in reth. The max_retries calculation then budgets the same seconds_to_wait a second time for the receipts polling loop, so the worst-case wall-clock wait is 40 s when seconds_to_wait = 20. Because wait_for_reth_marker guarantees the block is committed before returning, receipts are almost always available on the first poll; a much shorter retry budget (e.g., a fixed few seconds) is sufficient after a successful marker wait.

♻️ Proposed fix
-    let poll_interval = Duration::from_millis(200);
-    let max_retries = ((seconds_to_wait * 1000) / poll_interval.as_millis() as usize).max(1);
+    // After wait_for_reth_marker succeeds the block is already in reth; receipts
+    // should appear quickly — use a short fixed budget rather than re-spending the
+    // full seconds_to_wait.
+    let poll_interval = Duration::from_millis(200);
+    let max_retries = 25_usize; // 25 × 200 ms = 5 s safety margin
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async fn get_block_receipts(
node: &IrysNodeTest<irys_chain::IrysNodeCtx>,
reth_ctx: &irys_reth_node_bridge::IrysRethNodeAdapter,
block_height: u64,
block_hash: FixedBytes<32>,
seconds_to_wait: usize,
) -> eyre::Result<Vec<reth::primitives::Receipt>> {
reth_ctx
.inner
.provider
.receipts_by_block(HashOrNumber::Hash(block_hash))?
.ok_or_else(|| eyre::eyre!("Receipts not found for block {:?}", block_hash))
node.wait_for_reth_marker(
BlockNumberOrTag::Number(block_height),
block_hash,
seconds_to_wait as u64,
)
.await?;
let poll_interval = Duration::from_millis(200);
let max_retries = ((seconds_to_wait * 1000) / poll_interval.as_millis() as usize).max(1);
for retry in 0..=max_retries {
if let Some(receipts) = reth_ctx
.inner
.provider
.receipts_by_block(HashOrNumber::Hash(block_hash))?
{
return Ok(receipts);
}
if retry == max_retries {
break;
}
sleep(poll_interval).await;
}
Err(eyre::eyre!(
"Receipts not found for block {:?} at height {} after {} retries ({}s)",
block_hash,
block_height,
max_retries,
seconds_to_wait
))
}
async fn get_block_receipts(
node: &IrysNodeTest<irys_chain::IrysNodeCtx>,
reth_ctx: &irys_reth_node_bridge::IrysRethNodeAdapter,
block_height: u64,
block_hash: FixedBytes<32>,
seconds_to_wait: usize,
) -> eyre::Result<Vec<reth::primitives::Receipt>> {
node.wait_for_reth_marker(
BlockNumberOrTag::Number(block_height),
block_hash,
seconds_to_wait as u64,
)
.await?;
// After wait_for_reth_marker succeeds the block is already in reth; receipts
// should appear quickly — use a short fixed budget rather than re-spending the
// full seconds_to_wait.
let poll_interval = Duration::from_millis(200);
let max_retries = 25_usize; // 25 × 200 ms = 5 s safety margin
for retry in 0..=max_retries {
if let Some(receipts) = reth_ctx
.inner
.provider
.receipts_by_block(HashOrNumber::Hash(block_hash))?
{
return Ok(receipts);
}
if retry == max_retries {
break;
}
sleep(poll_interval).await;
}
Err(eyre::eyre!(
"Receipts not found for block {:?} at height {} after {} retries ({}s)",
block_hash,
block_height,
max_retries,
seconds_to_wait
))
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/block_production/unstake_refund.rs` around lines 872 -
911, The function get_block_receipts double-counts the timeout by waiting up to
seconds_to_wait in wait_for_reth_marker and then again using max_retries for the
receipts polling loop; change the polling loop to use a small fixed retry window
(e.g., 3–5 retries or a fixed few-second total like 5s) instead of reusing
seconds_to_wait so you don't wait a second full timeout after
wait_for_reth_marker returns; update the variables around poll_interval and
max_retries (used in the for retry loop and the final error message) accordingly
and keep the wait_for_reth_marker call and the receipts_by_block check logic in
get_block_receipts unchanged except for the reduced retry budget.

Comment on lines +872 to 878
async fn get_block_receipts(
node: &IrysNodeTest<irys_chain::IrysNodeCtx>,
reth_ctx: &irys_reth_node_bridge::IrysRethNodeAdapter,
block_height: u64,
block_hash: FixedBytes<32>,
seconds_to_wait: usize,
) -> eyre::Result<Vec<reth::primitives::Receipt>> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the function definition and nearby code
head -c 3000 crates/chain/tests/block_production/unstake_refund.rs | tail -c 2000

Repository: Irys-xyz/irys

Length of output: 2057


🏁 Script executed:

# Get the full function definition
sed -n '872,920p' crates/chain/tests/block_production/unstake_refund.rs

Repository: Irys-xyz/irys

Length of output: 1432


🏁 Script executed:

# Find all call sites of get_block_receipts
rg -n "get_block_receipts" crates/chain/tests/block_production/unstake_refund.rs -A3 -B1

Repository: Irys-xyz/irys

Length of output: 1721


🏁 Script executed:

# Search for the call context around lines 98-105 and 180-187
sed -n '95,110p' crates/chain/tests/block_production/unstake_refund.rs

Repository: Irys-xyz/irys

Length of output: 494


🏁 Script executed:

# Check the IrysNodeTest struct definition
fd -e rs | xargs rg -l "struct IrysNodeTest"

Repository: Irys-xyz/irys

Length of output: 84


🏁 Script executed:

# Look for reth_node_adapter in relation to IrysNodeTest
rg -n "reth_node_adapter" crates/chain/tests/ --type rust -B5 -A2 | head -60

Repository: Irys-xyz/irys

Length of output: 3719


Remove the redundant reth_ctx parameter — it can be accessed through node.node_ctx.reth_node_adapter.

At every call site (lines 98–105, 180–187, 242–245, etc.), both node (&genesis_node) and reth_ctx (genesis_node.node_ctx.reth_node_adapter.clone()) come from the same underlying object. Since IrysNodeTest already exposes reth_node_adapter through node_ctx, the separate parameter is unnecessary. Collapse them by removing reth_ctx from the signature and replace its usages with node.node_ctx.reth_node_adapter, reducing caller boilerplate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/block_production/unstake_refund.rs` around lines 872 -
878, Remove the redundant reth_ctx parameter from the async function
get_block_receipts and all its call sites; instead access the reth adapter via
node.node_ctx.reth_node_adapter inside get_block_receipts. Update the function
signature to drop the reth_ctx parameter, replace any uses of reth_ctx inside
get_block_receipts with node.node_ctx.reth_node_adapter, and update every caller
(e.g., where genesis_node and genesis_node.node_ctx.reth_node_adapter.clone()
were both passed) to pass only node; ensure types match (clone if needed) when
using node.node_ctx.reth_node_adapter at call sites.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
crates/chain/tests/api/hardfork_tests.rs (1)

165-170: ⚠️ Potential issue | 🟡 Minor

Remove duplicated scanned/skipped values in the debug log.

Line 169 interpolates {scanned} and {skipped} while those fields are already emitted structurally, so the event records them twice.

♻️ Proposed fix
     tracing::debug!(
         ?tx_id,
         scanned,
         skipped,
         max_height,
-        "tx not found in {scanned} blocks ({skipped} heights failed to load)"
+        "tx not found in searched blocks"
     );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/api/hardfork_tests.rs` around lines 165 - 170, The
debug/event log currently emits scanned and skipped both as structured fields
and again via string interpolation in the message; edit the log call that
includes tx_id, scanned, skipped, max_height and the message "tx not found in
{scanned} blocks ({skipped} heights failed to load)" to remove the "{scanned}"
and "{skipped}" interpolations from the message so those values are only emitted
once as structured fields (leave tx_id and max_height usage intact).
crates/chain/tests/utils.rs (2)

1446-1483: ⚠️ Potential issue | 🟠 Major

RecvError::Lagged is still treated as fatal in wait_for_reorg_subscribed.

Line 1464 returns an error on Lagged, which is a recoverable condition — the receiver is repositioned to the oldest retained message. Under CI load this causes spurious failures. Wrap in a loop, log the lag, and continue waiting until the deadline.

Suggested fix
     async fn wait_for_reorg_subscribed(
         &self,
         mut reorg_rx: tokio::sync::broadcast::Receiver<ReorgEvent>,
         seconds_to_wait: usize,
         start_tip: H256,
     ) -> eyre::Result<ReorgEvent> {
-        let timeout_duration = Duration::from_secs(seconds_to_wait as u64);
-        match tokio::time::timeout(timeout_duration, reorg_rx.recv()).await {
-            Ok(Ok(reorg_event)) => {
-                info!(
-                    "Reorg detected: {} blocks in old fork, {} in new fork, fork at height {}, new tip: {}",
-                    reorg_event.old_fork.len(),
-                    reorg_event.new_fork.len(),
-                    reorg_event.fork_parent.height,
-                    reorg_event.new_tip
-                );
-                Ok(reorg_event)
-            }
-            Ok(Err(tokio::sync::broadcast::error::RecvError::Lagged(skipped))) => Err(eyre::eyre!(
-                "Reorg broadcast receiver lagged and skipped {} events",
-                skipped
-            )),
-            Ok(Err(tokio::sync::broadcast::error::RecvError::Closed)) => {
-                Err(eyre::eyre!("Reorg broadcast channel closed"))
-            }
-            Err(_) => {
-                let current_tip = self.get_max_difficulty_block().block_hash;
-                let state = self.diag_wait_state().await;
-                Err(eyre::eyre!(
-                    "Timeout: No reorg event received within {} seconds (start_tip={}, current_tip={}). State: {}",
-                    seconds_to_wait,
-                    start_tip,
-                    current_tip,
-                    state
-                ))
+        let deadline = tokio::time::Instant::now()
+            + Duration::from_secs(seconds_to_wait as u64);
+        loop {
+            let remaining = deadline.saturating_duration_since(tokio::time::Instant::now());
+            match tokio::time::timeout(remaining, reorg_rx.recv()).await {
+                Ok(Ok(reorg_event)) => {
+                    info!(
+                        "Reorg detected: {} blocks in old fork, {} in new fork, fork at height {}, new tip: {}",
+                        reorg_event.old_fork.len(),
+                        reorg_event.new_fork.len(),
+                        reorg_event.fork_parent.height,
+                        reorg_event.new_tip
+                    );
+                    return Ok(reorg_event);
+                }
+                Ok(Err(tokio::sync::broadcast::error::RecvError::Lagged(skipped))) => {
+                    warn!("Reorg receiver lagged; skipped {skipped} events, continuing");
+                    continue;
+                }
+                Ok(Err(tokio::sync::broadcast::error::RecvError::Closed)) => {
+                    return Err(eyre::eyre!("Reorg broadcast channel closed"));
+                }
+                Err(_) => {
+                    let current_tip = self.get_max_difficulty_block().block_hash;
+                    let state = self.diag_wait_state().await;
+                    return Err(eyre::eyre!(
+                        "Timeout: No reorg event received within {} seconds (start_tip={}, current_tip={}). State: {}",
+                        seconds_to_wait,
+                        start_tip,
+                        current_tip,
+                        state
+                    ));
+                }
             }
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/utils.rs` around lines 1446 - 1483, In
wait_for_reorg_subscribed, treat
tokio::sync::broadcast::error::RecvError::Lagged as non-fatal by wrapping the
recv call in a loop that retries until the timeout_duration elapses: on
RecvError::Lagged(skipped) log the skipped count (using info/debug) and continue
waiting, on RecvError::Closed return the existing error, and on successful
Ok(reorg_event) return it; ensure the timeout uses the original deadline so
repeated Lagged occurrences don’t reset the timer and use the existing symbols
reorg_rx, timeout_duration, seconds_to_wait, start_tip,
get_max_difficulty_block, and diag_wait_state to preserve current
timeout/diagnostic behavior.

2249-2272: wait_for_tx_included still swallows all errors indiscriminately.

Line 2263 catches every Err(_) as a retry, masking channel/service failures that would never self-resolve. Only "not found" errors should be retried; genuine service failures should propagate immediately for better diagnostics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/utils.rs` around lines 2249 - 2272, wait_for_tx_included
is currently treating every Err from get_storage_tx_header_from_mempool as a
transient “retry”, which hides permanent service/channel failures; change the
error handling so you only retry when the error represents “not found” (or
whatever concrete NotFound variant your storage API returns) and immediately
return Err for all other errors. Locate wait_for_tx_included and the call to
get_storage_tx_header_from_mempool(tx_id) and replace the blanket Err(_) arm
with logic that inspects the error (e.g., pattern-match on the concrete error
enum or use an is_not_found helper) — if it’s a not-found case continue polling,
otherwise propagate the error (return Err(e)) so real failures surface. Ensure
the returned error preserves the original error value/type instead of swallowing
it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/chain/tests/utils.rs`:
- Around line 3519-3594: The code auto-starts the VDF in
solution_context_with_poa_chunk by calling node_ctx.start_vdf() then runs an
async block but calls node_ctx.stop_vdf() only after awaiting the block, so a
panic in the block can leave VDF running; fix by making the restore panic-safe —
either create a small RAII guard type (e.g., VdfGuard) that captures node_ctx
and the original was_vdf_enabled and calls node_ctx.stop_vdf() in Drop when it
started the VDF, or use scopeguard::defer! immediately after start_vdf() to
ensure stop_vdf() always runs; apply the identical change to solution_context as
well and reference node_ctx.start_vdf, node_ctx.stop_vdf,
solution_context_with_poa_chunk, and solution_context when making the change.

---

Duplicate comments:
In `@crates/chain/tests/api/hardfork_tests.rs`:
- Around line 165-170: The debug/event log currently emits scanned and skipped
both as structured fields and again via string interpolation in the message;
edit the log call that includes tx_id, scanned, skipped, max_height and the
message "tx not found in {scanned} blocks ({skipped} heights failed to load)" to
remove the "{scanned}" and "{skipped}" interpolations from the message so those
values are only emitted once as structured fields (leave tx_id and max_height
usage intact).

In `@crates/chain/tests/utils.rs`:
- Around line 1446-1483: In wait_for_reorg_subscribed, treat
tokio::sync::broadcast::error::RecvError::Lagged as non-fatal by wrapping the
recv call in a loop that retries until the timeout_duration elapses: on
RecvError::Lagged(skipped) log the skipped count (using info/debug) and continue
waiting, on RecvError::Closed return the existing error, and on successful
Ok(reorg_event) return it; ensure the timeout uses the original deadline so
repeated Lagged occurrences don’t reset the timer and use the existing symbols
reorg_rx, timeout_duration, seconds_to_wait, start_tip,
get_max_difficulty_block, and diag_wait_state to preserve current
timeout/diagnostic behavior.
- Around line 2249-2272: wait_for_tx_included is currently treating every Err
from get_storage_tx_header_from_mempool as a transient “retry”, which hides
permanent service/channel failures; change the error handling so you only retry
when the error represents “not found” (or whatever concrete NotFound variant
your storage API returns) and immediately return Err for all other errors.
Locate wait_for_tx_included and the call to
get_storage_tx_header_from_mempool(tx_id) and replace the blanket Err(_) arm
with logic that inspects the error (e.g., pattern-match on the concrete error
enum or use an is_not_found helper) — if it’s a not-found case continue polling,
otherwise propagate the error (return Err(e)) so real failures surface. Ensure
the returned error preserves the original error value/type instead of swallowing
it.

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3599e4e and 0d0df2a.

📒 Files selected for processing (2)
  • crates/chain/tests/api/hardfork_tests.rs
  • crates/chain/tests/utils.rs

Comment on lines +3519 to +3594
let was_vdf_enabled = node_ctx
.is_vdf_mining_enabled
.load(std::sync::atomic::Ordering::Relaxed);
if !was_vdf_enabled {
node_ctx.start_vdf();
}

let result = async {
// Ensure the VDF has at least two steps materialized (N-1, N)
let vdf_steps_guard = node_ctx.vdf_steps_guard.clone();
let start = std::time::Instant::now();
let max_wait = std::time::Duration::from_secs(5);
let (step, steps) = loop {
if start.elapsed() > max_wait {
return Err(eyre::eyre!(
"VDF steps unavailable: timed out waiting for (prev,current) pair"
));
}
let s = vdf_steps_guard.read().global_step;
if s >= 1 {
if let Ok(steps) = vdf_steps_guard.read().get_steps(ii(s - 1, s)) {
if steps.len() >= 2 {
break (s, steps);
}
}
}
}
tokio::time::sleep(std::time::Duration::from_millis(50)).await;
};
tokio::time::sleep(std::time::Duration::from_millis(50)).await;
};

// Compute checkpoints for (step-1)
let mut hasher = Sha256::new();
let mut salt =
irys_types::U256::from(step_number_to_salt_number(&node_ctx.config.vdf, step - 1));
let mut seed = steps[0];
let mut checkpoints: Vec<H256> =
vec![H256::default(); node_ctx.config.vdf.num_checkpoints_in_vdf_step];
vdf_sha(
&mut hasher,
&mut salt,
&mut seed,
node_ctx.config.vdf.num_checkpoints_in_vdf_step,
node_ctx.config.vdf.num_iterations_per_checkpoint(),
&mut checkpoints,
);
// Compute checkpoints for (step-1)
let mut hasher = Sha256::new();
let mut salt =
irys_types::U256::from(step_number_to_salt_number(&node_ctx.config.vdf, step - 1));
let mut seed = steps[0];
let mut checkpoints: Vec<H256> =
vec![H256::default(); node_ctx.config.vdf.num_checkpoints_in_vdf_step];
vdf_sha(
&mut hasher,
&mut salt,
&mut seed,
node_ctx.config.vdf.num_checkpoints_in_vdf_step,
node_ctx.config.vdf.num_iterations_per_checkpoint(),
&mut checkpoints,
);

// For deterministic linkage without recall-range dependency, use offset 0
let partition_hash = H256::zero();
let partition_chunk_offset: u32 = 0;
// For deterministic linkage without recall-range dependency, use offset 0
let partition_hash = H256::zero();
let partition_chunk_offset: u32 = 0;

// Compute solution_hash = sha256(poa_chunk || offset_le || vdf_output)
let mut hasher_sol = Sha256::new();
hasher_sol.update(&poa_chunk);
hasher_sol.update(partition_chunk_offset.to_le_bytes());
hasher_sol.update(steps[1].as_bytes());
let solution_hash = H256::from_slice(hasher_sol.finalize().as_slice());

Ok(SolutionContext {
partition_hash,
chunk_offset: partition_chunk_offset,
mining_address: node_ctx.config.node_config.miner_address(),
tx_path: None,
data_path: None,
chunk: poa_chunk,
vdf_step: step,
checkpoints: H256List(checkpoints),
seed: Seed(steps[1]),
solution_hash,
})
}
.await;

// Compute solution_hash = sha256(poa_chunk || offset_le || vdf_output)
let mut hasher_sol = Sha256::new();
hasher_sol.update(&poa_chunk);
hasher_sol.update(partition_chunk_offset.to_le_bytes());
hasher_sol.update(steps[1].as_bytes());
let solution_hash = H256::from_slice(hasher_sol.finalize().as_slice());
if !was_vdf_enabled {
node_ctx.stop_vdf();
}

node_ctx.stop_vdf();
Ok(SolutionContext {
partition_hash,
chunk_offset: partition_chunk_offset,
mining_address: node_ctx.config.node_config.miner_address(),
tx_path: None,
data_path: None,
chunk: poa_chunk,
vdf_step: step,
checkpoints: H256List(checkpoints),
seed: Seed(steps[1]),
solution_hash,
})
result

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

VDF auto-start in solution_context_with_poa_chunk is not panic-safe.

If the inner async block panics (lines 3526–3587), the stop_vdf() call at line 3591 is never reached. While test panics typically abort, in #[should_panic] tests or multi-node scenarios this could leave the VDF running. A scope-guard (e.g. scopeguard::defer!) or refactoring the restore into a Drop wrapper would make this robust.

The same pattern exists in solution_context (lines 3609–3624).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/chain/tests/utils.rs` around lines 3519 - 3594, The code auto-starts
the VDF in solution_context_with_poa_chunk by calling node_ctx.start_vdf() then
runs an async block but calls node_ctx.stop_vdf() only after awaiting the block,
so a panic in the block can leave VDF running; fix by making the restore
panic-safe — either create a small RAII guard type (e.g., VdfGuard) that
captures node_ctx and the original was_vdf_enabled and calls node_ctx.stop_vdf()
in Drop when it started the VDF, or use scopeguard::defer! immediately after
start_vdf() to ensure stop_vdf() always runs; apply the identical change to
solution_context as well and reference node_ctx.start_vdf, node_ctx.stop_vdf,
solution_context_with_poa_chunk, and solution_context when making the change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants