test: stability#1142
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRefactors node shutdown to a token-driven async model with per-subsystem completion receivers; adds a proc-macro Changes
Sequence Diagram(s)sequenceDiagram
participant Main as Main (startup/stop)
participant InitServices as init_services_thread<br/>oneshot rx
participant InitReth as init_reth_thread<br/>oneshot rx
participant InitVDF as init_vdf_thread<br/>oneshot rx
participant Subsys as Subsystems (Actor / Reth / VDF)
participant Stop as stop()/Shutdown
Main->>Main: create CancellationToken (shutdown_token)
Main->>InitServices: start(token)
Main->>InitReth: start(token)
Main->>InitVDF: start(token)
InitServices-->>Main: return actor_done_rx (oneshot)
InitReth-->>Main: return reth_done_rx (oneshot)
InitVDF-->>Main: return vdf_done_rx (oneshot)
Note over Main,Subsys: runtime operates
Main->>Stop: call stop()
Stop->>Main: token.cancel()
Main->>Subsys: subsystems observe token cancelled
Subsys-->>Main: signal completion via respective oneshot receivers
Main->>Main: await reth_done_rx (with timeout)
Main->>Main: await vdf_done_rx (with timeout)
Main->>Main: await actor_done_rx (with timeout)
Main->>Main: finalize resource release
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Previously, when the actor main thread received a `None` shutdown reason (e.g. if the reth thread panicked before sending), the VDF shutdown signal was never sent but `vdf_thread.join()` was still called, causing an infinite hang during `stop()`. Also handle channel disconnection in the VDF loop as a safety net so the thread exits even if the sender is dropped without sending.
- Add `wait_for_block_in_index` helper that polls both index entry AND DB header, fixing the race where the index height updates before the block header is persisted to the database. - Use the new helper in `heavy3_mine_ten_blocks_with_migration_depth_two` and `heavy3_reth_restarts_use_block_index_before_sync`. - Rename `solution_reused_when_parent_changes_but_valid` to `heavy_` prefix (2-node test) and relax VDF step assertion from `>` to `>=`. - Rename `evm_payload_with_versioned_hashes_is_rejected` and `evm_payload_with_withdrawals_is_rejected` to `heavy_` prefix so they get adequate thread resources in CI.
- Add retry logic to `find_tx_in_blocks` helper (up to 10 retries with 1s delay) so transient block unavailability doesn't cause false negatives in hardfork tests. - Add `wait_for_mempool_commitment_txs` before mining in `heavy_pending_pledges_test` to ensure async pending-pledge reprocessing completes before block production. - Rename remaining blobs_rejected tests (`evm_payload_with_blob_gas_used` and `evm_payload_with_excess_blob_gas`) to `heavy_` prefix for adequate CI thread allocation. - Rename `test_auto_stake_pledge` to `heavy_` prefix (runs 2 full nodes). - Replace arbitrary 2-second sleep in auto stake & pledge task with `sync_state.wait_for_sync()` so the task waits for block sync to complete rather than racing against a fixed delay.
Add block_quiescence helper to IrysNodeTest that subscribes to BlockStateUpdated events and waits until no events arrive for an idle period (subscribe-before-act pattern to avoid missing events). In chain.rs, subscribe before initial_sync and await quiescence in the spawned auto stake task, replacing the fragile 2-second sleep. Document the pattern in the integration test skill.
Move from flat file to skill-name/SKILL.md structure for proper discovery. Improve description (CSO-compliant triggers only), add overview, input contract, decision flowchart, common mistakes, and "when not to use" section.
…tion read_block_from_state() subscribed to the broadcast channel after blocks were already sent for validation, causing a race where fast validation could complete and discard the block before the subscription was created. This made tests flaky with "Timeout waiting for block validation" errors. Move the subscription before the send at all 24 call sites, and add a send_block_and_read_state() helper for the common pattern.
Reduced from 269 to 107 lines by removing exhaustive API tables (agent can read utils.rs directly) and adding an example tests table pointing to real tests for each common pattern.
Replace wait_until_height (1s polling) with wait_for_block_at_height (event-driven) in mine_blocks() and mine_block() to eliminate CPU starvation flakes in CI. Bump slow_heavy3_mine_ten_blocks to heavy4 for autonomous mining thread headroom.
Replace implicit timing dependencies with explicit state synchronization: - heavy_test_mine_tx: poll for mempool included_height after block confirmation - heavy_test_multi_node_epoch_replay: wait for peer to sync epoch blocks via gossip - heavy_unpledge_all_partitions_refund_flow: wait for genesis to receive peer's block - heavy_solution_discarded_vdf_too_old: wait for node1 to sync node2's blocks before resume
…ndition
Two independent fixes for heavy_test_aurora_hardfork_recovery_peer_sync flake:
1. BlockTreeService shutdown resilience (block_tree_service.rs):
- Make chunk_migration channel send non-fatal on close (warn instead of error)
- Make shutdown drain loop tolerant of downstream channel closure
2. Hardfork test race condition (hardfork_tests.rs):
- Replace wait_until_block_index_height with wait_for_block_in_index
- Root cause: in-memory block index updates synchronously but MDBX
persistence happens asynchronously via broadcast event in mempool
service. The old wait returned before DB writes completed.
Replace wait_until_block_index_height (checks only in-memory index) with wait_for_block_in_index (checks index + MDBX) before calling get_block_by_height_from_index which reads from the database.
Replace polling wait_until_height_confirmed (1s sleep loop) with event-driven wait_for_block_at_height + block_quiescence in the ema_forks test. The test intentionally gossips only the tip block to exercise the background block sync path. The old polling approach wasted timeout budget under CI load causing flaky timeouts. Also revert nextest retries to 0 (masks flakiness instead of fixing it).
There was a problem hiding this comment.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/chain/tests/multi_node/mempool_tests.rs (1)
340-344:⚠️ Potential issue | 🟡 MinorUse
#[test_log::test(tokio::test)]for this tracing-heavy async test.This test explicitly sets
RUST_LOGand callsinitialize_tracing(), but is annotated with#[tokio::test]. Per the coding guidelines it should use#[test_log::test(tokio::test)].🔧 Proposed fix
-#[tokio::test] +#[test_log::test(tokio::test)] async fn heavy_pending_pledges_test() -> eyre::Result<()> { - std::env::set_var("RUST_LOG", "debug"); - initialize_tracing();As per coding guidelines: "Use
#[test_log::test(tokio::test)]for tests that require tracing output".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/multi_node/mempool_tests.rs` around lines 340 - 344, The test heavy_pending_pledges_test is currently annotated with #[tokio::test] but configures tracing (sets RUST_LOG and calls initialize_tracing()), so change the test attribute to use the test_log wrapper: replace #[tokio::test] with #[test_log::test(tokio::test)] on the heavy_pending_pledges_test function to ensure tracing output is captured per project guidelines.crates/chain/tests/data_sync/sync_partition_data_tests.rs (1)
197-245: 🧹 Nitpick | 🔵 TrivialCollapse repeated per-node slot evaluation into a helper.
The genesis/peer1/peer2 expectation checks are structurally identical. Extracting one helper would reduce copy/paste drift and keep future expectation changes consistent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/data_sync/sync_partition_data_tests.rs` around lines 197 - 245, The three repeated blocks that call check_storage_module_chunks and slot_matches_expectation for genesis, peer1, and peer2 should be collapsed into a single helper function (e.g., evaluate_node or check_node_slots) that takes the node handle, a node label string, and the expectations (slot0_expectation, slot1_expectation) and returns a (synced: bool, diag: String) tuple; inside the helper call check_storage_module_chunks for DataLedger::Publish and DataLedger::Submit for slots 0 and 1 and reuse slot_matches_expectation to compute booleans and diagnostics, then replace the genesis_synced/peer1_synced/peer2_synced assignments and last_genesis_diag/last_peer1_diag/last_peer2_diag formatting with calls to this helper to eliminate duplication and ensure consistent behavior.
♻️ Duplicate comments (11)
crates/chain/tests/block_production/unstake_refund.rs (1)
1022-1022:heavy3_prefix is still unexplained — prior review concern unresolved.The numeric suffix introduced by renaming
heavy_unpledge_and_unstake_concurrent_success_flow→heavy3_unpledge_and_unstake_concurrent_success_flowis still undocumented. All other tests in this file continue to use plainheavy_prefixes. Ifheavy3_denotes a diagnostic/timeout tier associated with#[diag_slow], a brief doc-comment above the function would clarify the convention for future maintainers.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/block_production/unstake_refund.rs` at line 1022, The test function heavy3_unpledge_and_unstake_concurrent_success_flow has an unexplained numeric prefix "heavy3_"—add a concise doc-comment immediately above the function that documents the naming convention (e.g., that "heavy"/"heavyN" denotes test weight/diagnostic/timeout tier and that "3" maps to the #[diag_slow] or specific timeout/diagnostic level used) so future maintainers understand the suffix; reference the function name heavy3_unpledge_and_unstake_concurrent_success_flow and, if relevant, mention the association with #[diag_slow] and any expected timeout/behavior in that single-line doc-comment.crates/chain/tests/multi_node/mempool_tests.rs (1)
1924-1925:⚠️ Potential issue | 🟡 Minor
slow_heavy3_test_evm_gossipstill uses#[tokio::test]despite settingRUST_LOGand callinginitialize_tracing().This is unchanged from the previously flagged concern — the attribute should be
#[test_log::test(tokio::test)].🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/multi_node/mempool_tests.rs` around lines 1924 - 1925, The test function slow_heavy3_test_evm_gossip is still annotated with #[tokio::test] even though you set RUST_LOG and call initialize_tracing(); replace the attribute with #[test_log::test(tokio::test)] on the slow_heavy3_test_evm_gossip function so the test harness initializes logging/tracing properly when running under Tokio.crates/actors/src/services.rs (1)
228-237: 🧹 Nitpick | 🔵 Trivial
()return type still collapses three distinct exit causes.All three exit paths — idle gap elapsed, deadline exceeded, channel closed — return
(), making it impossible for call sites to distinguish a healthy idle from an unexpected channel close or a deadline overrun. This was raised in a previous review on the predecessor function.♻️ Proposed fix — discriminated return type
+#[derive(Debug, PartialEq, Eq)] +pub enum BroadcastIdleResult { + /// No events arrived for the full `idle` window. + Idle, + /// The outer deadline was exceeded before the idle window could elapse. + Deadline, + /// The channel was closed. + Closed, +} pub async fn wait_until_broadcast_idle<T: Clone>( rx: &mut tokio::sync::broadcast::Receiver<T>, idle: std::time::Duration, deadline: tokio::time::Instant, -) { +) -> BroadcastIdleResult { loop { match tokio::time::timeout_at(deadline, tokio::time::timeout(idle, rx.recv())).await { Ok(Ok(Ok(_))) => continue, Ok(Ok(Err(tokio::sync::broadcast::error::RecvError::Lagged(_)))) => continue, - Ok(Ok(Err(tokio::sync::broadcast::error::RecvError::Closed))) => break, - Ok(Err(_)) => break, - Err(_) => break, + Ok(Ok(Err(tokio::sync::broadcast::error::RecvError::Closed))) => { + return BroadcastIdleResult::Closed; + } + Ok(Err(_)) => return BroadcastIdleResult::Idle, + Err(_) => return BroadcastIdleResult::Deadline, } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/actors/src/services.rs` around lines 228 - 237, The loop handling tokio::time::timeout_at(deadline, tokio::time::timeout(idle, rx.recv())) currently collapses three distinct exit causes into unit () — change the function to return a discriminated enum (e.g., Idle, DeadlineExceeded, ChannelClosed or similar) instead of () and have each match arm return the appropriate variant (map Ok(Ok(Ok(_))) => Idle, Err(_) from timeout_at => DeadlineExceeded, Ok(Ok(Err(Closed))) => ChannelClosed, and preserve Lagged as Idle/continue or a distinct variant if desired); update callers to handle the new enum. Ensure to define the enum near the function and update match arms that now use tokio::time::timeout_at, tokio::time::timeout, rx.recv(), and other referenced symbols so exit causes are distinguishable.crates/vdf/src/vdf.rs (1)
84-86:⚠️ Potential issue | 🟠 MajorCancellation can still be starved during fast-forward drain.
The
while let Ok(...)loop at line 90 drains all pending fast-forward messages without re-checkingshutdown_token.is_cancelled(). Under sustained fast-forward traffic, shutdown could be delayed indefinitely.🛠️ Proposed fix
- loop { + 'vdf_loop: loop { if shutdown_token.is_cancelled() { tracing::info!("VDF loop: shutdown token cancelled, exiting"); break; } // check for VDF fast forward step while let Ok(traced_ff_step) = fast_forward_receiver.try_recv() { + if shutdown_token.is_cancelled() { + tracing::info!("VDF loop: shutdown token cancelled during fast-forward drain, exiting"); + break 'vdf_loop; + } let (proposed_ff_step, _entered) = traced_ff_step.into_inner();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vdf/src/vdf.rs` around lines 84 - 86, The fast-forward drain loop (the while let Ok(...) loop in vdf.rs that consumes fast-forward messages) can starve shutdown because it does not re-check shutdown_token.is_cancelled() while draining; modify the loop so it checks shutdown_token.is_cancelled() between iterations (or incorporate the token into the loop condition) and break immediately if cancelled, ensuring the VDF loop exits promptly even under sustained fast-forward traffic.crates/chain/tests/block_production/unpledge_refund.rs (2)
697-709:⚠️ Potential issue | 🟡 MinorInconsistent error propagation:
.await?vs.await.expect("…").Line 702 uses
.await?while other equivalentwait_for_block_at_heightsites (lines 88, 212, 799) use.await.expect("…")with descriptive messages. The.expectform provides better CI diagnostics.♻️ Proposed fix
genesis_node .wait_for_block_at_height(inclusion_block.height, seconds_to_wait) - .await?; + .await + .expect("genesis should sync peer-mined inclusion block");🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/block_production/unpledge_refund.rs` around lines 697 - 709, Replace the fallible `.await?` calls with `.await.expect("...")` to match the rest of the test and improve CI diagnostics: after calling `peer_node.mine_block().await?` use `genesis_node.wait_for_block_at_height(...).await.expect("timed out waiting for genesis to receive mined block via gossip")` and likewise change the `genesis_node.wait_for_reth_marker(...).await?` to `.await.expect("timed out waiting for genesis Reth marker for mined block")`; locate these changes around the calls to `peer_node.mine_block`, `genesis_node.wait_for_block_at_height`, and `genesis_node.wait_for_reth_marker` and use descriptive messages consistent with other occurrences.
85-89:⚠️ Potential issue | 🟡 MinorMisleading variable name:
inclusion_block_peeris mined by genesis, not the peer.
genesis_node.mine_block()is called on line 85, but the variable name and.expectmessage suggest the block comes from a peer. This was flagged in a previous review and remains unaddressed.🔤 Proposed rename
- let inclusion_block_peer = genesis_node.mine_block().await?; + let inclusion_block_mined = genesis_node.mine_block().await?; genesis_node - .wait_for_block_at_height(inclusion_block_peer.height, seconds_to_wait) + .wait_for_block_at_height(inclusion_block_mined.height, seconds_to_wait) .await - .expect("genesis should sync peer-mined inclusion block"); + .expect("genesis should finish processing inclusion block"); let inclusion_block = genesis_node - .get_block_by_height(inclusion_block_peer.height) + .get_block_by_height(inclusion_block_mined.height) .await?;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/block_production/unpledge_refund.rs` around lines 85 - 89, The variable name inclusion_block_peer is misleading because the block is produced by genesis_node.mine_block(); rename it to something like inclusion_block or genesis_mined_block and update the assertion message in the wait_for_block_at_height().expect(...) call (currently "genesis should sync peer-mined inclusion block") to reflect that the block was mined by genesis (e.g., "genesis should sync its mined inclusion block"); change references to inclusion_block_peer to the new name and ensure tests continue to compile and run.crates/actors/src/block_tree_service.rs (1)
183-190:⚠️ Potential issue | 🟠 MajorDrain loop still aborts early for non-
BlockValidationFinishedmessages during shutdown.Skipping
BlockValidationFinishedis a good step, butBlockPreValidatedmessages can also fail during shutdown (e.g.,ValidationServiceUnreachablewhen the validation service channel is already closed at line 381). The?on line 189 will abort the drain and bubble an error out ofstart().Consider logging and continuing for all message types during the drain, not just
BlockValidationFinished:🛠️ Proposed fix
while let Ok(traced) = self.msg_rx.try_recv() { let (msg, parent_span) = traced.into_parts(); match msg { // Skip: talks to downstream services (reth FCU, migration) that // may already be stopped during shutdown. BlockTreeServiceMessage::BlockValidationFinished { .. } => { debug!("Skipping BlockValidationFinished during shutdown drain"); } - msg => self.inner.handle_message(msg, parent_span).await?, + msg => { + if let Err(e) = self.inner.handle_message(msg, parent_span).await { + debug!("Error handling message during shutdown drain: {:?}", e); + } + } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/actors/src/block_tree_service.rs` around lines 183 - 190, The drain loop in start() currently only skips BlockValidationFinished and awaits self.inner.handle_message(msg, parent_span).await? which will propagate any error (e.g., ValidationServiceUnreachable for BlockPreValidated) and abort the drain; change the logic so that inside the match over BlockTreeServiceMessage you call self.inner.handle_message(msg, parent_span).await but do not use the ? — instead match on the Result, log the error with context (including the message variant) and continue the loop; ensure BlockValidationFinished remains skipped and that errors from handle_message (for example when handling BlockPreValidated) are logged and ignored during shutdown rather than returned..claude/skills/debugging-flaky-tests/SKILL.md (1)
185-185:⚠️ Potential issue | 🟡 MinorParameter name
chunkis misleading — actual argument is a boolean.The past review noted this: actual call sites pass a
bool(e.g.,wait_for_block_in_index(migration_height, false, seconds_to_wait)). The parameter should reflect this.📝 Proposed fix
-| `wait_for_block_in_index(h, chunk, secs)` | Polling DB (1s) | Need block in persistent index | +| `wait_for_block_in_index(h, include_txs, secs)` | Polling DB (1s) | Need block in persistent index |🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/debugging-flaky-tests/SKILL.md at line 185, The parameter name chunk in function wait_for_block_in_index(h, chunk, secs) is misleading because callers pass a boolean; rename the parameter to something like expect_persistent (or is_persistent) and update the function signature/type hints, its docstring, all internal logic that treats it as a boolean, and every call site that passes a boolean (e.g., wait_for_block_in_index(migration_height, false, seconds_to_wait)) to use the new name; ensure tests and any comments are updated accordingly to reflect the boolean semantic.crates/macros/diag-slow/src/lib.rs (1)
101-112: 🧹 Nitpick | 🔵 TrivialComment on the
async movebranch still mentions "state diagnostics" — misleading.The
async movebranch is taken whenstate_expris absent, so the comment "ownership transfer doesn't conflict with state diagnostics borrowing captured variables" references a feature that does not apply in this branch. The correct explanation is thatmoveis safe here precisely because there is nostate_exprneeding to borrow outer captures.📝 Proposed fix
} else { quote! { - // `move` is safe here: ownership transfer doesn't conflict with - // state diagnostics borrowing captured variables. + // No `state_expr`: no diagnostic closure needs to borrow outer + // captures, so `async move` safely transfers ownership into the + // future and avoids lifetime issues. let __diag_fut = async move `#body`; } };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/macros/diag-slow/src/lib.rs` around lines 101 - 112, The comment on the `async move` branch is misleading because that branch is executed when `state_expr` is None; update the comment around `fut_decl` for the `else` branch (the `let __diag_fut = async move `#body`;` case) to state that `move` is safe because there is no `state_expr` that would need to borrow captured variables (i.e., ownership transfer is fine when `state_expr` is absent), and remove or replace the reference to "state diagnostics" borrowing captured variables to accurately reflect the absence of `state_expr`.crates/macros/diag-slow/Cargo.toml (1)
1-19: LGTM —quoteis now a workspace dependency as previously recommended.The Cargo.toml is correctly configured:
proc-macro = true,syn = { workspace = true, features = ["full"] }, andquote = { workspace = true }now that the root workspace declaresquote = "1.0".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/macros/diag-slow/Cargo.toml` around lines 1 - 19, No change required: the Cargo.toml for the irys-macros-diag-slow crate is correctly configured—ensure proc-macro = true remains in the [lib] section and the dependencies keep syn = { workspace = true, features = ["full"] } and quote = { workspace = true } so the crate uses the workspace-declared quote = "1.0".crates/chain/tests/utils.rs (1)
1447-1461:⚠️ Potential issue | 🟠 MajorHandle
RecvError::Laggedas recoverable in reorg waits.Line 1458 returns an error on lag, which can fail tests even though the receiver can continue. This should continue waiting until timeout/closed instead.
💡 Suggested fix
- let timeout_duration = Duration::from_secs(seconds_to_wait as u64); - match tokio::time::timeout(timeout_duration, reorg_rx.recv()).await { - Ok(Ok(reorg_event)) => { - info!( - "Reorg detected: {} blocks in old fork, {} in new fork, fork at height {}, new tip: {}", - reorg_event.old_fork.len(), - reorg_event.new_fork.len(), - reorg_event.fork_parent.height, - reorg_event.new_tip - ); - Ok(reorg_event) - } - Ok(Err(tokio::sync::broadcast::error::RecvError::Lagged(skipped))) => Err(eyre::eyre!( - "Reorg broadcast receiver lagged and skipped {} events", - skipped - )), - Ok(Err(tokio::sync::broadcast::error::RecvError::Closed)) => { - Err(eyre::eyre!("Reorg broadcast channel closed")) - } - Err(_) => { - let current_tip = self.get_max_difficulty_block().block_hash; - let state = self.diag_wait_state().await; - Err(eyre::eyre!( - "Timeout: No reorg event received within {} seconds (start_tip={}, current_tip={}). State: {}", - seconds_to_wait, - start_tip, - current_tip, - state - )) - } - } + let deadline = Instant::now() + Duration::from_secs(seconds_to_wait as u64); + loop { + let remaining = deadline.saturating_duration_since(Instant::now()); + if remaining.is_zero() { + let current_tip = self.get_max_difficulty_block().block_hash; + let state = self.diag_wait_state().await; + return Err(eyre::eyre!( + "Timeout: No reorg event received within {} seconds (start_tip={}, current_tip={}). State: {}", + seconds_to_wait, + start_tip, + current_tip, + state + )); + } + match tokio::time::timeout(remaining, reorg_rx.recv()).await { + Ok(Ok(reorg_event)) => { + info!( + "Reorg detected: {} blocks in old fork, {} in new fork, fork at height {}, new tip: {}", + reorg_event.old_fork.len(), + reorg_event.new_fork.len(), + reorg_event.fork_parent.height, + reorg_event.new_tip + ); + return Ok(reorg_event); + } + Ok(Err(tokio::sync::broadcast::error::RecvError::Lagged(skipped))) => { + warn!("Reorg receiver lagged; skipped {skipped} events, continuing"); + continue; + } + Ok(Err(tokio::sync::broadcast::error::RecvError::Closed)) => { + return Err(eyre::eyre!("Reorg broadcast channel closed")); + } + Err(_) => continue, + } + }tokio::sync::broadcast::Receiver::recv RecvError::Lagged semantics: is Lagged recoverable and should receiver continue waiting?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/utils.rs` around lines 1447 - 1461, The test currently treats tokio::sync::broadcast::error::RecvError::Lagged as a terminal error when awaiting reorg_rx.recv(), causing spurious failures; change the handling in the match around tokio::time::timeout(timeout_duration, reorg_rx.recv()).await so that when you get RecvError::Lagged(skipped) you log or warn (include skipped) and then continue waiting (i.e., do not return Err, loop until a real event, timeout, or closed), leaving the Ok(Ok(reorg_event)) and other error branches intact; look for the match that calls reorg_rx.recv() and adjust the Lagged arm to be recoverable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/writing-integration-tests/SKILL.md:
- Line 107: The phrase "very large epochs" in the sentence starting with
"Epoch-aware tests need `testing_with_epochs(n)`" is vague—replace it with a
more precise descriptor such as "epochs sized for production, not tests" or
"epochs orders-of-magnitude larger than typical test runs" (or "extremely large"
if you prefer a shorter change) so readers know why `testing_with_epochs(n)` is
needed; update the line in SKILL.md that contains "Epoch-aware tests need
`testing_with_epochs(n)`" accordingly.
In `@crates/chain/src/chain.rs`:
- Around line 208-211: The shutdown send failure is an expected outcome when the
reth receiver has already been dropped, so replace the error-level log in the
self.reth_shutdown_sender.send(reason).await error branch with a non-error log
(e.g., debug! or info!) that notes the reth shutdown channel was already closed
and include the send error for context; update the logging call near the send
invocation to use debug!/info! and a concise message like "reth shutdown channel
closed" plus the error details.
In `@crates/chain/tests/api/hardfork_tests.rs`:
- Line 763: The test function
heavy3_test_borealis_accepts_update_reward_address_post_activation uses an
inconsistent "heavy3_test_" prefix; rename the function to
heavy_test_borealis_accepts_update_reward_address_post_activation to match the
existing heavy_test_ convention, update any references/usages (test attributes,
mod imports or calls) that refer to
heavy3_test_borealis_accepts_update_reward_address_post_activation, and run the
test suite to ensure no other code relies on the old name.
- Around line 164-170: The tracing::debug call is duplicating values because
scanned and skipped are passed as structured fields and also interpolated in the
format string; update the tracing::debug invocation (the call using
tracing::debug! with ?tx_id, scanned, skipped, max_height) to remove the
{scanned} and {skipped} interpolations from the message and leave them only as
structured fields (e.g., change the message to something like "tx not found in
scanned blocks (some heights failed to load)" or simply "tx not found" while
keeping ?tx_id, scanned, skipped, max_height as fields).
In `@crates/chain/tests/block_production/block_production.rs`:
- Around line 1173-1176: The wait predicate currently accepts any state except
NotOnchain(Unknown), which can include transient states and cause a race; update
the predicate passed to wait_for_block_event to require ev.block_hash ==
new_block_hash && ev.state == ChainState::Onchain so the test waits explicitly
for ChainState::Onchain before proceeding (referencing wait_for_block_event,
ChainState::Onchain, new_block_hash and new_block_state).
In `@crates/chain/tests/block_production/difficulty_adjustment.rs`:
- Around line 146-152: The test uses a hardcoded 60s timeout when calling
wait_for_block_event(&mut block_state_rx, 60, ...) which contradicts the earlier
max_seconds value (max_seconds) and a prior review asking for ≤10s; either
switch this call to use the existing max_seconds variable or introduce a clearly
named constant like validation_timeout = 60 with a brief doc comment explaining
why it must exceed max_seconds, then pass that constant into
wait_for_block_event to remove the magic number and make the intent explicit.
In `@crates/chain/tests/data_sync/sync_partition_data_tests.rs`:
- Around line 175-186: The test duplicates numeric literals for slot
expectations (e.g., expected_total = 60, max_data = 50) which can drift from the
setup; replace these hard-coded values in SlotSyncExpectation constructions
(slot0_expectation and slot1_expectation) by deriving them from the shared setup
constants used to build the test data (e.g., total_items, data_items_per_slot or
whatever constants configure packed vs data counts) so expected_total, max_data
and min_packed are computed from those shared variables instead of repeating
60/50/10 literals.
- Around line 246-250: The test captures periodic sync snapshots into
last_genesis_state / last_peer1_state / last_peer2_state but assertions may use
stale values; before any assert that reports node sync state, refresh those
variables by calling genesis_node.sync_state_snapshot().await,
peer1_node.sync_state_snapshot().await, and
peer2_node.sync_state_snapshot().await and reassign to last_genesis_state /
last_peer1_state / last_peer2_state immediately prior to the failure asserts (do
this in each assertion block where snapshot freshness matters, e.g., the blocks
currently setting those variables periodically).
In `@crates/chain/tests/validation/mod.rs`:
- Around line 68-70: Handle RecvError::Lagged explicitly in
read_block_from_state: locate the loop in read_block_from_state that polls
event_rx.try_recv() and add a match arm for RecvError::Lagged which logs a
warning (similar to wait_for_height_update and wait_for_block_event) and then
continues the outer polling loop instead of silently dropping events; ensure the
log message mentions that the broadcast receiver lagged and include relevant
identifiers (e.g., block hash or event type) for observability.
In `@crates/p2p/src/gossip_service.rs`:
- Around line 412-415: The current code logs a warn! unconditionally for the
broadcast_task_handle.wait_for_exit() result; change the handling to match on
broadcast_res (the Result returned by broadcast_task_handle.wait_for_exit()) and
log at info or debug when it is Ok(()) (or Ok value), and only emit warn! when
it is Err(e). Keep the assignment early_exit_result = Some(broadcast_res) but
perform the conditional logging before/after that assignment so normal/expected
exits are logged at a lower level and only errors use warn!.
---
Outside diff comments:
In `@crates/chain/tests/data_sync/sync_partition_data_tests.rs`:
- Around line 197-245: The three repeated blocks that call
check_storage_module_chunks and slot_matches_expectation for genesis, peer1, and
peer2 should be collapsed into a single helper function (e.g., evaluate_node or
check_node_slots) that takes the node handle, a node label string, and the
expectations (slot0_expectation, slot1_expectation) and returns a (synced: bool,
diag: String) tuple; inside the helper call check_storage_module_chunks for
DataLedger::Publish and DataLedger::Submit for slots 0 and 1 and reuse
slot_matches_expectation to compute booleans and diagnostics, then replace the
genesis_synced/peer1_synced/peer2_synced assignments and
last_genesis_diag/last_peer1_diag/last_peer2_diag formatting with calls to this
helper to eliminate duplication and ensure consistent behavior.
In `@crates/chain/tests/multi_node/mempool_tests.rs`:
- Around line 340-344: The test heavy_pending_pledges_test is currently
annotated with #[tokio::test] but configures tracing (sets RUST_LOG and calls
initialize_tracing()), so change the test attribute to use the test_log wrapper:
replace #[tokio::test] with #[test_log::test(tokio::test)] on the
heavy_pending_pledges_test function to ensure tracing output is captured per
project guidelines.
---
Duplicate comments:
In @.claude/skills/debugging-flaky-tests/SKILL.md:
- Line 185: The parameter name chunk in function wait_for_block_in_index(h,
chunk, secs) is misleading because callers pass a boolean; rename the parameter
to something like expect_persistent (or is_persistent) and update the function
signature/type hints, its docstring, all internal logic that treats it as a
boolean, and every call site that passes a boolean (e.g.,
wait_for_block_in_index(migration_height, false, seconds_to_wait)) to use the
new name; ensure tests and any comments are updated accordingly to reflect the
boolean semantic.
In `@crates/actors/src/block_tree_service.rs`:
- Around line 183-190: The drain loop in start() currently only skips
BlockValidationFinished and awaits self.inner.handle_message(msg,
parent_span).await? which will propagate any error (e.g.,
ValidationServiceUnreachable for BlockPreValidated) and abort the drain; change
the logic so that inside the match over BlockTreeServiceMessage you call
self.inner.handle_message(msg, parent_span).await but do not use the ? — instead
match on the Result, log the error with context (including the message variant)
and continue the loop; ensure BlockValidationFinished remains skipped and that
errors from handle_message (for example when handling BlockPreValidated) are
logged and ignored during shutdown rather than returned.
In `@crates/actors/src/services.rs`:
- Around line 228-237: The loop handling tokio::time::timeout_at(deadline,
tokio::time::timeout(idle, rx.recv())) currently collapses three distinct exit
causes into unit () — change the function to return a discriminated enum (e.g.,
Idle, DeadlineExceeded, ChannelClosed or similar) instead of () and have each
match arm return the appropriate variant (map Ok(Ok(Ok(_))) => Idle, Err(_) from
timeout_at => DeadlineExceeded, Ok(Ok(Err(Closed))) => ChannelClosed, and
preserve Lagged as Idle/continue or a distinct variant if desired); update
callers to handle the new enum. Ensure to define the enum near the function and
update match arms that now use tokio::time::timeout_at, tokio::time::timeout,
rx.recv(), and other referenced symbols so exit causes are distinguishable.
In `@crates/chain/tests/block_production/unpledge_refund.rs`:
- Around line 697-709: Replace the fallible `.await?` calls with
`.await.expect("...")` to match the rest of the test and improve CI diagnostics:
after calling `peer_node.mine_block().await?` use
`genesis_node.wait_for_block_at_height(...).await.expect("timed out waiting for
genesis to receive mined block via gossip")` and likewise change the
`genesis_node.wait_for_reth_marker(...).await?` to `.await.expect("timed out
waiting for genesis Reth marker for mined block")`; locate these changes around
the calls to `peer_node.mine_block`, `genesis_node.wait_for_block_at_height`,
and `genesis_node.wait_for_reth_marker` and use descriptive messages consistent
with other occurrences.
- Around line 85-89: The variable name inclusion_block_peer is misleading
because the block is produced by genesis_node.mine_block(); rename it to
something like inclusion_block or genesis_mined_block and update the assertion
message in the wait_for_block_at_height().expect(...) call (currently "genesis
should sync peer-mined inclusion block") to reflect that the block was mined by
genesis (e.g., "genesis should sync its mined inclusion block"); change
references to inclusion_block_peer to the new name and ensure tests continue to
compile and run.
In `@crates/chain/tests/block_production/unstake_refund.rs`:
- Line 1022: The test function
heavy3_unpledge_and_unstake_concurrent_success_flow has an unexplained numeric
prefix "heavy3_"—add a concise doc-comment immediately above the function that
documents the naming convention (e.g., that "heavy"/"heavyN" denotes test
weight/diagnostic/timeout tier and that "3" maps to the #[diag_slow] or specific
timeout/diagnostic level used) so future maintainers understand the suffix;
reference the function name heavy3_unpledge_and_unstake_concurrent_success_flow
and, if relevant, mention the association with #[diag_slow] and any expected
timeout/behavior in that single-line doc-comment.
In `@crates/chain/tests/multi_node/mempool_tests.rs`:
- Around line 1924-1925: The test function slow_heavy3_test_evm_gossip is still
annotated with #[tokio::test] even though you set RUST_LOG and call
initialize_tracing(); replace the attribute with #[test_log::test(tokio::test)]
on the slow_heavy3_test_evm_gossip function so the test harness initializes
logging/tracing properly when running under Tokio.
In `@crates/chain/tests/utils.rs`:
- Around line 1447-1461: The test currently treats
tokio::sync::broadcast::error::RecvError::Lagged as a terminal error when
awaiting reorg_rx.recv(), causing spurious failures; change the handling in the
match around tokio::time::timeout(timeout_duration, reorg_rx.recv()).await so
that when you get RecvError::Lagged(skipped) you log or warn (include skipped)
and then continue waiting (i.e., do not return Err, loop until a real event,
timeout, or closed), leaving the Ok(Ok(reorg_event)) and other error branches
intact; look for the match that calls reorg_rx.recv() and adjust the Lagged arm
to be recoverable.
In `@crates/macros/diag-slow/Cargo.toml`:
- Around line 1-19: No change required: the Cargo.toml for the
irys-macros-diag-slow crate is correctly configured—ensure proc-macro = true
remains in the [lib] section and the dependencies keep syn = { workspace = true,
features = ["full"] } and quote = { workspace = true } so the crate uses the
workspace-declared quote = "1.0".
In `@crates/macros/diag-slow/src/lib.rs`:
- Around line 101-112: The comment on the `async move` branch is misleading
because that branch is executed when `state_expr` is None; update the comment
around `fut_decl` for the `else` branch (the `let __diag_fut = async move
`#body`;` case) to state that `move` is safe because there is no `state_expr` that
would need to borrow captured variables (i.e., ownership transfer is fine when
`state_expr` is absent), and remove or replace the reference to "state
diagnostics" borrowing captured variables to accurately reflect the absence of
`state_expr`.
In `@crates/vdf/src/vdf.rs`:
- Around line 84-86: The fast-forward drain loop (the while let Ok(...) loop in
vdf.rs that consumes fast-forward messages) can starve shutdown because it does
not re-check shutdown_token.is_cancelled() while draining; modify the loop so it
checks shutdown_token.is_cancelled() between iterations (or incorporate the
token into the loop condition) and break immediately if cancelled, ensuring the
VDF loop exits promptly even under sustained fast-forward traffic.
ℹ️ Review info
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (33)
.claude/skills/debugging-flaky-tests/SKILL.md.claude/skills/writing-integration-tests/SKILL.mdCargo.tomlcrates/actors/src/block_tree_service.rscrates/actors/src/services.rscrates/chain/src/chain.rscrates/chain/src/main.rscrates/chain/tests/api/hardfork_tests.rscrates/chain/tests/block_production/block_production.rscrates/chain/tests/block_production/difficulty_adjustment.rscrates/chain/tests/block_production/unpledge_refund.rscrates/chain/tests/block_production/unstake_refund.rscrates/chain/tests/data_sync/sync_partition_data_tests.rscrates/chain/tests/external/block_production.rscrates/chain/tests/integration/cache_service.rscrates/chain/tests/multi_node/ema_forks.rscrates/chain/tests/multi_node/fork_recovery.rscrates/chain/tests/multi_node/mempool_tests.rscrates/chain/tests/multi_node/sync_chain_state.rscrates/chain/tests/multi_node/validation.rscrates/chain/tests/promotion/data_promotion_basic.rscrates/chain/tests/startup/startup.rscrates/chain/tests/term_ledger_expiry/mod.rscrates/chain/tests/utils.rscrates/chain/tests/validation/data_tx_pricing.rscrates/chain/tests/validation/invalid_perm_fee_refund.rscrates/chain/tests/validation/mod.rscrates/chain/tests/validation/unstake_edge_cases.rscrates/macros/diag-slow/Cargo.tomlcrates/macros/diag-slow/src/lib.rscrates/p2p/src/gossip_service.rscrates/p2p/src/tests/integration/mod.rscrates/vdf/src/vdf.rs
| - **`consensus.get_mut()` panics on non-Custom variants** - only works after `NodeConfig::testing()` which creates a `Custom` variant | ||
| - **Fund signers before starting nodes** — genesis accounts are baked into genesis block | ||
| - **`wait_for_mempool` before mining on genesis** when txs originate from a peer — gossip takes time | ||
| - **Epoch-aware tests need `testing_with_epochs(n)`** — default config has very large epochs |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Nit: "very large epochs" — consider a more precise alternative.
The static analysis tooling flagged "very" as an overused intensifier. Replacing it with a concrete descriptor (e.g., "extremely large" or simply "epochs sized for production, not tests") would be more informative.
🧰 Tools
🪛 LanguageTool
[style] ~107-~107: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ..._with_epochs(n)`** — default config has very large epochs - Block migration depth — bl...
(EN_WEAK_ADJECTIVE)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/writing-integration-tests/SKILL.md at line 107, The phrase
"very large epochs" in the sentence starting with "Epoch-aware tests need
`testing_with_epochs(n)`" is vague—replace it with a more precise descriptor
such as "epochs sized for production, not tests" or "epochs orders-of-magnitude
larger than typical test runs" (or "extremely large" if you prefer a shorter
change) so readers know why `testing_with_epochs(n)` is needed; update the line
in SKILL.md that contains "Epoch-aware tests need `testing_with_epochs(n)`"
accordingly.
| tracing::debug!( | ||
| "Transaction {} not found in blocks 1..={}", | ||
| tx_id, | ||
| max_height | ||
| ?tx_id, | ||
| scanned, | ||
| skipped, | ||
| max_height, | ||
| "tx not found in {scanned} blocks ({skipped} heights failed to load)" | ||
| ); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Structured fields scanned and skipped are double-recorded in the log event.
In tracing macros, bare-identifier fields (scanned,, skipped,) are captured as separate key-value pairs and the format-string {scanned} / {skipped} also interpolates those locals into the message field. The result is that both values appear twice in every structured log event.
Either keep the values only as structured fields (clean, machine-parseable) or embed them only in the message string — not both.
♻️ Proposed fix – remove interpolation from the message to avoid duplication
tracing::debug!(
?tx_id,
scanned,
skipped,
max_height,
- "tx not found in {scanned} blocks ({skipped} heights failed to load)"
+ "tx not found in blocks"
);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| tracing::debug!( | |
| "Transaction {} not found in blocks 1..={}", | |
| tx_id, | |
| max_height | |
| ?tx_id, | |
| scanned, | |
| skipped, | |
| max_height, | |
| "tx not found in {scanned} blocks ({skipped} heights failed to load)" | |
| ); | |
| tracing::debug!( | |
| ?tx_id, | |
| scanned, | |
| skipped, | |
| max_height, | |
| "tx not found in blocks" | |
| ); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/chain/tests/api/hardfork_tests.rs` around lines 164 - 170, The
tracing::debug call is duplicating values because scanned and skipped are passed
as structured fields and also interpolated in the format string; update the
tracing::debug invocation (the call using tracing::debug! with ?tx_id, scanned,
skipped, max_height) to remove the {scanned} and {skipped} interpolations from
the message and leave them only as structured fields (e.g., change the message
to something like "tx not found in scanned blocks (some heights failed to load)"
or simply "tx not found" while keeping ?tx_id, scanned, skipped, max_height as
fields).
|
|
||
| // 60s timeout (not max_seconds) because CI validation can be slower than mining | ||
| wait_for_block_event(&mut block_state_rx, 60, |ev| { | ||
| ev.block_hash == block.block_hash | ||
| && matches!(ev.validation_result, ValidationResult::Valid) | ||
| }) | ||
| .await?; |
There was a problem hiding this comment.
60s timeout may still be excessive — reviewer previously requested ≤ 10s.
The comment on line 147 explains the rationale ("CI validation can be slower than mining"), but a prior review round explicitly asked for ≤ 10s. max_seconds is already defined as 10 at line 76. If 60s is intentional, consider aligning by using a named constant (e.g., let validation_timeout = 60;) with a doc comment, rather than an inline magic number that contradicts the test's own max_seconds.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/chain/tests/block_production/difficulty_adjustment.rs` around lines
146 - 152, The test uses a hardcoded 60s timeout when calling
wait_for_block_event(&mut block_state_rx, 60, ...) which contradicts the earlier
max_seconds value (max_seconds) and a prior review asking for ≤10s; either
switch this call to use the existing max_seconds variable or introduce a clearly
named constant like validation_timeout = 60 with a brief doc comment explaining
why it must exceed max_seconds, then pass that constant into
wait_for_block_event to remove the magic number and make the intent explicit.
| // Slot(0) can be either mixed (data + packed) or fully packed. | ||
| // Slot(1) is expected to be fully packed in this test scenario. | ||
| let slot0_expectation = SlotSyncExpectation { | ||
| expected_total: 60, | ||
| max_data: 50, | ||
| min_packed: 10, | ||
| }; | ||
| let slot1_expectation = SlotSyncExpectation { | ||
| expected_total: 60, | ||
| max_data: 0, | ||
| min_packed: 60, | ||
| }; |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Derive slot expectations from single sources to avoid drift.
60 and 50 are duplicated across setup and expectations. Keeping these derived from shared values makes this test less brittle during future config tweaks.
♻️ Proposed refactor
- let slot0_expectation = SlotSyncExpectation {
- expected_total: 60,
- max_data: 50,
- min_packed: 10,
- };
- let slot1_expectation = SlotSyncExpectation {
- expected_total: 60,
- max_data: 0,
- min_packed: 60,
- };
+ let slot_total_chunks = 60usize;
+ let slot0_expectation = SlotSyncExpectation {
+ expected_total: slot_total_chunks,
+ max_data: num_chunks,
+ min_packed: slot_total_chunks.saturating_sub(num_chunks),
+ };
+ let slot1_expectation = SlotSyncExpectation {
+ expected_total: slot_total_chunks,
+ max_data: 0,
+ min_packed: slot_total_chunks,
+ };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/chain/tests/data_sync/sync_partition_data_tests.rs` around lines 175 -
186, The test duplicates numeric literals for slot expectations (e.g.,
expected_total = 60, max_data = 50) which can drift from the setup; replace
these hard-coded values in SlotSyncExpectation constructions (slot0_expectation
and slot1_expectation) by deriving them from the shared setup constants used to
build the test data (e.g., total_items, data_items_per_slot or whatever
constants configure packed vs data counts) so expected_total, max_data and
min_packed are computed from those shared variables instead of repeating
60/50/10 literals.
| if attempt % 5 == 0 || attempt == 79 || (genesis_synced && peer1_synced && peer2_synced) { | ||
| last_genesis_state = genesis_node.sync_state_snapshot().await; | ||
| last_peer1_state = peer1_node.sync_state_snapshot().await; | ||
| last_peer2_state = peer2_node.sync_state_snapshot().await; | ||
| } |
There was a problem hiding this comment.
Refresh sync-state snapshots immediately before failure asserts.
last_*_state is updated periodically, so assertion messages can report stale node state on failure. Capture one final snapshot right before asserting.
🛠️ Proposed fix
+ if !(genesis_synced && peer1_synced && peer2_synced) {
+ last_genesis_state = genesis_node.sync_state_snapshot().await;
+ last_peer1_state = peer1_node.sync_state_snapshot().await;
+ last_peer2_state = peer2_node.sync_state_snapshot().await;
+ }
+
assert!(
genesis_synced,
"Genesis node failed to sync data. chunks={} state={}",
last_genesis_diag, last_genesis_state
);Also applies to: 272-276, 284-298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/chain/tests/data_sync/sync_partition_data_tests.rs` around lines 246 -
250, The test captures periodic sync snapshots into last_genesis_state /
last_peer1_state / last_peer2_state but assertions may use stale values; before
any assert that reports node sync state, refresh those variables by calling
genesis_node.sync_state_snapshot().await,
peer1_node.sync_state_snapshot().await, and
peer2_node.sync_state_snapshot().await and reassign to last_genesis_state /
last_peer1_state / last_peer2_state immediately prior to the failure asserts (do
this in each assertion block where snapshot freshness matters, e.g., the blocks
currently setting those variables periodically).
| let event_rx = node_ctx.service_senders.subscribe_block_state_updates(); | ||
| send_block_to_block_tree(node_ctx, block.clone(), skip_vdf_validation).await?; | ||
| Ok(read_block_from_state(node_ctx, &block.header().block_hash, event_rx).await) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check how read_block_from_state handles RecvError::Lagged
rg -n "Lagged\|RecvError\|recv_error\|broadcast.*recv" --type rust crates/chain/tests/utils.rs | head -40Repository: Irys-xyz/irys
Length of output: 39
🏁 Script executed:
#!/bin/bash
# Read the read_block_from_state function
sed -n '3641,3696p' crates/chain/tests/utils.rsRepository: Irys-xyz/irys
Length of output: 2416
🏁 Script executed:
#!/bin/bash
# Search for broadcast receiver patterns and other wait helpers
rg -n "broadcast\|\.recv\(\)\|try_recv\|event_receiver" --type rust crates/chain/tests/utils.rs | head -50Repository: Irys-xyz/irys
Length of output: 39
🏁 Script executed:
#!/bin/bash
# Search more broadly for broadcast receiver patterns and Lagged handling
rg -n "\.try_recv\(\)|\.recv\(\)|Lagged" --type rust crates/chain/tests/ | head -60Repository: Irys-xyz/irys
Length of output: 843
🏁 Script executed:
#!/bin/bash
# Examine how Lagged is handled in different wait helpers
echo "=== Handler at line 1062 ==="
sed -n '1050,1070p' crates/chain/tests/utils.rs
echo -e "\n=== Handler at line 1458 ==="
sed -n '1445,1465p' crates/chain/tests/utils.rs
echo -e "\n=== Handler at line 3715 ==="
sed -n '3710,3730p' crates/chain/tests/utils.rsRepository: Irys-xyz/irys
Length of output: 3110
🏁 Script executed:
#!/bin/bash
# Search for try_recv documentation or other usages in the codebase
rg -n "try_recv" --type rust crates/chain/tests/ -A 3 | head -80Repository: Irys-xyz/irys
Length of output: 482
🏁 Script executed:
#!/bin/bash
# Verify tokio broadcast::Receiver::try_recv behavior with Tokio documentation
web_search "tokio broadcast Receiver try_recv Lagged behavior"Repository: Irys-xyz/irys
Length of output: 107
🌐 Web query:
tokio sync broadcast Receiver try_recv documentation TryRecvError
💡 Result:
tokio::sync::broadcast::Receiver::try_recv(&mut self) -> Result<T, TryRecvError> attempts to receive without awaiting and can fail with three TryRecvError variants: (docs.rs)
TryRecvError::Empty: no message is currently available but at least oneSenderstill exists (more messages may arrive). (docs.rs)TryRecvError::Closed: allSenderhandles have been dropped, so no further messages will ever be sent. (docs.rs)TryRecvError::Lagged(u64): this receiver fell behind far enough that older messages were overwritten; theu64is the number of skipped messages. After this error, the receiver’s cursor is moved forward so the next receive returns the oldest message still retained (if any). (docs.rs)
Citations:
- 1: https://docs.rs/tokio/latest/tokio/sync/broadcast/struct.Receiver.html?utm_source=openai
- 2: https://docs.rs/tokio/latest/tokio/sync/broadcast/error/enum.TryRecvError.html?utm_source=openai
- 3: https://docs.rs/tokio/latest/tokio/sync/broadcast/struct.Receiver.html?utm_source=openai
- 4: https://docs.rs/tokio/latest/tokio/sync/broadcast/struct.Receiver.html?utm_source=openai
read_block_from_state silently loses broadcast events on RecvError::Lagged — add explicit handling for consistency.
The function correctly continues polling via the outer loop when try_recv() encounters Lagged, so test failures are not caused by treating Lagged as fatal. However, validation events are silently discarded when the broadcast receiver falls behind, relying entirely on the block-tree fallback polling to detect state changes.
This pattern is inconsistent with other wait helpers in the same file (wait_for_height_update, wait_for_block_event) that explicitly match RecvError::Lagged and log warnings before continuing. Add an explicit Lagged case with logging to improve observability and align with the existing error-handling convention.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/chain/tests/validation/mod.rs` around lines 68 - 70, Handle
RecvError::Lagged explicitly in read_block_from_state: locate the loop in
read_block_from_state that polls event_rx.try_recv() and add a match arm for
RecvError::Lagged which logs a warning (similar to wait_for_height_update and
wait_for_block_event) and then continues the outer polling loop instead of
silently dropping events; ensure the log message mentions that the broadcast
receiver lagged and include relevant identifiers (e.g., block hash or event
type) for observability.
| broadcast_res = broadcast_task_handle.wait_for_exit() => { | ||
| warn!("Gossip broadcast exited because: {:?}", broadcast_res); | ||
| early_exit_result = Some(broadcast_res); | ||
| } |
There was a problem hiding this comment.
Conditionally log early broadcast exit severity
On Line 413, warn! is emitted even when wait_for_exit() returns Ok(()). That can reintroduce shutdown warning noise for expected exits. Please log success at info/debug and reserve warn for Err(_).
Proposed patch
- broadcast_res = broadcast_task_handle.wait_for_exit() => {
- warn!("Gossip broadcast exited because: {:?}", broadcast_res);
- early_exit_result = Some(broadcast_res);
- }
+ broadcast_res = broadcast_task_handle.wait_for_exit() => {
+ match &broadcast_res {
+ Ok(()) => info!("Gossip broadcast exited before watcher shutdown"),
+ Err(err) => warn!("Gossip broadcast exited with error: {err:?}"),
+ }
+ early_exit_result = Some(broadcast_res);
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| broadcast_res = broadcast_task_handle.wait_for_exit() => { | |
| warn!("Gossip broadcast exited because: {:?}", broadcast_res); | |
| early_exit_result = Some(broadcast_res); | |
| } | |
| broadcast_res = broadcast_task_handle.wait_for_exit() => { | |
| match &broadcast_res { | |
| Ok(()) => info!("Gossip broadcast exited before watcher shutdown"), | |
| Err(err) => warn!("Gossip broadcast exited with error: {err:?}"), | |
| } | |
| early_exit_result = Some(broadcast_res); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/p2p/src/gossip_service.rs` around lines 412 - 415, The current code
logs a warn! unconditionally for the broadcast_task_handle.wait_for_exit()
result; change the handling to match on broadcast_res (the Result returned by
broadcast_task_handle.wait_for_exit()) and log at info or debug when it is
Ok(()) (or Ok value), and only emit warn! when it is Err(e). Keep the assignment
early_exit_result = Some(broadcast_res) but perform the conditional logging
before/after that assignment so normal/expected exits are logged at a lower
level and only errors use warn!.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
crates/chain/tests/block_production/unpledge_refund.rs (2)
87-89:⚠️ Potential issue | 🟡 MinorRename misleading inclusion block wording.
This block is mined via
genesis_node.mine_block(), so theinclusion_block_peernaming and"peer-mined"expectation text are inaccurate and can mislead failure triage.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/block_production/unpledge_refund.rs` around lines 87 - 89, The test's variable and failure message are misleading: change the variable name inclusion_block_peer to reflect it was mined by the genesis node (e.g., inclusion_block_genesis) and update the expect() message passed to wait_for_block_at_height to say "genesis-mined inclusion block" (or similar) instead of "peer-mined inclusion block"; update all references to inclusion_block_peer in this test (including the .height access and any other uses) to the new name to keep names consistent with genesis_node.mine_block().
699-702:⚠️ Potential issue | 🟡 MinorAdd contextual failure message for this wait call.
This
wait_for_block_at_height(...).await?path is still inconsistent with nearby call sites that use.expect("...")context, which makes timeout diagnosis harder in CI logs.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/block_production/unpledge_refund.rs` around lines 699 - 702, Replace the fallible .await? on genesis_node.wait_for_block_at_height with an explicit panic message so timeouts show context; call wait_for_block_at_height(inclusion_block.height, seconds_to_wait).await.expect(...) and include the height (inclusion_block.height) and timeout (seconds_to_wait) in the message to make CI failures clear (e.g. "genesis node did not receive block at height {height} within {secs}s").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@crates/chain/tests/block_production/unpledge_refund.rs`:
- Around line 87-89: The test's variable and failure message are misleading:
change the variable name inclusion_block_peer to reflect it was mined by the
genesis node (e.g., inclusion_block_genesis) and update the expect() message
passed to wait_for_block_at_height to say "genesis-mined inclusion block" (or
similar) instead of "peer-mined inclusion block"; update all references to
inclusion_block_peer in this test (including the .height access and any other
uses) to the new name to keep names consistent with genesis_node.mine_block().
- Around line 699-702: Replace the fallible .await? on
genesis_node.wait_for_block_at_height with an explicit panic message so timeouts
show context; call wait_for_block_at_height(inclusion_block.height,
seconds_to_wait).await.expect(...) and include the height
(inclusion_block.height) and timeout (seconds_to_wait) in the message to make CI
failures clear (e.g. "genesis node did not receive block at height {height}
within {secs}s").
…ed log - Wait for ChainState::Onchain explicitly instead of any non-Unknown state - Rename heavy3_test_ → heavy_test_ for single-node borealis test - Downgrade reth shutdown channel closed log from error to debug
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/chain/tests/block_production/block_production.rs (1)
1572-1572:⚠️ Potential issue | 🟡 Minor
#[tokio::test]should be#[test_log::test(tokio::test)]for consistency.Every other async test in this file uses
#[test_log::test(tokio::test)]. This test also sets up tracing (viainitialize_tracing()at line 1583), making it a candidate for the standard attribute. Using the plain macro leaves the test outside the unified tracing/log-capture harness used across the suite.🛠️ Proposed fix
-#[tokio::test] +#[test_log::test(tokio::test)] /// test that config option max_commitment_txs_per_block is enforcedIf
initialize_tracing()is intentionally kept to configure a specific subscriber, ensure it doesn't conflict withtest_log's subscriber registration (e.g., guard against "a global default trace dispatcher has already been set" panics).As per coding guidelines: "Use
#[test_log::test(tokio::test)]for tests that require tracing output".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/block_production/block_production.rs` at line 1572, Replace the plain #[tokio::test] attribute on the test with #[test_log::test(tokio::test)] to ensure it participates in the suite-wide tracing/log-capture harness; update the test declaration that currently uses #[tokio::test] to #[test_log::test(tokio::test)] and, in the test body where initialize_tracing() is called, ensure you guard against re-registering a global subscriber (e.g., make initialize_tracing() idempotent or skip setting a default if one is already set) so it won’t panic when test_log has already registered a subscriber.
♻️ Duplicate comments (2)
crates/chain/src/chain.rs (1)
216-222:⚠️ Potential issue | 🟠 MajorAlign the reth stop timeout with the ServiceSet shutdown budget.
Line [1272] can take
~10s × service_count, but Line [216] is capped byRETH_THREAD_STOP_TIMEOUT(Line [405], 60s). With the current service list size, normal shutdown can exceed 60s and produce spurious timeout errors.Suggested fix
+/// Timeout for ServiceSet graceful shutdown aggregation. +const SERVICE_SET_TIMEOUT: Duration = Duration::from_secs(120); /// Timeout for the reth thread to complete its full shutdown sequence. -/// Budget: send(5s) + actor(25s) + service_set(~10 services × 10s) + task_manager(10s). -/// Using 60s to cover typical case with margin; worst-case depends on service count. -const RETH_THREAD_STOP_TIMEOUT: Duration = Duration::from_secs(60); +/// Budget: send(5s) + actor(25s) + service_set(120s) + task_manager(10s) + margin. +const RETH_THREAD_STOP_TIMEOUT: Duration = Duration::from_secs(170); - service_set.graceful_shutdown().await; + match tokio::time::timeout(SERVICE_SET_TIMEOUT, service_set.graceful_shutdown()).await { + Ok(()) => debug!("ServiceSet graceful shutdown complete"), + Err(_) => warn!("ServiceSet graceful shutdown timed out after {SERVICE_SET_TIMEOUT:?}"), + }Also applies to: 390-405, 1270-1272
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/src/chain.rs` around lines 216 - 222, The fixed RETH_THREAD_STOP_TIMEOUT is causing spurious timeouts because shutdown can take ~10s × service_count; replace uses of RETH_THREAD_STOP_TIMEOUT in the tokio::time::timeout calls (the rx match in chain.rs and the other occurrences around 390-405 and 1270-1272) with the ServiceSet shutdown budget calculation (e.g., obtain the ServiceSet shutdown_budget or compute per_service_timeout * service_count — the same value used for ServiceSet shutdown) so the reth stop wait uses the same total timeout as the ServiceSet shutdown budget.crates/chain/tests/api/hardfork_tests.rs (1)
164-170:⚠️ Potential issue | 🟡 MinorRemove duplicated
scanned/skippedvalues from the debug event.
scannedandskippedare already structured fields; interpolating them again in the message duplicates the same data.Suggested fix
tracing::debug!( ?tx_id, scanned, skipped, max_height, - "tx not found in {scanned} blocks ({skipped} heights failed to load)" + "tx not found in scanned blocks" );🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/api/hardfork_tests.rs` around lines 164 - 170, The tracing::debug call currently interpolates scanned and skipped both as structured fields and again inside the message string; remove that duplication by leaving scanned and skipped as structured fields (they are already passed as ?scanned and ?skipped) and change the message to not include {scanned} or {skipped} (e.g. "tx not found in scanned blocks (some heights failed to load)") in the tracing::debug invocation that includes ?tx_id, ?scanned, ?skipped, ?max_height.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@crates/chain/tests/block_production/block_production.rs`:
- Line 1572: Replace the plain #[tokio::test] attribute on the test with
#[test_log::test(tokio::test)] to ensure it participates in the suite-wide
tracing/log-capture harness; update the test declaration that currently uses
#[tokio::test] to #[test_log::test(tokio::test)] and, in the test body where
initialize_tracing() is called, ensure you guard against re-registering a global
subscriber (e.g., make initialize_tracing() idempotent or skip setting a default
if one is already set) so it won’t panic when test_log has already registered a
subscriber.
---
Duplicate comments:
In `@crates/chain/src/chain.rs`:
- Around line 216-222: The fixed RETH_THREAD_STOP_TIMEOUT is causing spurious
timeouts because shutdown can take ~10s × service_count; replace uses of
RETH_THREAD_STOP_TIMEOUT in the tokio::time::timeout calls (the rx match in
chain.rs and the other occurrences around 390-405 and 1270-1272) with the
ServiceSet shutdown budget calculation (e.g., obtain the ServiceSet
shutdown_budget or compute per_service_timeout * service_count — the same value
used for ServiceSet shutdown) so the reth stop wait uses the same total timeout
as the ServiceSet shutdown budget.
In `@crates/chain/tests/api/hardfork_tests.rs`:
- Around line 164-170: The tracing::debug call currently interpolates scanned
and skipped both as structured fields and again inside the message string;
remove that duplication by leaving scanned and skipped as structured fields
(they are already passed as ?scanned and ?skipped) and change the message to not
include {scanned} or {skipped} (e.g. "tx not found in scanned blocks (some
heights failed to load)") in the tracing::debug invocation that includes ?tx_id,
?scanned, ?skipped, ?max_height.
ℹ️ Review info
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
crates/chain/src/chain.rscrates/chain/tests/api/hardfork_tests.rscrates/chain/tests/block_production/block_production.rs
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
crates/chain/tests/block_production/unstake_refund.rs (1)
1100-1100: 🧹 Nitpick | 🔵 Trivial
heavy3_prefix is still unexplained.The rename from
heavy_unpledge_and_unstake_concurrent_success_flowtoheavy3_unpledge_and_unstake_concurrent_success_flowintroduces a numeric prefix that differs from all otherheavy_tests in this file, and there is no inline comment or doc note explaining the convention (e.g., test tier,#[diag_slow]level). Either rename back to the established pattern or document whatheavy3_signifies.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/block_production/unstake_refund.rs` at line 1100, The test function heavy3_unpledge_and_unstake_concurrent_success_flow uses an unexplained numeric prefix; either revert the name to the established heavy_unpledge_and_unstake_concurrent_success_flow or add a one-line comment or docstring above heavy3_unpledge_and_unstake_concurrent_success_flow explaining the meaning of the "3" (e.g., test tier or diag_slow level) so naming is consistent and discoverable; update any references to the test name if you rename the function and ensure any test attributes like #[diag_slow] are mentioned in the comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/chain/tests/block_production/unstake_refund.rs`:
- Around line 872-911: The function get_block_receipts double-counts the timeout
by waiting up to seconds_to_wait in wait_for_reth_marker and then again using
max_retries for the receipts polling loop; change the polling loop to use a
small fixed retry window (e.g., 3–5 retries or a fixed few-second total like 5s)
instead of reusing seconds_to_wait so you don't wait a second full timeout after
wait_for_reth_marker returns; update the variables around poll_interval and
max_retries (used in the for retry loop and the final error message) accordingly
and keep the wait_for_reth_marker call and the receipts_by_block check logic in
get_block_receipts unchanged except for the reduced retry budget.
- Around line 872-878: Remove the redundant reth_ctx parameter from the async
function get_block_receipts and all its call sites; instead access the reth
adapter via node.node_ctx.reth_node_adapter inside get_block_receipts. Update
the function signature to drop the reth_ctx parameter, replace any uses of
reth_ctx inside get_block_receipts with node.node_ctx.reth_node_adapter, and
update every caller (e.g., where genesis_node and
genesis_node.node_ctx.reth_node_adapter.clone() were both passed) to pass only
node; ensure types match (clone if needed) when using
node.node_ctx.reth_node_adapter at call sites.
---
Duplicate comments:
In `@crates/chain/tests/block_production/unstake_refund.rs`:
- Line 1100: The test function
heavy3_unpledge_and_unstake_concurrent_success_flow uses an unexplained numeric
prefix; either revert the name to the established
heavy_unpledge_and_unstake_concurrent_success_flow or add a one-line comment or
docstring above heavy3_unpledge_and_unstake_concurrent_success_flow explaining
the meaning of the "3" (e.g., test tier or diag_slow level) so naming is
consistent and discoverable; update any references to the test name if you
rename the function and ensure any test attributes like #[diag_slow] are
mentioned in the comment.
| async fn get_block_receipts( | ||
| node: &IrysNodeTest<irys_chain::IrysNodeCtx>, | ||
| reth_ctx: &irys_reth_node_bridge::IrysRethNodeAdapter, | ||
| block_height: u64, | ||
| block_hash: FixedBytes<32>, | ||
| seconds_to_wait: usize, | ||
| ) -> eyre::Result<Vec<reth::primitives::Receipt>> { | ||
| reth_ctx | ||
| .inner | ||
| .provider | ||
| .receipts_by_block(HashOrNumber::Hash(block_hash))? | ||
| .ok_or_else(|| eyre::eyre!("Receipts not found for block {:?}", block_hash)) | ||
| node.wait_for_reth_marker( | ||
| BlockNumberOrTag::Number(block_height), | ||
| block_hash, | ||
| seconds_to_wait as u64, | ||
| ) | ||
| .await?; | ||
|
|
||
| let poll_interval = Duration::from_millis(200); | ||
| let max_retries = ((seconds_to_wait * 1000) / poll_interval.as_millis() as usize).max(1); | ||
|
|
||
| for retry in 0..=max_retries { | ||
| if let Some(receipts) = reth_ctx | ||
| .inner | ||
| .provider | ||
| .receipts_by_block(HashOrNumber::Hash(block_hash))? | ||
| { | ||
| return Ok(receipts); | ||
| } | ||
|
|
||
| if retry == max_retries { | ||
| break; | ||
| } | ||
| sleep(poll_interval).await; | ||
| } | ||
|
|
||
| Err(eyre::eyre!( | ||
| "Receipts not found for block {:?} at height {} after {} retries ({}s)", | ||
| block_hash, | ||
| block_height, | ||
| max_retries, | ||
| seconds_to_wait | ||
| )) | ||
| } |
There was a problem hiding this comment.
get_block_receipts double-counts the timeout: effective max wait is 2 × seconds_to_wait.
wait_for_reth_marker already burns up to seconds_to_wait seconds waiting for the block to land in reth. The max_retries calculation then budgets the same seconds_to_wait a second time for the receipts polling loop, so the worst-case wall-clock wait is 40 s when seconds_to_wait = 20. Because wait_for_reth_marker guarantees the block is committed before returning, receipts are almost always available on the first poll; a much shorter retry budget (e.g., a fixed few seconds) is sufficient after a successful marker wait.
♻️ Proposed fix
- let poll_interval = Duration::from_millis(200);
- let max_retries = ((seconds_to_wait * 1000) / poll_interval.as_millis() as usize).max(1);
+ // After wait_for_reth_marker succeeds the block is already in reth; receipts
+ // should appear quickly — use a short fixed budget rather than re-spending the
+ // full seconds_to_wait.
+ let poll_interval = Duration::from_millis(200);
+ let max_retries = 25_usize; // 25 × 200 ms = 5 s safety margin📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async fn get_block_receipts( | |
| node: &IrysNodeTest<irys_chain::IrysNodeCtx>, | |
| reth_ctx: &irys_reth_node_bridge::IrysRethNodeAdapter, | |
| block_height: u64, | |
| block_hash: FixedBytes<32>, | |
| seconds_to_wait: usize, | |
| ) -> eyre::Result<Vec<reth::primitives::Receipt>> { | |
| reth_ctx | |
| .inner | |
| .provider | |
| .receipts_by_block(HashOrNumber::Hash(block_hash))? | |
| .ok_or_else(|| eyre::eyre!("Receipts not found for block {:?}", block_hash)) | |
| node.wait_for_reth_marker( | |
| BlockNumberOrTag::Number(block_height), | |
| block_hash, | |
| seconds_to_wait as u64, | |
| ) | |
| .await?; | |
| let poll_interval = Duration::from_millis(200); | |
| let max_retries = ((seconds_to_wait * 1000) / poll_interval.as_millis() as usize).max(1); | |
| for retry in 0..=max_retries { | |
| if let Some(receipts) = reth_ctx | |
| .inner | |
| .provider | |
| .receipts_by_block(HashOrNumber::Hash(block_hash))? | |
| { | |
| return Ok(receipts); | |
| } | |
| if retry == max_retries { | |
| break; | |
| } | |
| sleep(poll_interval).await; | |
| } | |
| Err(eyre::eyre!( | |
| "Receipts not found for block {:?} at height {} after {} retries ({}s)", | |
| block_hash, | |
| block_height, | |
| max_retries, | |
| seconds_to_wait | |
| )) | |
| } | |
| async fn get_block_receipts( | |
| node: &IrysNodeTest<irys_chain::IrysNodeCtx>, | |
| reth_ctx: &irys_reth_node_bridge::IrysRethNodeAdapter, | |
| block_height: u64, | |
| block_hash: FixedBytes<32>, | |
| seconds_to_wait: usize, | |
| ) -> eyre::Result<Vec<reth::primitives::Receipt>> { | |
| node.wait_for_reth_marker( | |
| BlockNumberOrTag::Number(block_height), | |
| block_hash, | |
| seconds_to_wait as u64, | |
| ) | |
| .await?; | |
| // After wait_for_reth_marker succeeds the block is already in reth; receipts | |
| // should appear quickly — use a short fixed budget rather than re-spending the | |
| // full seconds_to_wait. | |
| let poll_interval = Duration::from_millis(200); | |
| let max_retries = 25_usize; // 25 × 200 ms = 5 s safety margin | |
| for retry in 0..=max_retries { | |
| if let Some(receipts) = reth_ctx | |
| .inner | |
| .provider | |
| .receipts_by_block(HashOrNumber::Hash(block_hash))? | |
| { | |
| return Ok(receipts); | |
| } | |
| if retry == max_retries { | |
| break; | |
| } | |
| sleep(poll_interval).await; | |
| } | |
| Err(eyre::eyre!( | |
| "Receipts not found for block {:?} at height {} after {} retries ({}s)", | |
| block_hash, | |
| block_height, | |
| max_retries, | |
| seconds_to_wait | |
| )) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/chain/tests/block_production/unstake_refund.rs` around lines 872 -
911, The function get_block_receipts double-counts the timeout by waiting up to
seconds_to_wait in wait_for_reth_marker and then again using max_retries for the
receipts polling loop; change the polling loop to use a small fixed retry window
(e.g., 3–5 retries or a fixed few-second total like 5s) instead of reusing
seconds_to_wait so you don't wait a second full timeout after
wait_for_reth_marker returns; update the variables around poll_interval and
max_retries (used in the for retry loop and the final error message) accordingly
and keep the wait_for_reth_marker call and the receipts_by_block check logic in
get_block_receipts unchanged except for the reduced retry budget.
| async fn get_block_receipts( | ||
| node: &IrysNodeTest<irys_chain::IrysNodeCtx>, | ||
| reth_ctx: &irys_reth_node_bridge::IrysRethNodeAdapter, | ||
| block_height: u64, | ||
| block_hash: FixedBytes<32>, | ||
| seconds_to_wait: usize, | ||
| ) -> eyre::Result<Vec<reth::primitives::Receipt>> { |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the function definition and nearby code
head -c 3000 crates/chain/tests/block_production/unstake_refund.rs | tail -c 2000Repository: Irys-xyz/irys
Length of output: 2057
🏁 Script executed:
# Get the full function definition
sed -n '872,920p' crates/chain/tests/block_production/unstake_refund.rsRepository: Irys-xyz/irys
Length of output: 1432
🏁 Script executed:
# Find all call sites of get_block_receipts
rg -n "get_block_receipts" crates/chain/tests/block_production/unstake_refund.rs -A3 -B1Repository: Irys-xyz/irys
Length of output: 1721
🏁 Script executed:
# Search for the call context around lines 98-105 and 180-187
sed -n '95,110p' crates/chain/tests/block_production/unstake_refund.rsRepository: Irys-xyz/irys
Length of output: 494
🏁 Script executed:
# Check the IrysNodeTest struct definition
fd -e rs | xargs rg -l "struct IrysNodeTest"Repository: Irys-xyz/irys
Length of output: 84
🏁 Script executed:
# Look for reth_node_adapter in relation to IrysNodeTest
rg -n "reth_node_adapter" crates/chain/tests/ --type rust -B5 -A2 | head -60Repository: Irys-xyz/irys
Length of output: 3719
Remove the redundant reth_ctx parameter — it can be accessed through node.node_ctx.reth_node_adapter.
At every call site (lines 98–105, 180–187, 242–245, etc.), both node (&genesis_node) and reth_ctx (genesis_node.node_ctx.reth_node_adapter.clone()) come from the same underlying object. Since IrysNodeTest already exposes reth_node_adapter through node_ctx, the separate parameter is unnecessary. Collapse them by removing reth_ctx from the signature and replace its usages with node.node_ctx.reth_node_adapter, reducing caller boilerplate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/chain/tests/block_production/unstake_refund.rs` around lines 872 -
878, Remove the redundant reth_ctx parameter from the async function
get_block_receipts and all its call sites; instead access the reth adapter via
node.node_ctx.reth_node_adapter inside get_block_receipts. Update the function
signature to drop the reth_ctx parameter, replace any uses of reth_ctx inside
get_block_receipts with node.node_ctx.reth_node_adapter, and update every caller
(e.g., where genesis_node and genesis_node.node_ctx.reth_node_adapter.clone()
were both passed) to pass only node; ensure types match (clone if needed) when
using node.node_ctx.reth_node_adapter at call sites.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
crates/chain/tests/api/hardfork_tests.rs (1)
165-170:⚠️ Potential issue | 🟡 MinorRemove duplicated
scanned/skippedvalues in the debug log.Line 169 interpolates
{scanned}and{skipped}while those fields are already emitted structurally, so the event records them twice.♻️ Proposed fix
tracing::debug!( ?tx_id, scanned, skipped, max_height, - "tx not found in {scanned} blocks ({skipped} heights failed to load)" + "tx not found in searched blocks" );🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/api/hardfork_tests.rs` around lines 165 - 170, The debug/event log currently emits scanned and skipped both as structured fields and again via string interpolation in the message; edit the log call that includes tx_id, scanned, skipped, max_height and the message "tx not found in {scanned} blocks ({skipped} heights failed to load)" to remove the "{scanned}" and "{skipped}" interpolations from the message so those values are only emitted once as structured fields (leave tx_id and max_height usage intact).crates/chain/tests/utils.rs (2)
1446-1483:⚠️ Potential issue | 🟠 Major
RecvError::Laggedis still treated as fatal inwait_for_reorg_subscribed.Line 1464 returns an error on
Lagged, which is a recoverable condition — the receiver is repositioned to the oldest retained message. Under CI load this causes spurious failures. Wrap in a loop, log the lag, and continue waiting until the deadline.Suggested fix
async fn wait_for_reorg_subscribed( &self, mut reorg_rx: tokio::sync::broadcast::Receiver<ReorgEvent>, seconds_to_wait: usize, start_tip: H256, ) -> eyre::Result<ReorgEvent> { - let timeout_duration = Duration::from_secs(seconds_to_wait as u64); - match tokio::time::timeout(timeout_duration, reorg_rx.recv()).await { - Ok(Ok(reorg_event)) => { - info!( - "Reorg detected: {} blocks in old fork, {} in new fork, fork at height {}, new tip: {}", - reorg_event.old_fork.len(), - reorg_event.new_fork.len(), - reorg_event.fork_parent.height, - reorg_event.new_tip - ); - Ok(reorg_event) - } - Ok(Err(tokio::sync::broadcast::error::RecvError::Lagged(skipped))) => Err(eyre::eyre!( - "Reorg broadcast receiver lagged and skipped {} events", - skipped - )), - Ok(Err(tokio::sync::broadcast::error::RecvError::Closed)) => { - Err(eyre::eyre!("Reorg broadcast channel closed")) - } - Err(_) => { - let current_tip = self.get_max_difficulty_block().block_hash; - let state = self.diag_wait_state().await; - Err(eyre::eyre!( - "Timeout: No reorg event received within {} seconds (start_tip={}, current_tip={}). State: {}", - seconds_to_wait, - start_tip, - current_tip, - state - )) + let deadline = tokio::time::Instant::now() + + Duration::from_secs(seconds_to_wait as u64); + loop { + let remaining = deadline.saturating_duration_since(tokio::time::Instant::now()); + match tokio::time::timeout(remaining, reorg_rx.recv()).await { + Ok(Ok(reorg_event)) => { + info!( + "Reorg detected: {} blocks in old fork, {} in new fork, fork at height {}, new tip: {}", + reorg_event.old_fork.len(), + reorg_event.new_fork.len(), + reorg_event.fork_parent.height, + reorg_event.new_tip + ); + return Ok(reorg_event); + } + Ok(Err(tokio::sync::broadcast::error::RecvError::Lagged(skipped))) => { + warn!("Reorg receiver lagged; skipped {skipped} events, continuing"); + continue; + } + Ok(Err(tokio::sync::broadcast::error::RecvError::Closed)) => { + return Err(eyre::eyre!("Reorg broadcast channel closed")); + } + Err(_) => { + let current_tip = self.get_max_difficulty_block().block_hash; + let state = self.diag_wait_state().await; + return Err(eyre::eyre!( + "Timeout: No reorg event received within {} seconds (start_tip={}, current_tip={}). State: {}", + seconds_to_wait, + start_tip, + current_tip, + state + )); + } } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/utils.rs` around lines 1446 - 1483, In wait_for_reorg_subscribed, treat tokio::sync::broadcast::error::RecvError::Lagged as non-fatal by wrapping the recv call in a loop that retries until the timeout_duration elapses: on RecvError::Lagged(skipped) log the skipped count (using info/debug) and continue waiting, on RecvError::Closed return the existing error, and on successful Ok(reorg_event) return it; ensure the timeout uses the original deadline so repeated Lagged occurrences don’t reset the timer and use the existing symbols reorg_rx, timeout_duration, seconds_to_wait, start_tip, get_max_difficulty_block, and diag_wait_state to preserve current timeout/diagnostic behavior.
2249-2272:wait_for_tx_includedstill swallows all errors indiscriminately.Line 2263 catches every
Err(_)as a retry, masking channel/service failures that would never self-resolve. Only "not found" errors should be retried; genuine service failures should propagate immediately for better diagnostics.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/chain/tests/utils.rs` around lines 2249 - 2272, wait_for_tx_included is currently treating every Err from get_storage_tx_header_from_mempool as a transient “retry”, which hides permanent service/channel failures; change the error handling so you only retry when the error represents “not found” (or whatever concrete NotFound variant your storage API returns) and immediately return Err for all other errors. Locate wait_for_tx_included and the call to get_storage_tx_header_from_mempool(tx_id) and replace the blanket Err(_) arm with logic that inspects the error (e.g., pattern-match on the concrete error enum or use an is_not_found helper) — if it’s a not-found case continue polling, otherwise propagate the error (return Err(e)) so real failures surface. Ensure the returned error preserves the original error value/type instead of swallowing it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/chain/tests/utils.rs`:
- Around line 3519-3594: The code auto-starts the VDF in
solution_context_with_poa_chunk by calling node_ctx.start_vdf() then runs an
async block but calls node_ctx.stop_vdf() only after awaiting the block, so a
panic in the block can leave VDF running; fix by making the restore panic-safe —
either create a small RAII guard type (e.g., VdfGuard) that captures node_ctx
and the original was_vdf_enabled and calls node_ctx.stop_vdf() in Drop when it
started the VDF, or use scopeguard::defer! immediately after start_vdf() to
ensure stop_vdf() always runs; apply the identical change to solution_context as
well and reference node_ctx.start_vdf, node_ctx.stop_vdf,
solution_context_with_poa_chunk, and solution_context when making the change.
---
Duplicate comments:
In `@crates/chain/tests/api/hardfork_tests.rs`:
- Around line 165-170: The debug/event log currently emits scanned and skipped
both as structured fields and again via string interpolation in the message;
edit the log call that includes tx_id, scanned, skipped, max_height and the
message "tx not found in {scanned} blocks ({skipped} heights failed to load)" to
remove the "{scanned}" and "{skipped}" interpolations from the message so those
values are only emitted once as structured fields (leave tx_id and max_height
usage intact).
In `@crates/chain/tests/utils.rs`:
- Around line 1446-1483: In wait_for_reorg_subscribed, treat
tokio::sync::broadcast::error::RecvError::Lagged as non-fatal by wrapping the
recv call in a loop that retries until the timeout_duration elapses: on
RecvError::Lagged(skipped) log the skipped count (using info/debug) and continue
waiting, on RecvError::Closed return the existing error, and on successful
Ok(reorg_event) return it; ensure the timeout uses the original deadline so
repeated Lagged occurrences don’t reset the timer and use the existing symbols
reorg_rx, timeout_duration, seconds_to_wait, start_tip,
get_max_difficulty_block, and diag_wait_state to preserve current
timeout/diagnostic behavior.
- Around line 2249-2272: wait_for_tx_included is currently treating every Err
from get_storage_tx_header_from_mempool as a transient “retry”, which hides
permanent service/channel failures; change the error handling so you only retry
when the error represents “not found” (or whatever concrete NotFound variant
your storage API returns) and immediately return Err for all other errors.
Locate wait_for_tx_included and the call to
get_storage_tx_header_from_mempool(tx_id) and replace the blanket Err(_) arm
with logic that inspects the error (e.g., pattern-match on the concrete error
enum or use an is_not_found helper) — if it’s a not-found case continue polling,
otherwise propagate the error (return Err(e)) so real failures surface. Ensure
the returned error preserves the original error value/type instead of swallowing
it.
| let was_vdf_enabled = node_ctx | ||
| .is_vdf_mining_enabled | ||
| .load(std::sync::atomic::Ordering::Relaxed); | ||
| if !was_vdf_enabled { | ||
| node_ctx.start_vdf(); | ||
| } | ||
|
|
||
| let result = async { | ||
| // Ensure the VDF has at least two steps materialized (N-1, N) | ||
| let vdf_steps_guard = node_ctx.vdf_steps_guard.clone(); | ||
| let start = std::time::Instant::now(); | ||
| let max_wait = std::time::Duration::from_secs(5); | ||
| let (step, steps) = loop { | ||
| if start.elapsed() > max_wait { | ||
| return Err(eyre::eyre!( | ||
| "VDF steps unavailable: timed out waiting for (prev,current) pair" | ||
| )); | ||
| } | ||
| let s = vdf_steps_guard.read().global_step; | ||
| if s >= 1 { | ||
| if let Ok(steps) = vdf_steps_guard.read().get_steps(ii(s - 1, s)) { | ||
| if steps.len() >= 2 { | ||
| break (s, steps); | ||
| } | ||
| } | ||
| } | ||
| } | ||
| tokio::time::sleep(std::time::Duration::from_millis(50)).await; | ||
| }; | ||
| tokio::time::sleep(std::time::Duration::from_millis(50)).await; | ||
| }; | ||
|
|
||
| // Compute checkpoints for (step-1) | ||
| let mut hasher = Sha256::new(); | ||
| let mut salt = | ||
| irys_types::U256::from(step_number_to_salt_number(&node_ctx.config.vdf, step - 1)); | ||
| let mut seed = steps[0]; | ||
| let mut checkpoints: Vec<H256> = | ||
| vec![H256::default(); node_ctx.config.vdf.num_checkpoints_in_vdf_step]; | ||
| vdf_sha( | ||
| &mut hasher, | ||
| &mut salt, | ||
| &mut seed, | ||
| node_ctx.config.vdf.num_checkpoints_in_vdf_step, | ||
| node_ctx.config.vdf.num_iterations_per_checkpoint(), | ||
| &mut checkpoints, | ||
| ); | ||
| // Compute checkpoints for (step-1) | ||
| let mut hasher = Sha256::new(); | ||
| let mut salt = | ||
| irys_types::U256::from(step_number_to_salt_number(&node_ctx.config.vdf, step - 1)); | ||
| let mut seed = steps[0]; | ||
| let mut checkpoints: Vec<H256> = | ||
| vec![H256::default(); node_ctx.config.vdf.num_checkpoints_in_vdf_step]; | ||
| vdf_sha( | ||
| &mut hasher, | ||
| &mut salt, | ||
| &mut seed, | ||
| node_ctx.config.vdf.num_checkpoints_in_vdf_step, | ||
| node_ctx.config.vdf.num_iterations_per_checkpoint(), | ||
| &mut checkpoints, | ||
| ); | ||
|
|
||
| // For deterministic linkage without recall-range dependency, use offset 0 | ||
| let partition_hash = H256::zero(); | ||
| let partition_chunk_offset: u32 = 0; | ||
| // For deterministic linkage without recall-range dependency, use offset 0 | ||
| let partition_hash = H256::zero(); | ||
| let partition_chunk_offset: u32 = 0; | ||
|
|
||
| // Compute solution_hash = sha256(poa_chunk || offset_le || vdf_output) | ||
| let mut hasher_sol = Sha256::new(); | ||
| hasher_sol.update(&poa_chunk); | ||
| hasher_sol.update(partition_chunk_offset.to_le_bytes()); | ||
| hasher_sol.update(steps[1].as_bytes()); | ||
| let solution_hash = H256::from_slice(hasher_sol.finalize().as_slice()); | ||
|
|
||
| Ok(SolutionContext { | ||
| partition_hash, | ||
| chunk_offset: partition_chunk_offset, | ||
| mining_address: node_ctx.config.node_config.miner_address(), | ||
| tx_path: None, | ||
| data_path: None, | ||
| chunk: poa_chunk, | ||
| vdf_step: step, | ||
| checkpoints: H256List(checkpoints), | ||
| seed: Seed(steps[1]), | ||
| solution_hash, | ||
| }) | ||
| } | ||
| .await; | ||
|
|
||
| // Compute solution_hash = sha256(poa_chunk || offset_le || vdf_output) | ||
| let mut hasher_sol = Sha256::new(); | ||
| hasher_sol.update(&poa_chunk); | ||
| hasher_sol.update(partition_chunk_offset.to_le_bytes()); | ||
| hasher_sol.update(steps[1].as_bytes()); | ||
| let solution_hash = H256::from_slice(hasher_sol.finalize().as_slice()); | ||
| if !was_vdf_enabled { | ||
| node_ctx.stop_vdf(); | ||
| } | ||
|
|
||
| node_ctx.stop_vdf(); | ||
| Ok(SolutionContext { | ||
| partition_hash, | ||
| chunk_offset: partition_chunk_offset, | ||
| mining_address: node_ctx.config.node_config.miner_address(), | ||
| tx_path: None, | ||
| data_path: None, | ||
| chunk: poa_chunk, | ||
| vdf_step: step, | ||
| checkpoints: H256List(checkpoints), | ||
| seed: Seed(steps[1]), | ||
| solution_hash, | ||
| }) | ||
| result |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
VDF auto-start in solution_context_with_poa_chunk is not panic-safe.
If the inner async block panics (lines 3526–3587), the stop_vdf() call at line 3591 is never reached. While test panics typically abort, in #[should_panic] tests or multi-node scenarios this could leave the VDF running. A scope-guard (e.g. scopeguard::defer!) or refactoring the restore into a Drop wrapper would make this robust.
The same pattern exists in solution_context (lines 3609–3624).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/chain/tests/utils.rs` around lines 3519 - 3594, The code auto-starts
the VDF in solution_context_with_poa_chunk by calling node_ctx.start_vdf() then
runs an async block but calls node_ctx.stop_vdf() only after awaiting the block,
so a panic in the block can leave VDF running; fix by making the restore
panic-safe — either create a small RAII guard type (e.g., VdfGuard) that
captures node_ctx and the original was_vdf_enabled and calls node_ctx.stop_vdf()
in Drop when it started the VDF, or use scopeguard::defer! immediately after
start_vdf() to ensure stop_vdf() always runs; apply the identical change to
solution_context as well and reference node_ctx.start_vdf, node_ctx.stop_vdf,
solution_context_with_poa_chunk, and solution_context when making the change.
0d0df2a to
701ad31
Compare
701ad31 to
63e4edd
Compare
Summary
Overhauls node shutdown, startup, and internal synchronization to eliminate hangs and race conditions — both in production and tests.
Shutdown
JoinHandlejoins withoneshot::Receiverchannels for reth, VDF, and actor threadsstop_and_join_with_timeout()onTokioServiceHandleaborts tasks that exceed their budgetDatabaseProviderArc refs to drop before returning fromstop(), preventing MDBX lock contention on restartCancellationTokenpropagates to VDF, backfill, and actors for coordinated cancellationStartup
bind_listeners()thennew_with_listeners()— TCP ports are known before DB/config creationinitial_sync(); auto-stake uses broadcast quiescence detection instead ofsleep(2s)P2P
Test robustness
send_block_and_read_state()subscribes before sending to prevent validation races#[diag_slow]proc macro emits periodic warnings for hung async operations in CISummary by CodeRabbit
Documentation
Improvements
Tests
Chores