feat: sync diagnostics#1117
Conversation
📝 WalkthroughWalkthroughThreads a diagnostics-enabled ChainSyncState through chain init, BlockTreeService, VDF, and P2P (block pool, gossip, chain sync). Adds SyncDiagnosticInfo storage and recording APIs; code records validation/processing errors, data-pull errors, VDF steps, and successful block processing; diagnostic summaries can be retrieved. Changes
Sequence DiagramsequenceDiagram
participant ChainInit as Chain Init
participant BTS as BlockTreeService
participant VDF as VDF Engine
participant BlockPool as Block Pool
participant Gossip as Gossip Handler
participant DiagState as ChainSyncState
ChainInit->>DiagState: create / clone ChainSyncState
ChainInit->>BTS: spawn_service(chain_sync_state)
ChainInit->>VDF: init_vdf_thread(..., chain_sync_state)
BlockPool->>BlockPool: process_block()
alt validation fails
BlockPool->>DiagState: record_block_validation_error(err)
else forked / processing error
BlockPool->>DiagState: record_block_processing_error(err)
else success
BlockPool->>DiagState: record_successful_block_processing(hash)
end
Gossip->>Gossip: pull_and_process_block(...)
alt pull fails
Gossip->>DiagState: record_data_pull_error(err)
else success
Gossip->>DiagState: record_successful_block_processing(hash)
end
VDF->>VDF: run_vdf loop
VDF->>DiagState: record_vdf_step(step_number)
BTS->>DiagState: get_diagnostic_summary()
DiagState-->>BTS: diagnostic summary
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/p2p/src/block_pool.rs (1)
951-964:⚠️ Potential issue | 🟡 MinorClassify prevalidation failures as validation errors.
handle_blockfailures are validation-stage issues; recording them as processing errors will skew diagnostics. Userecord_block_validation_errorhere.🔧 Suggested fix
- self.sync_state.record_block_processing_error(format!( - "BlockPrevalidationFailed: {:?}", - block_discovery_error - )); + self.sync_state.record_block_validation_error(format!( + "BlockPrevalidationFailed: {:?}", + block_discovery_error + ));crates/p2p/src/gossip_data_handler.rs (1)
756-805:⚠️ Potential issue | 🟠 MajorDon’t return Ok on final payload-handling failure; diagnostics won’t run.
Ifhandle_execution_payloadfails on the last attempt, the function returnsOk(()), so the new post-loop diagnostics never execute and callers think the payload was cached. The invalid-peer early return also bypasses diagnostics. Consider breaking to the shared error path and using a message that matches all failure causes.🔧 Suggested fix to propagate and record final-attempt failures
- None => { - last_err = Some(GossipError::InvalidPeer(format!( - "Peer not found for peer_id {:?}", - source_peer_id - ))); - if attempt < 3 { - continue; - } - return Err(last_err.unwrap()); - } + None => { + last_err = Some(GossipError::InvalidPeer(format!( + "Peer not found for peer_id {:?}", + source_peer_id + ))); + if attempt < 3 { + continue; + } + break; + } ... - if let Err(e) = self + if let Err(e) = self .handle_execution_payload(GossipRequestV2 { peer_id: source_peer_id, miner_address, data: execution_payload, }) .await { last_err = Some(e); if attempt < 3 { continue; } - } - return Ok(()); + break; + } + return Ok(()); ... - self.sync_state.record_data_pull_error(format!( - "pull_payload_from_network failed for {}: {:?}", - evm_block_hash, err - )); + self.sync_state.record_data_pull_error(format!( + "pull_and_add_execution_payload_to_cache failed for {}: {:?}", + evm_block_hash, err + ));
🤖 Fix all issues with AI agents
In `@crates/actors/src/block_tree_service.rs`:
- Around line 608-610: Update the diagnostic recorded by
self.chain_sync_state.record_block_validation_error to include the block's hash
alongside validation_error.to_string(); locate the block being validated in the
surrounding function (e.g., the variable representing the Block or BlockHeader
used when validation fails) and format the message to include both the block
hash (via its canonical accessor, e.g., block.hash() or header.hash()) and the
validation error so logs contain "block=<hash> error=<details>" (or similar)
before calling record_block_validation_error.
In `@crates/domain/src/models/chain_sync_state.rs`:
- Around line 99-134: The summary labels hardcode "most recent 10" which will
become inaccurate if the MAX_LAST_BLOCK_VALIDATION_ERRORS constant changes;
update the three label strings near the loops over
self.last_block_validation_errors, self.last_block_processing_errors, and
self.last_data_pull_errors to interpolate or format the count using the
MAX_LAST_BLOCK_VALIDATION_ERRORS constant (e.g. "Last validation errors (most
recent {}):", "Last processing errors (most recent {}):", "Last data pull errors
(most recent {}):") so the message reflects the actual configured max.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@crates/domain/src/models/chain_sync_state.rs`:
- Around line 94-97: The file has formatting differences causing cargo fmt to
fail around the multi-line format! calls (e.g., the output.push_str(&format!(
... self.total_block_validation_errors ... )) usages) and related lines (96,
113, 130); run cargo fmt and reformat the ChainSyncState display/serialization
code so the format! invocations, argument lists, and closing parens follow
rustfmt style (fix spacing/line breaks around output.push_str, format! and the
self.* fields such as self.total_block_validation_errors) and re-check with
cargo fmt to ensure no remaining style diffs.
- Around line 39-69: Replace the fixed-size FIFO Vecs used to store recent
errors with VecDeque to get O(1) front removals: change the types of
last_block_validation_errors, last_block_processing_errors, and
last_data_pull_errors to VecDeque<(String, SystemTime)>, import
std::collections::VecDeque, and update the three functions
record_block_validation_error, record_block_processing_error, and
record_data_pull_error to push_back the new (error, now) and, when len() >
MAX_LAST_BLOCK_VALIDATION_ERRORS, call pop_front() instead of remove(0); no
change needed for record_vdf_step.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@crates/domain/src/models/chain_sync_state.rs`:
- Around line 11-12: The doc comment claims SyncDiagnosticInfo is only
accessible via ChainSyncState but its methods are public; either make
SyncDiagnosticInfo's methods non-public (e.g., change pub fn to crate:: or
remove pub on methods) to enforce access via ChainSyncState, or update the
comment text to accurately reflect that SyncDiagnosticInfo exposes public
methods. Locate the struct SyncDiagnosticInfo and its impl block(s) and either
tighten method visibility or revise the top-level comment on the struct to
remove the misleading “access via methods on `ChainSyncState`” statement.
JesseTheRobot
left a comment
There was a problem hiding this comment.
overall LGTM, just a bit of nonblocking feedback around how the diagnostic strings are created from errors
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/p2p/src/gossip_data_handler.rs (1)
774-787:⚠️ Potential issue | 🔴 CriticalBug: Function returns
Ok(())whenhandle_execution_payloadfails on the final retry.When
handle_execution_payloadfails on the 3rd attempt, the code setslast_errbut then falls through toreturn Ok(());on line 787, incorrectly reporting success instead of failure.Compare with the peer lookup failure handling (lines 768-771) which correctly returns the error on the final attempt.
🐛 Proposed fix
if let Err(e) = self .handle_execution_payload(GossipRequestV2 { peer_id: source_peer_id, miner_address, data: execution_payload, }) .await { last_err = Some(e); if attempt < 3 { continue; } + // Fall through to error handling after the loop + } else { + return Ok(()); } - return Ok(()); }
🤖 Fix all issues with AI agents
In `@crates/p2p/src/gossip_data_handler.rs`:
- Around line 729-736: The peer-not-found branch in the block pull path (inside
the code that calls self.peer_list.get_peer(&source_peer_id) and returns
GossipError::InvalidPeer for block_hash) misses recording diagnostics; add a
call to self.sync_state.record_data_pull_error(...) with the source_peer_id and
block_hash (matching the usage pattern in pull_and_process_block and
pull_and_process_block_from_peer) before logging/returning, or if omission is
intentional, add a brief comment in pull_block_header explaining callers handle
diagnostics so behavior is explicit.
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/p2p/src/gossip_data_handler.rs (1)
1262-1286: 🧹 Nitpick | 🔵 TrivialMissing diagnostic recording for consistency with other data-pull operations.
pull_block_bodybuilds a detailedfailure_summarybut doesn't record it viasync_state.record_data_pull_error(). This is inconsistent withpull_and_add_execution_payload_to_cache(lines 798-803) which records errors after exhausting retries.🔧 Proposed fix to add diagnostic recording
} else { "Failed to pull block body".to_string() }; + self.sync_state.record_data_pull_error(failure_summary.clone()); Err(GossipError::Internal(InternalGossipError::Unknown( failure_summary, )))
Describe the changes
This PR adds a sync diagnostic data
Related Issue(s)
Checklist
Additional Context
Add any other context about the pull request here.
Summary by CodeRabbit
New Features
Improvements