Skip to content

feat: sync diagnostics#1117

Merged
antouhou merged 13 commits into
masterfrom
feat-sync-diag
Feb 6, 2026
Merged

feat: sync diagnostics#1117
antouhou merged 13 commits into
masterfrom
feat-sync-diag

Conversation

@antouhou

@antouhou antouhou commented Jan 30, 2026

Copy link
Copy Markdown
Contributor

Describe the changes
This PR adds a sync diagnostic data

Related Issue(s)

Checklist

  • Tests have been added/updated for the changes.
  • Documentation has been updated for the changes (if applicable).
  • The code follows Rust's style guidelines.

Additional Context
Add any other context about the pull request here.

Summary by CodeRabbit

  • New Features

    • Added comprehensive chain synchronization diagnostics: error histories, VDF step tracking, last successful block tracking, and a runtime diagnostic summary.
  • Improvements

    • Enhanced observability: records validation, processing, and data-pull errors and successful block processing.
    • Improved logging across block handling, peer pulls, queue timeouts, and VDF execution for easier troubleshooting.

@coderabbitai

coderabbitai Bot commented Jan 30, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Threads a diagnostics-enabled ChainSyncState through chain init, BlockTreeService, VDF, and P2P (block pool, gossip, chain sync). Adds SyncDiagnosticInfo storage and recording APIs; code records validation/processing errors, data-pull errors, VDF steps, and successful block processing; diagnostic summaries can be retrieved.

Changes

Cohort / File(s) Summary
Diagnostic Infrastructure
crates/domain/src/models/chain_sync_state.rs
Adds SyncDiagnosticInfo, diagnostic_info: Arc<RwLock<...>> on ChainSyncState, timestamped/capped diagnostic storage and formatting, plus public recording APIs for block validation/processing errors, VDF steps, data-pull errors, successful block processing, and summary retrieval.
Actor / Service Init
crates/actors/src/block_tree_service.rs, crates/chain/src/chain.rs
Threads ChainSyncState into BlockTreeService::spawn_service / BlockTreeServiceInner and into VDF thread initialization; initializes new field and records block validation errors in on_block_validation_finished.
VDF Execution
crates/vdf/src/vdf.rs
Adds chain_sync_state: ChainSyncState parameter to run_vdf and records VDF step numbers via record_vdf_step(...); updates tests and VDF thread call sites to pass ChainSyncState.
P2P — Block Processing
crates/p2p/src/block_pool.rs
Adds calls to record_block_processing_error(...) on various block error paths and record_successful_block_processing(...) after marking a block processed; control flow otherwise unchanged.
P2P — Gossip & Chain Sync
crates/p2p/src/gossip_data_handler.rs, crates/p2p/src/chain_sync.rs
Gossip handlers now explicitly record data-pull errors to ChainSyncState and return on failure; chain_sync augments timeout logging and prints a full diagnostic summary on the final retry.

Sequence Diagram

sequenceDiagram
    participant ChainInit as Chain Init
    participant BTS as BlockTreeService
    participant VDF as VDF Engine
    participant BlockPool as Block Pool
    participant Gossip as Gossip Handler
    participant DiagState as ChainSyncState

    ChainInit->>DiagState: create / clone ChainSyncState
    ChainInit->>BTS: spawn_service(chain_sync_state)
    ChainInit->>VDF: init_vdf_thread(..., chain_sync_state)

    BlockPool->>BlockPool: process_block()
    alt validation fails
        BlockPool->>DiagState: record_block_validation_error(err)
    else forked / processing error
        BlockPool->>DiagState: record_block_processing_error(err)
    else success
        BlockPool->>DiagState: record_successful_block_processing(hash)
    end

    Gossip->>Gossip: pull_and_process_block(...)
    alt pull fails
        Gossip->>DiagState: record_data_pull_error(err)
    else success
        Gossip->>DiagState: record_successful_block_processing(hash)
    end

    VDF->>VDF: run_vdf loop
    VDF->>DiagState: record_vdf_step(step_number)

    BTS->>DiagState: get_diagnostic_summary()
    DiagState-->>BTS: diagnostic summary
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • JesseTheRobot
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: sync diagnostics' is concise and directly describes the main feature being added throughout the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat-sync-diag

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@antouhou antouhou marked this pull request as ready for review January 30, 2026 16:02

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/p2p/src/block_pool.rs (1)

951-964: ⚠️ Potential issue | 🟡 Minor

Classify prevalidation failures as validation errors.

handle_block failures are validation-stage issues; recording them as processing errors will skew diagnostics. Use record_block_validation_error here.

🔧 Suggested fix
-            self.sync_state.record_block_processing_error(format!(
-                "BlockPrevalidationFailed: {:?}",
-                block_discovery_error
-            ));
+            self.sync_state.record_block_validation_error(format!(
+                "BlockPrevalidationFailed: {:?}",
+                block_discovery_error
+            ));
crates/p2p/src/gossip_data_handler.rs (1)

756-805: ⚠️ Potential issue | 🟠 Major

Don’t return Ok on final payload-handling failure; diagnostics won’t run.
If handle_execution_payload fails on the last attempt, the function returns Ok(()), so the new post-loop diagnostics never execute and callers think the payload was cached. The invalid-peer early return also bypasses diagnostics. Consider breaking to the shared error path and using a message that matches all failure causes.

🔧 Suggested fix to propagate and record final-attempt failures
-                        None => {
-                            last_err = Some(GossipError::InvalidPeer(format!(
-                                "Peer not found for peer_id {:?}",
-                                source_peer_id
-                            )));
-                            if attempt < 3 {
-                                continue;
-                            }
-                            return Err(last_err.unwrap());
-                        }
+                        None => {
+                            last_err = Some(GossipError::InvalidPeer(format!(
+                                "Peer not found for peer_id {:?}",
+                                source_peer_id
+                            )));
+                            if attempt < 3 {
+                                continue;
+                            }
+                            break;
+                        }
...
-                    if let Err(e) = self
+                    if let Err(e) = self
                         .handle_execution_payload(GossipRequestV2 {
                             peer_id: source_peer_id,
                             miner_address,
                             data: execution_payload,
                         })
                         .await
                     {
                         last_err = Some(e);
                         if attempt < 3 {
                             continue;
                         }
-                    }
-                    return Ok(());
+                        break;
+                    }
+                    return Ok(());
...
-        self.sync_state.record_data_pull_error(format!(
-            "pull_payload_from_network failed for {}: {:?}",
-            evm_block_hash, err
-        ));
+        self.sync_state.record_data_pull_error(format!(
+            "pull_and_add_execution_payload_to_cache failed for {}: {:?}",
+            evm_block_hash, err
+        ));
🤖 Fix all issues with AI agents
In `@crates/actors/src/block_tree_service.rs`:
- Around line 608-610: Update the diagnostic recorded by
self.chain_sync_state.record_block_validation_error to include the block's hash
alongside validation_error.to_string(); locate the block being validated in the
surrounding function (e.g., the variable representing the Block or BlockHeader
used when validation fails) and format the message to include both the block
hash (via its canonical accessor, e.g., block.hash() or header.hash()) and the
validation error so logs contain "block=<hash> error=<details>" (or similar)
before calling record_block_validation_error.

In `@crates/domain/src/models/chain_sync_state.rs`:
- Around line 99-134: The summary labels hardcode "most recent 10" which will
become inaccurate if the MAX_LAST_BLOCK_VALIDATION_ERRORS constant changes;
update the three label strings near the loops over
self.last_block_validation_errors, self.last_block_processing_errors, and
self.last_data_pull_errors to interpolate or format the count using the
MAX_LAST_BLOCK_VALIDATION_ERRORS constant (e.g. "Last validation errors (most
recent {}):", "Last processing errors (most recent {}):", "Last data pull errors
(most recent {}):") so the message reflects the actual configured max.

Comment thread crates/actors/src/block_tree_service.rs Outdated
Comment thread crates/domain/src/models/chain_sync_state.rs Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@crates/domain/src/models/chain_sync_state.rs`:
- Around line 94-97: The file has formatting differences causing cargo fmt to
fail around the multi-line format! calls (e.g., the output.push_str(&format!(
... self.total_block_validation_errors ... )) usages) and related lines (96,
113, 130); run cargo fmt and reformat the ChainSyncState display/serialization
code so the format! invocations, argument lists, and closing parens follow
rustfmt style (fix spacing/line breaks around output.push_str, format! and the
self.* fields such as self.total_block_validation_errors) and re-check with
cargo fmt to ensure no remaining style diffs.
- Around line 39-69: Replace the fixed-size FIFO Vecs used to store recent
errors with VecDeque to get O(1) front removals: change the types of
last_block_validation_errors, last_block_processing_errors, and
last_data_pull_errors to VecDeque<(String, SystemTime)>, import
std::collections::VecDeque, and update the three functions
record_block_validation_error, record_block_processing_error, and
record_data_pull_error to push_back the new (error, now) and, when len() >
MAX_LAST_BLOCK_VALIDATION_ERRORS, call pop_front() instead of remove(0); no
change needed for record_vdf_step.

Comment thread crates/domain/src/models/chain_sync_state.rs
Comment thread crates/domain/src/models/chain_sync_state.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@crates/domain/src/models/chain_sync_state.rs`:
- Around line 11-12: The doc comment claims SyncDiagnosticInfo is only
accessible via ChainSyncState but its methods are public; either make
SyncDiagnosticInfo's methods non-public (e.g., change pub fn to crate:: or
remove pub on methods) to enforce access via ChainSyncState, or update the
comment text to accurately reflect that SyncDiagnosticInfo exposes public
methods. Locate the struct SyncDiagnosticInfo and its impl block(s) and either
tighten method visibility or revise the top-level comment on the struct to
remove the misleading “access via methods on `ChainSyncState`” statement.

Comment thread crates/domain/src/models/chain_sync_state.rs Outdated
Comment thread crates/p2p/src/block_pool.rs Outdated

@JesseTheRobot JesseTheRobot left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall LGTM, just a bit of nonblocking feedback around how the diagnostic strings are created from errors

Comment thread crates/p2p/src/gossip_data_handler.rs Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/p2p/src/gossip_data_handler.rs (1)

774-787: ⚠️ Potential issue | 🔴 Critical

Bug: Function returns Ok(()) when handle_execution_payload fails on the final retry.

When handle_execution_payload fails on the 3rd attempt, the code sets last_err but then falls through to return Ok(()); on line 787, incorrectly reporting success instead of failure.

Compare with the peer lookup failure handling (lines 768-771) which correctly returns the error on the final attempt.

🐛 Proposed fix
                     if let Err(e) = self
                         .handle_execution_payload(GossipRequestV2 {
                             peer_id: source_peer_id,
                             miner_address,
                             data: execution_payload,
                         })
                         .await
                     {
                         last_err = Some(e);
                         if attempt < 3 {
                             continue;
                         }
+                        // Fall through to error handling after the loop
+                    } else {
+                        return Ok(());
                     }
-                    return Ok(());
                 }
🤖 Fix all issues with AI agents
In `@crates/p2p/src/gossip_data_handler.rs`:
- Around line 729-736: The peer-not-found branch in the block pull path (inside
the code that calls self.peer_list.get_peer(&source_peer_id) and returns
GossipError::InvalidPeer for block_hash) misses recording diagnostics; add a
call to self.sync_state.record_data_pull_error(...) with the source_peer_id and
block_hash (matching the usage pattern in pull_and_process_block and
pull_and_process_block_from_peer) before logging/returning, or if omission is
intentional, add a brief comment in pull_block_header explaining callers handle
diagnostics so behavior is explicit.

Comment thread crates/p2p/src/gossip_data_handler.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/p2p/src/gossip_data_handler.rs (1)

1262-1286: 🧹 Nitpick | 🔵 Trivial

Missing diagnostic recording for consistency with other data-pull operations.

pull_block_body builds a detailed failure_summary but doesn't record it via sync_state.record_data_pull_error(). This is inconsistent with pull_and_add_execution_payload_to_cache (lines 798-803) which records errors after exhausting retries.

🔧 Proposed fix to add diagnostic recording
         } else {
             "Failed to pull block body".to_string()
         };
 
+        self.sync_state.record_data_pull_error(failure_summary.clone());
         Err(GossipError::Internal(InternalGossipError::Unknown(
             failure_summary,
         )))

@JesseTheRobot JesseTheRobot left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antouhou antouhou merged commit 9ce78f1 into master Feb 6, 2026
18 checks passed
@antouhou antouhou deleted the feat-sync-diag branch February 6, 2026 14:13
@coderabbitai coderabbitai Bot mentioned this pull request Feb 17, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants