feat: sync diagnostics by antouhou · Pull Request #1117 · Irys-xyz/irys

antouhou · 2026-01-30T15:39:35Z

Describe the changes
This PR adds a sync diagnostic data

Related Issue(s)

Checklist

Tests have been added/updated for the changes.
Documentation has been updated for the changes (if applicable).
The code follows Rust's style guidelines.

Additional Context
Add any other context about the pull request here.

Summary by CodeRabbit

New Features
- Added comprehensive chain synchronization diagnostics: error histories, VDF step tracking, last successful block tracking, and a runtime diagnostic summary.
Improvements
- Enhanced observability: records validation, processing, and data-pull errors and successful block processing.
- Improved logging across block handling, peer pulls, queue timeouts, and VDF execution for easier troubleshooting.

coderabbitai · 2026-01-30T15:39:42Z

📝 Walkthrough

Walkthrough

Threads a diagnostics-enabled ChainSyncState through chain init, BlockTreeService, VDF, and P2P (block pool, gossip, chain sync). Adds SyncDiagnosticInfo storage and recording APIs; code records validation/processing errors, data-pull errors, VDF steps, and successful block processing; diagnostic summaries can be retrieved.

Changes

Cohort / File(s)	Summary
Diagnostic Infrastructure `crates/domain/src/models/chain_sync_state.rs`	Adds `SyncDiagnosticInfo`, `diagnostic_info: Arc<RwLock<...>>` on `ChainSyncState`, timestamped/capped diagnostic storage and formatting, plus public recording APIs for block validation/processing errors, VDF steps, data-pull errors, successful block processing, and summary retrieval.
Actor / Service Init `crates/actors/src/block_tree_service.rs`, `crates/chain/src/chain.rs`	Threads `ChainSyncState` into `BlockTreeService::spawn_service` / `BlockTreeServiceInner` and into VDF thread initialization; initializes new field and records block validation errors in `on_block_validation_finished`.
VDF Execution `crates/vdf/src/vdf.rs`	Adds `chain_sync_state: ChainSyncState` parameter to `run_vdf` and records VDF step numbers via `record_vdf_step(...)`; updates tests and VDF thread call sites to pass `ChainSyncState`.
P2P — Block Processing `crates/p2p/src/block_pool.rs`	Adds calls to `record_block_processing_error(...)` on various block error paths and `record_successful_block_processing(...)` after marking a block processed; control flow otherwise unchanged.
P2P — Gossip & Chain Sync `crates/p2p/src/gossip_data_handler.rs`, `crates/p2p/src/chain_sync.rs`	Gossip handlers now explicitly record data-pull errors to `ChainSyncState` and return on failure; chain_sync augments timeout logging and prints a full diagnostic summary on the final retry.

Sequence Diagram

sequenceDiagram
    participant ChainInit as Chain Init
    participant BTS as BlockTreeService
    participant VDF as VDF Engine
    participant BlockPool as Block Pool
    participant Gossip as Gossip Handler
    participant DiagState as ChainSyncState

    ChainInit->>DiagState: create / clone ChainSyncState
    ChainInit->>BTS: spawn_service(chain_sync_state)
    ChainInit->>VDF: init_vdf_thread(..., chain_sync_state)

    BlockPool->>BlockPool: process_block()
    alt validation fails
        BlockPool->>DiagState: record_block_validation_error(err)
    else forked / processing error
        BlockPool->>DiagState: record_block_processing_error(err)
    else success
        BlockPool->>DiagState: record_successful_block_processing(hash)
    end

    Gossip->>Gossip: pull_and_process_block(...)
    alt pull fails
        Gossip->>DiagState: record_data_pull_error(err)
    else success
        Gossip->>DiagState: record_successful_block_processing(hash)
    end

    VDF->>VDF: run_vdf loop
    VDF->>DiagState: record_vdf_step(step_number)

    BTS->>DiagState: get_diagnostic_summary()
    DiagState-->>BTS: diagnostic summary

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

JesseTheRobot

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: sync diagnostics' is concise and directly describes the main feature being added throughout the changeset.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat-sync-diag

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

crates/p2p/src/block_pool.rs (1)

951-964: ⚠️ Potential issue | 🟡 Minor

Classify prevalidation failures as validation errors.

handle_block failures are validation-stage issues; recording them as processing errors will skew diagnostics. Use record_block_validation_error here.

🔧 Suggested fix

-            self.sync_state.record_block_processing_error(format!(
-                "BlockPrevalidationFailed: {:?}",
-                block_discovery_error
-            ));
+            self.sync_state.record_block_validation_error(format!(
+                "BlockPrevalidationFailed: {:?}",
+                block_discovery_error
+            ));

crates/p2p/src/gossip_data_handler.rs (1)

756-805: ⚠️ Potential issue | 🟠 Major

Don’t return Ok on final payload-handling failure; diagnostics won’t run.
If handle_execution_payload fails on the last attempt, the function returns Ok(()), so the new post-loop diagnostics never execute and callers think the payload was cached. The invalid-peer early return also bypasses diagnostics. Consider breaking to the shared error path and using a message that matches all failure causes.

🔧 Suggested fix to propagate and record final-attempt failures

-                        None => {
-                            last_err = Some(GossipError::InvalidPeer(format!(
-                                "Peer not found for peer_id {:?}",
-                                source_peer_id
-                            )));
-                            if attempt < 3 {
-                                continue;
-                            }
-                            return Err(last_err.unwrap());
-                        }
+                        None => {
+                            last_err = Some(GossipError::InvalidPeer(format!(
+                                "Peer not found for peer_id {:?}",
+                                source_peer_id
+                            )));
+                            if attempt < 3 {
+                                continue;
+                            }
+                            break;
+                        }
...
-                    if let Err(e) = self
+                    if let Err(e) = self
                         .handle_execution_payload(GossipRequestV2 {
                             peer_id: source_peer_id,
                             miner_address,
                             data: execution_payload,
                         })
                         .await
                     {
                         last_err = Some(e);
                         if attempt < 3 {
                             continue;
                         }
-                    }
-                    return Ok(());
+                        break;
+                    }
+                    return Ok(());
...
-        self.sync_state.record_data_pull_error(format!(
-            "pull_payload_from_network failed for {}: {:?}",
-            evm_block_hash, err
-        ));
+        self.sync_state.record_data_pull_error(format!(
+            "pull_and_add_execution_payload_to_cache failed for {}: {:?}",
+            evm_block_hash, err
+        ));

🤖 Fix all issues with AI agents

In `@crates/actors/src/block_tree_service.rs`:
- Around line 608-610: Update the diagnostic recorded by
self.chain_sync_state.record_block_validation_error to include the block's hash
alongside validation_error.to_string(); locate the block being validated in the
surrounding function (e.g., the variable representing the Block or BlockHeader
used when validation fails) and format the message to include both the block
hash (via its canonical accessor, e.g., block.hash() or header.hash()) and the
validation error so logs contain "block=<hash> error=<details>" (or similar)
before calling record_block_validation_error.

In `@crates/domain/src/models/chain_sync_state.rs`:
- Around line 99-134: The summary labels hardcode "most recent 10" which will
become inaccurate if the MAX_LAST_BLOCK_VALIDATION_ERRORS constant changes;
update the three label strings near the loops over
self.last_block_validation_errors, self.last_block_processing_errors, and
self.last_data_pull_errors to interpolate or format the count using the
MAX_LAST_BLOCK_VALIDATION_ERRORS constant (e.g. "Last validation errors (most
recent {}):", "Last processing errors (most recent {}):", "Last data pull errors
(most recent {}):") so the message reflects the actual configured max.

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@crates/domain/src/models/chain_sync_state.rs`:
- Around line 94-97: The file has formatting differences causing cargo fmt to
fail around the multi-line format! calls (e.g., the output.push_str(&format!(
... self.total_block_validation_errors ... )) usages) and related lines (96,
113, 130); run cargo fmt and reformat the ChainSyncState display/serialization
code so the format! invocations, argument lists, and closing parens follow
rustfmt style (fix spacing/line breaks around output.push_str, format! and the
self.* fields such as self.total_block_validation_errors) and re-check with
cargo fmt to ensure no remaining style diffs.
- Around line 39-69: Replace the fixed-size FIFO Vecs used to store recent
errors with VecDeque to get O(1) front removals: change the types of
last_block_validation_errors, last_block_processing_errors, and
last_data_pull_errors to VecDeque<(String, SystemTime)>, import
std::collections::VecDeque, and update the three functions
record_block_validation_error, record_block_processing_error, and
record_data_pull_error to push_back the new (error, now) and, when len() >
MAX_LAST_BLOCK_VALIDATION_ERRORS, call pop_front() instead of remove(0); no
change needed for record_vdf_step.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@crates/domain/src/models/chain_sync_state.rs`:
- Around line 11-12: The doc comment claims SyncDiagnosticInfo is only
accessible via ChainSyncState but its methods are public; either make
SyncDiagnosticInfo's methods non-public (e.g., change pub fn to crate:: or
remove pub on methods) to enforce access via ChainSyncState, or update the
comment text to accurately reflect that SyncDiagnosticInfo exposes public
methods. Locate the struct SyncDiagnosticInfo and its impl block(s) and either
tighten method visibility or revise the top-level comment on the struct to
remove the misleading “access via methods on `ChainSyncState`” statement.

JesseTheRobot

overall LGTM, just a bit of nonblocking feedback around how the diagnostic strings are created from errors

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/p2p/src/gossip_data_handler.rs (1)
774-787: ⚠️ Potential issue | 🔴 Critical

Bug: Function returns Ok(()) when handle_execution_payload fails on the final retry.

When handle_execution_payload fails on the 3rd attempt, the code sets last_err but then falls through to return Ok(()); on line 787, incorrectly reporting success instead of failure.

Compare with the peer lookup failure handling (lines 768-771) which correctly returns the error on the final attempt.
🐛 Proposed fix
                     if let Err(e) = self
                         .handle_execution_payload(GossipRequestV2 {
                             peer_id: source_peer_id,
                             miner_address,
                             data: execution_payload,
                         })
                         .await
                     {
                         last_err = Some(e);
                         if attempt < 3 {
                             continue;
                         }
+                        // Fall through to error handling after the loop
+                    } else {
+                        return Ok(());
                     }
-                    return Ok(());
                 }

🤖 Fix all issues with AI agents

In `@crates/p2p/src/gossip_data_handler.rs`:
- Around line 729-736: The peer-not-found branch in the block pull path (inside
the code that calls self.peer_list.get_peer(&source_peer_id) and returns
GossipError::InvalidPeer for block_hash) misses recording diagnostics; add a
call to self.sync_state.record_data_pull_error(...) with the source_peer_id and
block_hash (matching the usage pattern in pull_and_process_block and
pull_and_process_block_from_peer) before logging/returning, or if omission is
intentional, add a brief comment in pull_block_header explaining callers handle
diagnostics so behavior is explicit.

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/p2p/src/gossip_data_handler.rs (1)
1262-1286: 🧹 Nitpick | 🔵 Trivial

Missing diagnostic recording for consistency with other data-pull operations.

pull_block_body builds a detailed failure_summary but doesn't record it via sync_state.record_data_pull_error(). This is inconsistent with pull_and_add_execution_payload_to_cache (lines 798-803) which records errors after exhausting retries.
🔧 Proposed fix to add diagnostic recording
         } else {
             "Failed to pull block body".to_string()
         };
 
+        self.sync_state.record_data_pull_error(failure_summary.clone());
         Err(GossipError::Internal(InternalGossipError::Unknown(
             failure_summary,
         )))

JesseTheRobot

LGTM

feat: sync diagnostics

0e93532

reformat

ebbfe26

antouhou marked this pull request as ready for review January 30, 2026 16:02

Merge branch 'master' into feat-sync-diag

d1fe1a5

coderabbitai Bot reviewed Jan 30, 2026

View reviewed changes

Comment thread crates/actors/src/block_tree_service.rs Outdated

Comment thread crates/domain/src/models/chain_sync_state.rs Outdated

antouhou added 2 commits January 30, 2026 15:25

fix coderabiit issues

f1550d3

Merge remote-tracking branch 'origin/feat-sync-diag' into feat-sync-diag

e055f9f

coderabbitai Bot reviewed Jan 30, 2026

View reviewed changes

Comment thread crates/domain/src/models/chain_sync_state.rs

Comment thread crates/domain/src/models/chain_sync_state.rs

reformat

07d7c3a

coderabbitai Bot reviewed Jan 30, 2026

View reviewed changes

Comment thread crates/domain/src/models/chain_sync_state.rs Outdated

JesseTheRobot reviewed Feb 3, 2026

View reviewed changes

Comment thread crates/p2p/src/block_pool.rs Outdated

JesseTheRobot approved these changes Feb 3, 2026

View reviewed changes

antouhou added 4 commits February 3, 2026 12:19

rabbit 1

009fe46

rabbit 2

702e07c

Jesse 1

e8decfa

reformat

d9c7e33

JesseTheRobot reviewed Feb 3, 2026

View reviewed changes

Comment thread crates/p2p/src/gossip_data_handler.rs Outdated

better error messages

1a63dfd

coderabbitai Bot reviewed Feb 4, 2026

View reviewed changes

Comment thread crates/p2p/src/gossip_data_handler.rs

antouhou added 2 commits February 4, 2026 12:57

fix rabbit

f27c233

Merge branch 'master' into feat-sync-diag

ab79a5e

coderabbitai Bot reviewed Feb 4, 2026

View reviewed changes

JesseTheRobot approved these changes Feb 6, 2026

View reviewed changes

antouhou merged commit 9ce78f1 into master Feb 6, 2026
18 checks passed

antouhou deleted the feat-sync-diag branch February 6, 2026 14:13

This was referenced Feb 9, 2026

feat: sealed blocks #1097

Merged

feat: better sync crash heuristics #1132

Merged

refactor: propagate an error if block tree rejects a block #1137

Merged

coderabbitai Bot mentioned this pull request Feb 17, 2026

feat: pd chunks endpoint #1149

Open

3 tasks

This was referenced Feb 17, 2026

refactor: migration lifecycle simplification #1136

Merged

test: stability #1142

Merged

chore: sync cleanup #1157

Merged

This was referenced Mar 13, 2026

test: add additional tests #1206

Merged

perf(vdf): optimize hot loop with direct compress256 and copy elimination #1214

Merged

feat: run mode #1228

Merged

coderabbitai Bot mentioned this pull request Mar 20, 2026

feat(multiversion-tests): add cross-version integration test harness #1207

Merged

3 tasks

coderabbitai Bot mentioned this pull request Apr 2, 2026

PD: Chain re-sync strategy when perm ledger data has expired #1381

Open

3 tasks

This was referenced Apr 16, 2026

Track deferred ProcessPendingChunks waiters to ensure shutdown quiescenc #1395

Open

VDF mining solution lag under chunk-ingress CPU pressure #1399

Open

This was referenced May 8, 2026

fix(p2p, validation): block-pool refcount and validation helper extraction #1408

Closed

feat: VDF stall detection logic tweaks #1413

Merged

feat: preval perf #1414

Merged

fix: block pool race #1417

Merged

fix: validation error disambiguation #1416

Merged

coderabbitai Bot mentioned this pull request Jun 15, 2026

fix(vdf): limit run-ahead via reset-seed confirmation gate #1449

Open

3 tasks

Conversation

antouhou commented Jan 30, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

JesseTheRobot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

JesseTheRobot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

antouhou commented Jan 30, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jan 30, 2026 •

edited

Loading