Skip to content

feat: Network partition recovery: deep reorg support + integration tests#1405

Merged
DanMacDonald merged 17 commits into
masterfrom
dmac/network-partition-recovery
Jun 3, 2026
Merged

feat: Network partition recovery: deep reorg support + integration tests#1405
DanMacDonald merged 17 commits into
masterfrom
dmac/network-partition-recovery

Conversation

@DanMacDonald

@DanMacDonald DanMacDonald commented May 1, 2026

Copy link
Copy Markdown
Collaborator

Summary

~295 lines of protocol fixes, ~1,470 lines of test code (1,765 total)

Protocol fixes

  • recover_from_network_partition: When a reorg exceeds block_migration_depth, rolls back migrated state — clears orphaned storage module indexes, truncates the block index to the fork point, rolls back supply state — then re-migrates the winning fork
  • OneYear/ThirtyDay storage mapping: map_storage_modules_to_partition_assignments() now forwards OneYear and ThirtyDay partition assignments to StorageModuleService, and on_block_migrated() extracts their transactions for chunk migration
  • Stale migration guard: on_block_migrated checks block-index membership before processing, preventing fire-and-forget BlockMigrated messages from orphaned forks from re-writing cleared data after recovery
  • Deep reorg logging: ERROR-level alert with structured fields (fork_depth, new_fork_depth, fork_height, current_height) when a reorg crosses the migration boundary, with actionable guidance to investigate peer connectivity

Block pool fix

  • The block pool rejected competing blocks at already-indexed heights as PartOfAPrunedFork, preventing gossip-based delivery during deep reorgs. Now blocks within block_tree_depth of the latest index are allowed through as reorg candidates, letting the block tree (which is authoritative) decide via recover_from_network_partition

Integration tests

  • heavy4_network_partition_recovery — verifies the full recovery flow using real gossip delivery: builds a shared base chain, creates divergent forks, gossips the longer fork, and asserts shared data is preserved, orphaned data is cleared, winning fork data is re-indexed, supply state is correct, block index contains winning fork blocks with valid chain linkage across the recovery boundary, and the node continues mining with consistent supply state after recovery
  • heavy4_partition_recovery_epoch_boundary — verifies partition assignment rollback when a reorg crosses an epoch boundary with different pledge commitments on each fork. Two nodes each post a unique pledge on their fork and mine through an epoch boundary. After reorg, verifies commitment state, assignment counts (6→5, 1→2), storage module reassignment, and Entropy intervals proving re-pack executed
  • heavy4_partition_recovery_multi_epoch — exercises a reorg crossing 2 epoch boundaries (epochs at heights 4 and 6). Each node posts 2 pledges across both epochs. After reorg, verifies epoch_height==6 (final epoch on winning fork), both genesis fork pledges absent, both peer fork pledges present with partition_hashes, assignment counts revert (7→5, 1→3), and system continues mining through the next epoch boundary

Each test has a distinct focus with minimal overlap: partition_recovery owns the data layer and deep recovery path, epoch_boundary owns storage module re-packing verification, and multi_epoch owns multi-epoch snapshot correctness.

Test plan

  • cargo nextest run -p irys-chain-tests heavy4_network_partition_recovery — passes (~27s)
  • cargo nextest run -p irys-chain-tests heavy4_partition_recovery_epoch_boundary — passes (~20s)
  • cargo nextest run -p irys-chain-tests heavy4_partition_recovery_multi_epoch — passes (~30s)
  • All 21 block_pool unit tests pass
  • All 13 fork_recovery/cascade_reorg tests pass (no regressions)
  • cargo fmt and cargo clippy clean

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Nodes can perform surgical recovery after deep network partitions: roll back and reconcile block index, storage mappings, and supply to match the adopted chain.
    • Reorg handling improved to trigger recovery for deep forks and preserve canonical shared data.
  • Bug Fixes

    • Prevents applying stale migration events during reorgs; migration now follows block-encoded ledger ordering.
  • Tests

    • Added end-to-end partition recovery tests, including epoch-boundary and multi-epoch scenarios.

Review Change Stack

…grate their chunks

map_storage_modules_to_partition_assignments() only processed Publish,
Submit, and Capacity partitions — OneYear and ThirtyDay assignments
existed in the epoch snapshot but were never forwarded to the
StorageModuleService.

on_block_migrated() only extracted Submit and Publish ledger transactions
during chunk migration, so OneYear and ThirtyDay chunks were never
written to storage modules.
When a reorg exceeds block_migration_depth, blocks have already been
migrated to the block index with chunks assigned to storage module
partitions. recover_from_network_partition() handles this by:

1. Collecting orphaned block chunk ranges, tx data_roots, and rewards
2. Clearing ChunkPathHashesByOffset and DataRootInfosByDataRoot entries
   for orphaned offsets/transactions only (preserving common ancestor data)
3. Marking orphaned offsets as Uninitialized so repacking can proceed
4. Truncating the block index to the fork point
5. Rolling back supply state cumulative_emitted

The normal migration path then re-processes the new canonical chain.
The block pool previously rejected all blocks at already-indexed heights
with different hashes as PartOfAPrunedFork. This prevented gossip-based
delivery of competing fork blocks during deep reorgs, forcing callers to
bypass the block pool entirely via send_full_block.

The block tree already handles deep reorgs correctly via
recover_from_network_partition — the block pool was the only barrier.

Change block_status() to classify blocks at indexed heights with
different hashes based on proximity to the latest index:
- In the tree already → ProcessedButCanBeReorganized (fork in progress)
- Within block_tree_depth → NotProcessed (potential deep reorg candidate)
- Beyond block_tree_depth → PartOfAPrunedFork (DoS protection)

This resolves the TODO: "this needs to handle migrated block reorgs".
Replace send_full_block with gossip_block_to_peers for delivering the
peer's fork blocks to genesis. This exercises the real P2P code path
that production nodes use when reconnecting after a network partition:
gossip header → pull block body → pull ETH payload → validate → reorg.

Remove the manual data tx/chunk pre-ingestion since the gossip pull
mechanism handles block body (including tx headers) automatically.
@coderabbitai

coderabbitai Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds network-partition recovery: actors accept a storage-modules guard, BlockTree triggers BlockMigrationService recovery on deep reorgs, chunk migration gains staleness guards, block-status logic refines reorg classification, storage/supply/index APIs expose rollback helpers, and multi-node tests validate surgical recovery.

Changes

Network Partition Recovery

Layer / File(s) Summary
Actor messages, reorg wiring & startup
crates/actors/src/block_tree_service.rs, crates/chain/src/chain.rs
Adds BlockTreeServiceMessage::SetStorageModulesGuard(StorageModulesReadGuard), forwards guard to BlockMigrationService::set_storage_modules_guard(), sends guard during init, and invokes recover_from_network_partition() when reorg depth exceeds migration threshold.
Chunk migration safety
crates/actors/src/chunk_migration_service.rs
on_block_migrated() now verifies the incoming block is still canonical before migrating and iterates block.data_ledgers deterministically, skipping unknown/missing ledgers.
Block status decision & tests
crates/p2p/src/block_status_provider.rs, crates/p2p/src/tests/*
Adds max_reorg_depth from block_tree_depth, refines mismatch handling to return ProcessedButCanBeReorganized/NotProcessed/PartOfAPrunedFork, updates constructor signature, and adjusts tests to pass a smaller block_tree_depth.
Integration tests: partition recovery
crates/chain-tests/src/multi_node/partition_recovery*.rs, crates/chain-tests/src/multi_node/mod.rs
Adds three heavy multi-node Tokio tests exercising network partition, epoch-boundary, and multi-epoch reorg recovery; includes helpers to post chunked txs and assert storage-module, supply, and block-index state.
Storage, supply, and block-index APIs
crates/domain/src/models/storage_module.rs, crates/domain/src/models/supply_state.rs, crates/domain/src/models/block_index.rs
Exposes StorageModule interval APIs, makes range-conversion and interval mutation methods public, adds SupplyState::rollback_reward(), and adds BlockIndex::truncate_to_height() for DB-backed truncation used during recovery.

Sequence Diagram

sequenceDiagram
    actor Node as Node (Genesis)
    participant BT as BlockTreeService
    participant BM as BlockMigrationService
    participant CM as ChunkMigrationService
    participant SM as StorageModule
    participant BI as BlockIndex
    participant SS as SupplyState

    Node->>BT: SetStorageModulesGuard(guard)
    BT->>BM: set_storage_modules_guard(guard)
    BM->>BM: store guard

    Node->>BT: Process reorg (old_fork_len > migration_depth)
    BT->>BM: recover_from_network_partition(fork_height)

    BM->>SM: gather orphaned ledger ranges & orphaned data_roots
    BM->>SM: clear offset/index mappings for orphaned txs
    BM->>SM: cut affected chunk intervals -> mark Uninitialized
    SM->>SM: write_intervals_to_submodules()

    BM->>BI: truncate_to_height(fork_parent)
    BI->>BI: delete_block_index_range(fork_parent+1..)

    BM->>SS: rollback_reward(fork_parent, reward_sum)
    SS->>SS: decrement cumulative_emitted

    BM-->>BT: Return success
    BT->>BT: Continue reorg metrics & broadcast
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • Irys-xyz/irys#1143: Modifies reorg handling in BlockTreeService; related to recovery integration.
  • Irys-xyz/irys#1417: Touches BlockStatusProvider and block-pool gating logic; overlaps block-status decision changes.
  • Irys-xyz/irys#1104: Adds DB helper delete_block_index_range used by BlockIndex::truncate_to_height.

Suggested reviewers

  • JesseTheRobot
  • glottologist
  • antouhou
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: network partition recovery with deep reorg support and integration tests, which aligns with the primary objectives of the PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dmac/network-partition-recovery

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@DanMacDonald DanMacDonald changed the title Network partition recovery: deep reorg support + integration test feat: Network partition recovery: deep reorg support + integration test May 1, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/domain/src/models/storage_module.rs (1)

1631-1641: ⚠️ Potential issue | 🟠 Major

Compute intersection before subtraction to prevent underflow on overlapping ranges.

make_range_partition_relative performs unchecked subtraction that can underflow when callers pass ranges that extend before the storage module's range. Although get_overlapped_storage_modules filters for overlapping modules, overlap does not guarantee the input range starts within the module's bounds (e.g., range [50, 150] overlaps module [100, 200], but 50 - 100 underflows). This causes a panic in debug builds and silent wrapping in release builds.

The index_transaction_data method already demonstrates the correct pattern: compute the intersection before conversion. Apply the same approach here.

Proposed fix
 pub fn make_range_partition_relative(
     &self,
     chunk_range: LedgerChunkRange,
 ) -> eyre::Result<PartitionChunkRange> {
     let storage_module_range = self.get_storage_module_ledger_offsets()?;
-    let start = chunk_range.start() - storage_module_range.start();
-    let end = chunk_range.end() - storage_module_range.start();
+    let overlap = storage_module_range
+        .intersection(&chunk_range)
+        .ok_or_else(|| eyre::eyre!("chunk range does not overlap storage module range"))?;
+    let start = overlap.start() - storage_module_range.start();
+    let end = overlap.end() - storage_module_range.start();
     Ok(PartitionChunkRange(ii(
         PartitionChunkOffset::from(start),
         PartitionChunkOffset::from(end),
     )))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/domain/src/models/storage_module.rs` around lines 1631 - 1641, The
function make_range_partition_relative currently subtracts storage offsets from
chunk_range directly which can underflow; change it to first compute the
intersection between chunk_range and the storage module range (using the same
pattern as index_transaction_data, e.g. let intersection =
chunk_range.intersection(&storage_module_range) or the project’s equivalent),
return an error if there is no intersection, then compute relative offsets from
that intersection (let start = intersection.start() -
storage_module_range.start(); let end = intersection.end() -
storage_module_range.start()) and construct the PartitionChunkRange from those
relative values; keep references to get_storage_module_ledger_offsets and
make_range_partition_relative to locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/actors/src/block_migration_service.rs`:
- Line 163: Replace the unchecked cast "rollback_count as usize" with a checked
conversion using TryFrom/TryInto: compute a capacity via "let capacity =
usize::try_from(rollback_count)?;" (or "let capacity =
rollback_count.try_into()?;" if using std::convert::TryInto) and then call
"Vec::with_capacity(capacity)"; if the surrounding function doesn't return
Result, propagate or handle the conversion error appropriately (e.g., map it
into the function's error type and return early). This change touches the
"rollback_count" conversion and the "rollback_infos" initialization
(Vec::with_capacity).

In `@crates/actors/src/chunk_migration_service.rs`:
- Around line 169-183: The preflight canonicality check using
block_index.get_item in the chunk migration handler is vulnerable to TOCTOU with
recover_from_network_partition; add a shared serialization point between
recovery and migration (either a migration generation/epoch or a shared lock)
and validate that generation/lock after the check and immediately before any
writes so stale BlockMigrated messages cannot re-create cleared state.
Concretely: introduce a migration_epoch (or use an existing RecoveryManager)
that is incremented by recover_from_network_partition, read the epoch in the
handler (alongside the block_index.get_item check) and re-check the epoch
immediately before performing index/chunk writes (or acquire the same
Mutex/RwLock used by recovery around validation+write). Update the chunk
migration code paths (the BlockMigrated handler and any functions that mutate
indexes/chunks) to abort when the epoch changed or lock cannot be held.

In `@crates/chain-tests/src/multi_node/partition_recovery.rs`:
- Around line 405-409: The verification vector uses the wrong ledger for
peer_publish_tx; replace DataLedger::Submit with DataLedger::Publish in the
peer_unique_txs definition so the tuple for peer_publish_tx matches the ledger
it was posted to (peer_publish_tx, DataLedger::Publish), keeping the other
entries (peer_one_year_tx -> DataLedger::OneYear, peer_thirty_day_tx ->
DataLedger::ThirtyDay) unchanged.
- Around line 342-346: The tuple list shared_txs uses the wrong ledger for
shared_publish_tx (it currently pairs shared_publish_tx with
DataLedger::Submit); update the tuple so shared_publish_tx is paired with
DataLedger::Publish (i.e., change the first tuple to (DataLedger::Publish,
&shared_publish_tx)) so the verification code (using shared_txs) queries the
same ledger that created the transaction.

---

Outside diff comments:
In `@crates/domain/src/models/storage_module.rs`:
- Around line 1631-1641: The function make_range_partition_relative currently
subtracts storage offsets from chunk_range directly which can underflow; change
it to first compute the intersection between chunk_range and the storage module
range (using the same pattern as index_transaction_data, e.g. let intersection =
chunk_range.intersection(&storage_module_range) or the project’s equivalent),
return an error if there is no intersection, then compute relative offsets from
that intersection (let start = intersection.start() -
storage_module_range.start(); let end = intersection.end() -
storage_module_range.start()) and construct the PartitionChunkRange from those
relative values; keep references to get_storage_module_ledger_offsets and
make_range_partition_relative to locate the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 921cee82-83fd-4bbb-b302-904d181b8b15

📥 Commits

Reviewing files that changed from the base of the PR and between fd107a5 and 7dd9994.

📒 Files selected for processing (13)
  • crates/actors/src/block_migration_service.rs
  • crates/actors/src/block_tree_service.rs
  • crates/actors/src/chunk_migration_service.rs
  • crates/chain-tests/src/multi_node/mod.rs
  • crates/chain-tests/src/multi_node/partition_recovery.rs
  • crates/chain/src/chain.rs
  • crates/domain/src/models/block_index.rs
  • crates/domain/src/models/storage_module.rs
  • crates/domain/src/models/supply_state.rs
  • crates/domain/src/snapshots/epoch_snapshot/epoch_snapshot.rs
  • crates/p2p/src/block_status_provider.rs
  • crates/p2p/src/tests/block_pool/mod.rs
  • crates/p2p/src/tests/util.rs

Comment thread crates/actors/src/block_migration_service.rs
Comment thread crates/actors/src/chunk_migration_service.rs
Comment thread crates/chain-tests/src/multi_node/partition_recovery.rs
Comment thread crates/chain-tests/src/multi_node/partition_recovery.rs
@github-actions

github-actions Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Adds heavy4_partition_recovery_epoch_boundary which verifies that
partition assignments are correctly rolled back when a reorg crosses
an epoch boundary where each fork processed different pledge
commitments. Tracks the specific affected storage module through the
reorg and asserts it gets reassigned to a canonical partition with
Entropy intervals from re-packing.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/chain-tests/src/multi_node/partition_recovery_epoch_boundary.rs`:
- Around line 26-29: Change this test to start from the shared testing builders:
call NodeConfig::testing() to create genesis_config (instead of
NodeConfig::testing_with_epochs) and then set the epoch length on its consensus
config via consensus = ConsensusConfig::testing() with the epoch length override
(num_blocks_in_epoch) and set chunk_size explicitly on
genesis_config.consensus.get_mut().chunk_size = 32; in short, replace the
testing_with_epochs path by creating genesis_config = NodeConfig::testing(),
build or derive a ConsensusConfig::testing() with the desired epoch length,
assign/merge it into genesis_config.consensus, and then set chunk_size on
genesis_config.consensus.get_mut().
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8a8c8ddf-50e3-4fdf-a723-0ec66fd70321

📥 Commits

Reviewing files that changed from the base of the PR and between 7dd9994 and 2ade39d.

📒 Files selected for processing (2)
  • crates/chain-tests/src/multi_node/mod.rs
  • crates/chain-tests/src/multi_node/partition_recovery_epoch_boundary.rs

Operators and monitoring agents need a clear signal when a node recovers
from a network partition. This adds an ERROR log with structured fields
(fork_depth, new_fork_depth, fork_height, current_height) and actionable
guidance to investigate peer connectivity.
Two new tests based on Codex security review findings:

- partition_recovery_multi_epoch: reorg crossing 2 epoch boundaries,
  verifies epoch snapshot correctness, partition assignment rollback
  across multiple epochs, and continued mining after reorg.

- partition_recovery_deep: deep reorg triggering recover_from_network_partition
  with block_migration_depth=1, verifies block index consistency, supply
  state rollback/re-migration, chain linkage across recovery boundary,
  and continued block production.
Remove the redundant partition_recovery_deep test and add its unique
checks (block index hash verification, chain linkage across recovery
boundary, continued mining + supply state after recovery) to the
existing heavy4_network_partition_recovery test.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/actors/src/block_tree_service.rs`:
- Around line 685-698: The code currently casts usize -> u32 with `as` for
old_fork_blocks.len() and new_fork_blocks.len(); replace those with checked
conversions using TryFrom/TryInto (e.g. `let old_len: u32 =
old_fork_blocks.len().try_into()?; let new_len: u32 =
new_fork_blocks.len().try_into()?;`) and then use `old_len`/`new_len` in the
error! call and the subtraction; ensure you handle the conversion error (map to
a sensible error return or log and early-return) so the function’s error type or
control flow remains correct; update references to
`self.config.consensus.block_migration_depth` comparisons to use the converted
values or convert that config value to the same type before comparing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0f69e9db-93a6-45a5-9420-e4a1842d8877

📥 Commits

Reviewing files that changed from the base of the PR and between 2ade39d and 242833d.

📒 Files selected for processing (5)
  • crates/actors/src/block_tree_service.rs
  • crates/chain-tests/src/multi_node/mod.rs
  • crates/chain-tests/src/multi_node/partition_recovery_deep.rs
  • crates/chain-tests/src/multi_node/partition_recovery_epoch_boundary.rs
  • crates/chain-tests/src/multi_node/partition_recovery_multi_epoch.rs

Comment on lines +685 to +698
if old_fork_blocks.len() as u32 > self.config.consensus.block_migration_depth {
error!(
fork_depth = old_fork_blocks.len(),
new_fork_depth = new_fork_blocks.len(),
fork_height,
current_height = arc_block.height,
"NETWORK PARTITION RECOVERY: this node was isolated from the network \
and fell behind the canonical chain. Deep reorg rolling back {} \
migrated blocks. Investigate peer connectivity — check firewall \
rules, network routes, and gossip health between this node and \
its peers.",
old_fork_blocks.len() as u32
- self.config.consensus.block_migration_depth,
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Use checked conversions for runtime usize → u32 casts.

Per codebase conventions, production code should use try_from/try_into for non-literal numeric casts rather than as. While overflow is unlikely in practice (would require billions of fork blocks), using checked conversion is defensive and consistent with the established policy.

♻️ Proposed fix
-                    if old_fork_blocks.len() as u32 > self.config.consensus.block_migration_depth {
+                    let old_fork_len = u32::try_from(old_fork_blocks.len()).unwrap_or(u32::MAX);
+                    if old_fork_len > self.config.consensus.block_migration_depth {
                         error!(
-                            fork_depth = old_fork_blocks.len(),
+                            fork_depth = old_fork_len,
                             new_fork_depth = new_fork_blocks.len(),
                             fork_height,
                             current_height = arc_block.height,
                             "NETWORK PARTITION RECOVERY: this node was isolated from the network \
                              and fell behind the canonical chain. Deep reorg rolling back {} \
                              migrated blocks. Investigate peer connectivity — check firewall \
                              rules, network routes, and gossip health between this node and \
                              its peers.",
-                            old_fork_blocks.len() as u32
-                                - self.config.consensus.block_migration_depth,
+                            old_fork_len.saturating_sub(self.config.consensus.block_migration_depth),
                         );

Based on learnings: "In this codebase's production Rust code, do not use as numeric casts... Prefer checked conversions."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if old_fork_blocks.len() as u32 > self.config.consensus.block_migration_depth {
error!(
fork_depth = old_fork_blocks.len(),
new_fork_depth = new_fork_blocks.len(),
fork_height,
current_height = arc_block.height,
"NETWORK PARTITION RECOVERY: this node was isolated from the network \
and fell behind the canonical chain. Deep reorg rolling back {} \
migrated blocks. Investigate peer connectivity — check firewall \
rules, network routes, and gossip health between this node and \
its peers.",
old_fork_blocks.len() as u32
- self.config.consensus.block_migration_depth,
);
let old_fork_len = u32::try_from(old_fork_blocks.len()).unwrap_or(u32::MAX);
if old_fork_len > self.config.consensus.block_migration_depth {
error!(
fork_depth = old_fork_len,
new_fork_depth = new_fork_blocks.len(),
fork_height,
current_height = arc_block.height,
"NETWORK PARTITION RECOVERY: this node was isolated from the network \
and fell behind the canonical chain. Deep reorg rolling back {} \
migrated blocks. Investigate peer connectivity — check firewall \
rules, network routes, and gossip health between this node and \
its peers.",
old_fork_len.saturating_sub(self.config.consensus.block_migration_depth),
);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/actors/src/block_tree_service.rs` around lines 685 - 698, The code
currently casts usize -> u32 with `as` for old_fork_blocks.len() and
new_fork_blocks.len(); replace those with checked conversions using
TryFrom/TryInto (e.g. `let old_len: u32 = old_fork_blocks.len().try_into()?; let
new_len: u32 = new_fork_blocks.len().try_into()?;`) and then use
`old_len`/`new_len` in the error! call and the subtraction; ensure you handle
the conversion error (map to a sensible error return or log and early-return) so
the function’s error type or control flow remains correct; update references to
`self.config.consensus.block_migration_depth` comparisons to use the converted
values or convert that config value to the same type before comparing.

@DanMacDonald DanMacDonald changed the title feat: Network partition recovery: deep reorg support + integration test feat: Network partition recovery: deep reorg support + integration tests May 3, 2026
…ministic iteration

- block_tree_service: keep network partition recovery path over master's
  controlled-shutdown approach
- chunk_migration_service: keep stale-message guard from our branch,
  adopt master's deterministic block.data_ledgers iteration order

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/chain-tests/src/multi_node/partition_recovery.rs`:
- Around line 185-191: The test discards the return from post_tx_with_chunks
when posting genesis_publish_chunks so the orphaned Publish/Submit data_root
from the genesis minority fork is never checked; capture and save the result of
post_tx_with_chunks(&genesis, &genesis_signer, &genesis_publish_chunks,
DataLedger::Publish).await? into a variable (e.g., genesis_publish_tx) and add
that variable to the genesis_orphaned_txs collection alongside
genesis_one_year_tx and genesis_thirty_day_tx so the data_root from the genesis
Publish path is asserted cleared during recovery (mirror the existing
winning-fork re-index checks).
- Around line 458-478: After the two mine_block().await? calls in the
continued-operation phase, wait for the block index migration to complete before
reading genesis.get_block_index_height(): call
genesis.wait_until_block_index_height(peer_height + 2). Then read
final_index_height and proceed with the supply checks (use the same
genesis.get_block_index_height(), final_supply via
genesis.node_ctx.supply_state_guard.get(), and the existing loop over block
heights) so the assertions won't race with asynchronous BlockMigrated
processing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 82a53460-121a-4d23-b475-fad8f49ebe01

📥 Commits

Reviewing files that changed from the base of the PR and between 242833d and 725de6f.

📒 Files selected for processing (6)
  • crates/actors/src/block_tree_service.rs
  • crates/actors/src/chunk_migration_service.rs
  • crates/chain-tests/src/multi_node/mod.rs
  • crates/chain-tests/src/multi_node/partition_recovery.rs
  • crates/chain/src/chain.rs
  • crates/p2p/src/tests/util.rs

Comment on lines +185 to +191
post_tx_with_chunks(
&genesis,
&genesis_signer,
&genesis_publish_chunks,
DataLedger::Publish,
)
.await?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Orphaned genesis Publish tx data_root not verified for clearing — test coverage gap.

The post_tx_with_chunks result for genesis_publish_chunks (line 185) is discarded. Because the variable is absent, genesis_orphaned_txs (lines 409–411) only asserts that genesis_one_year_tx and genesis_thirty_day_tx data_roots were cleared after recovery. The orphaned Publish (and transitively Submit) data_root written by genesis's minority fork is never checked.

The comment at lines 422–424 explains why the interval check is skipped, but there is no analogous note here. Saving the return value and including it in genesis_orphaned_txs would close the gap and make the assertion symmetrical with the winning-fork re-index check (line 427).

🐛 Proposed fix
-    post_tx_with_chunks(
+    let genesis_publish_tx = post_tx_with_chunks(
         &genesis,
         &genesis_signer,
         &genesis_publish_chunks,
         DataLedger::Publish,
     )
     .await?;
     let genesis_orphaned_txs: Vec<(DataLedger, &DataTransaction)> = vec![
+        (DataLedger::Submit, &genesis_publish_tx),
         (DataLedger::OneYear, &genesis_one_year_tx),
         (DataLedger::ThirtyDay, &genesis_thirty_day_tx),
     ];

Also applies to: 408-420

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/chain-tests/src/multi_node/partition_recovery.rs` around lines 185 -
191, The test discards the return from post_tx_with_chunks when posting
genesis_publish_chunks so the orphaned Publish/Submit data_root from the genesis
minority fork is never checked; capture and save the result of
post_tx_with_chunks(&genesis, &genesis_signer, &genesis_publish_chunks,
DataLedger::Publish).await? into a variable (e.g., genesis_publish_tx) and add
that variable to the genesis_orphaned_txs collection alongside
genesis_one_year_tx and genesis_thirty_day_tx so the data_root from the genesis
Publish path is asserted cleared during recovery (mirror the existing
winning-fork re-index checks).

Comment on lines +458 to +478
let final_index_height = genesis.get_block_index_height();
assert!(
final_index_height > peer_height,
"Block index should continue advancing after recovery, got {final_index_height}"
);

let final_supply = genesis
.node_ctx
.supply_state_guard
.as_ref()
.expect("supply state not set")
.get();
let mut final_expected = irys_types::U256::zero();
for h in 1..=final_index_height {
let block = genesis.get_block_by_height_from_index(h, false)?;
final_expected = final_expected.saturating_add(block.reward_amount);
}
assert_eq!(
final_supply.cumulative_emitted, final_expected,
"Supply state should remain consistent after continued mining"
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing wait_until_block_index_height before reading final_index_height — potential spurious failure.

Every earlier stage in this test correctly calls wait_until_block_index_height before reading or asserting on the block index (lines 137–138, 261–263, 311–313). The continued-operation phase breaks this pattern: genesis.get_block_index_height() is read at line 458 immediately after two mine_block().await? calls, with no wait for migration.

Block migration fires asynchronously (delayed by block_migration_depth confirmations). After mining blocks 10 and 11, mine_block().await? returns as soon as the chain tip advances — the BlockMigrated event for block 10 (which would push the index from 8 → 9 → 10) may not have been processed yet. If the index is still at 8 or 9 when line 458 executes:

  • assert!(final_index_height > peer_height) (> 9) will fail spuriously.
  • The supply loop for h in 1..=final_index_height will compute a smaller sum than final_supply.cumulative_emitted (which was already updated), failing the assert_eq! at line 475.
🐛 Proposed fix
     genesis.mine_block().await?;
     peer.wait_until_height(peer_height + 2, seconds_to_wait)
         .await?;
+    // Block migration is asynchronous; wait for the index to advance before reading.
+    let expected_final_index = peer_height + 1; // block 10 migrated after block 11 confirms it
+    genesis
+        .wait_until_block_index_height(expected_final_index, seconds_to_wait)
+        .await?;

     let final_index_height = genesis.get_block_index_height();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/chain-tests/src/multi_node/partition_recovery.rs` around lines 458 -
478, After the two mine_block().await? calls in the continued-operation phase,
wait for the block index migration to complete before reading
genesis.get_block_index_height(): call
genesis.wait_until_block_index_height(peer_height + 2). Then read
final_index_height and proceed with the supply checks (use the same
genesis.get_block_index_height(), final_supply via
genesis.node_ctx.supply_state_guard.get(), and the existing loop over block
heights) so the assertions won't race with asynchronous BlockMigrated
processing.

Single conflict in block_status_provider.rs:
- Kept our doc comment for PartOfAPrunedFork (describes deep reorg semantics)
- Took master's granular ChainState match for non-indexed blocks (block pool race fix #1417)
- Inlined is_block_in_the_tree() call for deep reorg check (replaces removed local variable)

Merges: MDBX metrics, VDF progress/stall detection, preval perf,
efficient-sampling, block pool race fix, Axiom removal, body-pull
circuit-breaker fix, metrics followup.
…ion-recovery

# Conflicts:
#	crates/actors/src/block_migration_service.rs
@DanMacDonald DanMacDonald merged commit b099906 into master Jun 3, 2026
18 checks passed
@DanMacDonald DanMacDonald deleted the dmac/network-partition-recovery branch June 3, 2026 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant