feat: Network partition recovery: deep reorg support + integration tests by DanMacDonald · Pull Request #1405 · Irys-xyz/irys

DanMacDonald · 2026-05-01T16:27:55Z

Summary

~295 lines of protocol fixes, ~1,470 lines of test code (1,765 total)

Protocol fixes

recover_from_network_partition: When a reorg exceeds block_migration_depth, rolls back migrated state — clears orphaned storage module indexes, truncates the block index to the fork point, rolls back supply state — then re-migrates the winning fork
OneYear/ThirtyDay storage mapping: map_storage_modules_to_partition_assignments() now forwards OneYear and ThirtyDay partition assignments to StorageModuleService, and on_block_migrated() extracts their transactions for chunk migration
Stale migration guard: on_block_migrated checks block-index membership before processing, preventing fire-and-forget BlockMigrated messages from orphaned forks from re-writing cleared data after recovery
Deep reorg logging: ERROR-level alert with structured fields (fork_depth, new_fork_depth, fork_height, current_height) when a reorg crosses the migration boundary, with actionable guidance to investigate peer connectivity

Block pool fix

The block pool rejected competing blocks at already-indexed heights as PartOfAPrunedFork, preventing gossip-based delivery during deep reorgs. Now blocks within block_tree_depth of the latest index are allowed through as reorg candidates, letting the block tree (which is authoritative) decide via recover_from_network_partition

Integration tests

heavy4_network_partition_recovery — verifies the full recovery flow using real gossip delivery: builds a shared base chain, creates divergent forks, gossips the longer fork, and asserts shared data is preserved, orphaned data is cleared, winning fork data is re-indexed, supply state is correct, block index contains winning fork blocks with valid chain linkage across the recovery boundary, and the node continues mining with consistent supply state after recovery
heavy4_partition_recovery_epoch_boundary — verifies partition assignment rollback when a reorg crosses an epoch boundary with different pledge commitments on each fork. Two nodes each post a unique pledge on their fork and mine through an epoch boundary. After reorg, verifies commitment state, assignment counts (6→5, 1→2), storage module reassignment, and Entropy intervals proving re-pack executed
heavy4_partition_recovery_multi_epoch — exercises a reorg crossing 2 epoch boundaries (epochs at heights 4 and 6). Each node posts 2 pledges across both epochs. After reorg, verifies epoch_height==6 (final epoch on winning fork), both genesis fork pledges absent, both peer fork pledges present with partition_hashes, assignment counts revert (7→5, 1→3), and system continues mining through the next epoch boundary

Each test has a distinct focus with minimal overlap: partition_recovery owns the data layer and deep recovery path, epoch_boundary owns storage module re-packing verification, and multi_epoch owns multi-epoch snapshot correctness.

Test plan

cargo nextest run -p irys-chain-tests heavy4_network_partition_recovery — passes (~27s)
cargo nextest run -p irys-chain-tests heavy4_partition_recovery_epoch_boundary — passes (~20s)
cargo nextest run -p irys-chain-tests heavy4_partition_recovery_multi_epoch — passes (~30s)
All 21 block_pool unit tests pass
All 13 fork_recovery/cascade_reorg tests pass (no regressions)
cargo fmt and cargo clippy clean

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Nodes can perform surgical recovery after deep network partitions: roll back and reconcile block index, storage mappings, and supply to match the adopted chain.
- Reorg handling improved to trigger recovery for deep forks and preserve canonical shared data.
Bug Fixes
- Prevents applying stale migration events during reorgs; migration now follows block-encoded ledger ordering.
Tests
- Added end-to-end partition recovery tests, including epoch-boundary and multi-epoch scenarios.

…grate their chunks map_storage_modules_to_partition_assignments() only processed Publish, Submit, and Capacity partitions — OneYear and ThirtyDay assignments existed in the epoch snapshot but were never forwarded to the StorageModuleService. on_block_migrated() only extracted Submit and Publish ledger transactions during chunk migration, so OneYear and ThirtyDay chunks were never written to storage modules.

When a reorg exceeds block_migration_depth, blocks have already been migrated to the block index with chunks assigned to storage module partitions. recover_from_network_partition() handles this by: 1. Collecting orphaned block chunk ranges, tx data_roots, and rewards 2. Clearing ChunkPathHashesByOffset and DataRootInfosByDataRoot entries for orphaned offsets/transactions only (preserving common ancestor data) 3. Marking orphaned offsets as Uninitialized so repacking can proceed 4. Truncating the block index to the fork point 5. Rolling back supply state cumulative_emitted The normal migration path then re-processes the new canonical chain.

The block pool previously rejected all blocks at already-indexed heights with different hashes as PartOfAPrunedFork. This prevented gossip-based delivery of competing fork blocks during deep reorgs, forcing callers to bypass the block pool entirely via send_full_block. The block tree already handles deep reorgs correctly via recover_from_network_partition — the block pool was the only barrier. Change block_status() to classify blocks at indexed heights with different hashes based on proximity to the latest index: - In the tree already → ProcessedButCanBeReorganized (fork in progress) - Within block_tree_depth → NotProcessed (potential deep reorg candidate) - Beyond block_tree_depth → PartOfAPrunedFork (DoS protection) This resolves the TODO: "this needs to handle migrated block reorgs".

Replace send_full_block with gossip_block_to_peers for delivering the peer's fork blocks to genesis. This exercises the real P2P code path that production nodes use when reconnecting after a network partition: gossip header → pull block body → pull ETH payload → validate → reorg. Remove the manual data tx/chunk pre-ingestion since the gossip pull mechanism handles block body (including tx headers) automatically.

coderabbitai · 2026-05-01T16:28:08Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds network-partition recovery: actors accept a storage-modules guard, BlockTree triggers BlockMigrationService recovery on deep reorgs, chunk migration gains staleness guards, block-status logic refines reorg classification, storage/supply/index APIs expose rollback helpers, and multi-node tests validate surgical recovery.

Changes

Network Partition Recovery

Layer / File(s)	Summary
Actor messages, reorg wiring & startup `crates/actors/src/block_tree_service.rs`, `crates/chain/src/chain.rs`	Adds `BlockTreeServiceMessage::SetStorageModulesGuard(StorageModulesReadGuard)`, forwards guard to `BlockMigrationService::set_storage_modules_guard()`, sends guard during init, and invokes `recover_from_network_partition()` when reorg depth exceeds migration threshold.
Chunk migration safety `crates/actors/src/chunk_migration_service.rs`	`on_block_migrated()` now verifies the incoming block is still canonical before migrating and iterates `block.data_ledgers` deterministically, skipping unknown/missing ledgers.
Block status decision & tests `crates/p2p/src/block_status_provider.rs`, `crates/p2p/src/tests/*`	Adds `max_reorg_depth` from `block_tree_depth`, refines mismatch handling to return `ProcessedButCanBeReorganized`/`NotProcessed`/`PartOfAPrunedFork`, updates constructor signature, and adjusts tests to pass a smaller `block_tree_depth`.
Integration tests: partition recovery `crates/chain-tests/src/multi_node/partition_recovery*.rs`, `crates/chain-tests/src/multi_node/mod.rs`	Adds three heavy multi-node Tokio tests exercising network partition, epoch-boundary, and multi-epoch reorg recovery; includes helpers to post chunked txs and assert storage-module, supply, and block-index state.
Storage, supply, and block-index APIs `crates/domain/src/models/storage_module.rs`, `crates/domain/src/models/supply_state.rs`, `crates/domain/src/models/block_index.rs`	Exposes `StorageModule` interval APIs, makes range-conversion and interval mutation methods public, adds `SupplyState::rollback_reward()`, and adds `BlockIndex::truncate_to_height()` for DB-backed truncation used during recovery.

Sequence Diagram

sequenceDiagram
    actor Node as Node (Genesis)
    participant BT as BlockTreeService
    participant BM as BlockMigrationService
    participant CM as ChunkMigrationService
    participant SM as StorageModule
    participant BI as BlockIndex
    participant SS as SupplyState

    Node->>BT: SetStorageModulesGuard(guard)
    BT->>BM: set_storage_modules_guard(guard)
    BM->>BM: store guard

    Node->>BT: Process reorg (old_fork_len > migration_depth)
    BT->>BM: recover_from_network_partition(fork_height)

    BM->>SM: gather orphaned ledger ranges & orphaned data_roots
    BM->>SM: clear offset/index mappings for orphaned txs
    BM->>SM: cut affected chunk intervals -> mark Uninitialized
    SM->>SM: write_intervals_to_submodules()

    BM->>BI: truncate_to_height(fork_parent)
    BI->>BI: delete_block_index_range(fork_parent+1..)

    BM->>SS: rollback_reward(fork_parent, reward_sum)
    SS->>SS: decrement cumulative_emitted

    BM-->>BT: Return success
    BT->>BT: Continue reorg metrics & broadcast

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Irys-xyz/irys#1143: Modifies reorg handling in BlockTreeService; related to recovery integration.
Irys-xyz/irys#1417: Touches BlockStatusProvider and block-pool gating logic; overlaps block-status decision changes.
Irys-xyz/irys#1104: Adds DB helper delete_block_index_range used by BlockIndex::truncate_to_height.

Suggested reviewers

JesseTheRobot
glottologist
antouhou

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: network partition recovery with deep reorg support and integration tests, which aligns with the primary objectives of the PR.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dmac/network-partition-recovery

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/domain/src/models/storage_module.rs (1)
1631-1641: ⚠️ Potential issue | 🟠 Major

Compute intersection before subtraction to prevent underflow on overlapping ranges.

make_range_partition_relative performs unchecked subtraction that can underflow when callers pass ranges that extend before the storage module's range. Although get_overlapped_storage_modules filters for overlapping modules, overlap does not guarantee the input range starts within the module's bounds (e.g., range [50, 150] overlaps module [100, 200], but 50 - 100 underflows). This causes a panic in debug builds and silent wrapping in release builds.

The index_transaction_data method already demonstrates the correct pattern: compute the intersection before conversion. Apply the same approach here.
Proposed fix
 pub fn make_range_partition_relative(
     &self,
     chunk_range: LedgerChunkRange,
 ) -> eyre::Result<PartitionChunkRange> {
     let storage_module_range = self.get_storage_module_ledger_offsets()?;
-    let start = chunk_range.start() - storage_module_range.start();
-    let end = chunk_range.end() - storage_module_range.start();
+    let overlap = storage_module_range
+        .intersection(&chunk_range)
+        .ok_or_else(|| eyre::eyre!("chunk range does not overlap storage module range"))?;
+    let start = overlap.start() - storage_module_range.start();
+    let end = overlap.end() - storage_module_range.start();
     Ok(PartitionChunkRange(ii(
         PartitionChunkOffset::from(start),
         PartitionChunkOffset::from(end),
     )))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/domain/src/models/storage_module.rs` around lines 1631 - 1641, The
function make_range_partition_relative currently subtracts storage offsets from
chunk_range directly which can underflow; change it to first compute the
intersection between chunk_range and the storage module range (using the same
pattern as index_transaction_data, e.g. let intersection =
chunk_range.intersection(&storage_module_range) or the project’s equivalent),
return an error if there is no intersection, then compute relative offsets from
that intersection (let start = intersection.start() -
storage_module_range.start(); let end = intersection.end() -
storage_module_range.start()) and construct the PartitionChunkRange from those
relative values; keep references to get_storage_module_ledger_offsets and
make_range_partition_relative to locate the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/actors/src/block_migration_service.rs`:
- Line 163: Replace the unchecked cast "rollback_count as usize" with a checked
conversion using TryFrom/TryInto: compute a capacity via "let capacity =
usize::try_from(rollback_count)?;" (or "let capacity =
rollback_count.try_into()?;" if using std::convert::TryInto) and then call
"Vec::with_capacity(capacity)"; if the surrounding function doesn't return
Result, propagate or handle the conversion error appropriately (e.g., map it
into the function's error type and return early). This change touches the
"rollback_count" conversion and the "rollback_infos" initialization
(Vec::with_capacity).

In `@crates/actors/src/chunk_migration_service.rs`:
- Around line 169-183: The preflight canonicality check using
block_index.get_item in the chunk migration handler is vulnerable to TOCTOU with
recover_from_network_partition; add a shared serialization point between
recovery and migration (either a migration generation/epoch or a shared lock)
and validate that generation/lock after the check and immediately before any
writes so stale BlockMigrated messages cannot re-create cleared state.
Concretely: introduce a migration_epoch (or use an existing RecoveryManager)
that is incremented by recover_from_network_partition, read the epoch in the
handler (alongside the block_index.get_item check) and re-check the epoch
immediately before performing index/chunk writes (or acquire the same
Mutex/RwLock used by recovery around validation+write). Update the chunk
migration code paths (the BlockMigrated handler and any functions that mutate
indexes/chunks) to abort when the epoch changed or lock cannot be held.

In `@crates/chain-tests/src/multi_node/partition_recovery.rs`:
- Around line 405-409: The verification vector uses the wrong ledger for
peer_publish_tx; replace DataLedger::Submit with DataLedger::Publish in the
peer_unique_txs definition so the tuple for peer_publish_tx matches the ledger
it was posted to (peer_publish_tx, DataLedger::Publish), keeping the other
entries (peer_one_year_tx -> DataLedger::OneYear, peer_thirty_day_tx ->
DataLedger::ThirtyDay) unchanged.
- Around line 342-346: The tuple list shared_txs uses the wrong ledger for
shared_publish_tx (it currently pairs shared_publish_tx with
DataLedger::Submit); update the tuple so shared_publish_tx is paired with
DataLedger::Publish (i.e., change the first tuple to (DataLedger::Publish,
&shared_publish_tx)) so the verification code (using shared_txs) queries the
same ledger that created the transaction.

---

Outside diff comments:
In `@crates/domain/src/models/storage_module.rs`:
- Around line 1631-1641: The function make_range_partition_relative currently
subtracts storage offsets from chunk_range directly which can underflow; change
it to first compute the intersection between chunk_range and the storage module
range (using the same pattern as index_transaction_data, e.g. let intersection =
chunk_range.intersection(&storage_module_range) or the project’s equivalent),
return an error if there is no intersection, then compute relative offsets from
that intersection (let start = intersection.start() -
storage_module_range.start(); let end = intersection.end() -
storage_module_range.start()) and construct the PartitionChunkRange from those
relative values; keep references to get_storage_module_ledger_offsets and
make_range_partition_relative to locate the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 921cee82-83fd-4bbb-b302-904d181b8b15

📥 Commits

Reviewing files that changed from the base of the PR and between fd107a5 and 7dd9994.

📒 Files selected for processing (13)

crates/actors/src/block_migration_service.rs
crates/actors/src/block_tree_service.rs
crates/actors/src/chunk_migration_service.rs
crates/chain-tests/src/multi_node/mod.rs
crates/chain-tests/src/multi_node/partition_recovery.rs
crates/chain/src/chain.rs
crates/domain/src/models/block_index.rs
crates/domain/src/models/storage_module.rs
crates/domain/src/models/supply_state.rs
crates/domain/src/snapshots/epoch_snapshot/epoch_snapshot.rs
crates/p2p/src/block_status_provider.rs
crates/p2p/src/tests/block_pool/mod.rs
crates/p2p/src/tests/util.rs

github-actions · 2026-05-01T16:43:23Z

Benchmark results: https://irys-xyz.github.io/irys/dev/bench/dmac%2Fnetwork-partition-recovery/index.html

Adds heavy4_partition_recovery_epoch_boundary which verifies that partition assignments are correctly rolled back when a reorg crosses an epoch boundary where each fork processed different pledge commitments. Tracks the specific affected storage module through the reorg and asserts it gets reassigned to a canonical partition with Entropy intervals from re-packing.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/chain-tests/src/multi_node/partition_recovery_epoch_boundary.rs`:
- Around line 26-29: Change this test to start from the shared testing builders:
call NodeConfig::testing() to create genesis_config (instead of
NodeConfig::testing_with_epochs) and then set the epoch length on its consensus
config via consensus = ConsensusConfig::testing() with the epoch length override
(num_blocks_in_epoch) and set chunk_size explicitly on
genesis_config.consensus.get_mut().chunk_size = 32; in short, replace the
testing_with_epochs path by creating genesis_config = NodeConfig::testing(),
build or derive a ConsensusConfig::testing() with the desired epoch length,
assign/merge it into genesis_config.consensus, and then set chunk_size on
genesis_config.consensus.get_mut().

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8a8c8ddf-50e3-4fdf-a723-0ec66fd70321

📥 Commits

Reviewing files that changed from the base of the PR and between 7dd9994 and 2ade39d.

📒 Files selected for processing (2)

crates/chain-tests/src/multi_node/mod.rs
crates/chain-tests/src/multi_node/partition_recovery_epoch_boundary.rs

Operators and monitoring agents need a clear signal when a node recovers from a network partition. This adds an ERROR log with structured fields (fork_depth, new_fork_depth, fork_height, current_height) and actionable guidance to investigate peer connectivity.

Two new tests based on Codex security review findings: - partition_recovery_multi_epoch: reorg crossing 2 epoch boundaries, verifies epoch snapshot correctness, partition assignment rollback across multiple epochs, and continued mining after reorg. - partition_recovery_deep: deep reorg triggering recover_from_network_partition with block_migration_depth=1, verifies block index consistency, supply state rollback/re-migration, chain linkage across recovery boundary, and continued block production.

Remove the redundant partition_recovery_deep test and add its unique checks (block index hash verification, chain linkage across recovery boundary, continued mining + supply state after recovery) to the existing heavy4_network_partition_recovery test.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/actors/src/block_tree_service.rs`:
- Around line 685-698: The code currently casts usize -> u32 with `as` for
old_fork_blocks.len() and new_fork_blocks.len(); replace those with checked
conversions using TryFrom/TryInto (e.g. `let old_len: u32 =
old_fork_blocks.len().try_into()?; let new_len: u32 =
new_fork_blocks.len().try_into()?;`) and then use `old_len`/`new_len` in the
error! call and the subtraction; ensure you handle the conversion error (map to
a sensible error return or log and early-return) so the function’s error type or
control flow remains correct; update references to
`self.config.consensus.block_migration_depth` comparisons to use the converted
values or convert that config value to the same type before comparing.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0f69e9db-93a6-45a5-9420-e4a1842d8877

📥 Commits

Reviewing files that changed from the base of the PR and between 2ade39d and 242833d.

📒 Files selected for processing (5)

crates/actors/src/block_tree_service.rs
crates/chain-tests/src/multi_node/mod.rs
crates/chain-tests/src/multi_node/partition_recovery_deep.rs
crates/chain-tests/src/multi_node/partition_recovery_epoch_boundary.rs
crates/chain-tests/src/multi_node/partition_recovery_multi_epoch.rs

coderabbitai · 2026-05-03T15:21:00Z

+                    if old_fork_blocks.len() as u32 > self.config.consensus.block_migration_depth {
+                        error!(
+                            fork_depth = old_fork_blocks.len(),
+                            new_fork_depth = new_fork_blocks.len(),
+                            fork_height,
+                            current_height = arc_block.height,
+                            "NETWORK PARTITION RECOVERY: this node was isolated from the network \
+                             and fell behind the canonical chain. Deep reorg rolling back {} \
+                             migrated blocks. Investigate peer connectivity — check firewall \
+                             rules, network routes, and gossip health between this node and \
+                             its peers.",
+                            old_fork_blocks.len() as u32
+                                - self.config.consensus.block_migration_depth,
+                        );


🧹 Nitpick | 🔵 Trivial | 💤 Low value

Use checked conversions for runtime usize → u32 casts.

Per codebase conventions, production code should use try_from/try_into for non-literal numeric casts rather than as. While overflow is unlikely in practice (would require billions of fork blocks), using checked conversion is defensive and consistent with the established policy.

♻️ Proposed fix

- if old_fork_blocks.len() as u32 > self.config.consensus.block_migration_depth { + let old_fork_len = u32::try_from(old_fork_blocks.len()).unwrap_or(u32::MAX); + if old_fork_len > self.config.consensus.block_migration_depth { error!( - fork_depth = old_fork_blocks.len(), + fork_depth = old_fork_len, new_fork_depth = new_fork_blocks.len(), fork_height, current_height = arc_block.height, "NETWORK PARTITION RECOVERY: this node was isolated from the network \ and fell behind the canonical chain. Deep reorg rolling back {} \ migrated blocks. Investigate peer connectivity — check firewall \ rules, network routes, and gossip health between this node and \ its peers.", - old_fork_blocks.len() as u32 - - self.config.consensus.block_migration_depth, + old_fork_len.saturating_sub(self.config.consensus.block_migration_depth), );

Based on learnings: "In this codebase's production Rust code, do not use as numeric casts... Prefer checked conversions."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if old_fork_blocks.len() as u32 > self.config.consensus.block_migration_depth {

error!(

fork_depth = old_fork_blocks.len(),

new_fork_depth = new_fork_blocks.len(),

fork_height,

current_height = arc_block.height,

"NETWORK PARTITION RECOVERY: this node was isolated from the network \

and fell behind the canonical chain. Deep reorg rolling back {} \

migrated blocks. Investigate peer connectivity — check firewall \

rules, network routes, and gossip health between this node and \

its peers.",

old_fork_blocks.len() as u32

- self.config.consensus.block_migration_depth,

);

let old_fork_len = u32::try_from(old_fork_blocks.len()).unwrap_or(u32::MAX);

if old_fork_len > self.config.consensus.block_migration_depth {

error!(

fork_depth = old_fork_len,

new_fork_depth = new_fork_blocks.len(),

fork_height,

current_height = arc_block.height,

"NETWORK PARTITION RECOVERY: this node was isolated from the network \

and fell behind the canonical chain. Deep reorg rolling back {} \

migrated blocks. Investigate peer connectivity — check firewall \

rules, network routes, and gossip health between this node and \

its peers.",

old_fork_len.saturating_sub(self.config.consensus.block_migration_depth),

);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/actors/src/block_tree_service.rs` around lines 685 - 698, The code currently casts usize -> u32 with `as` for old_fork_blocks.len() and new_fork_blocks.len(); replace those with checked conversions using TryFrom/TryInto (e.g. `let old_len: u32 = old_fork_blocks.len().try_into()?; let new_len: u32 = new_fork_blocks.len().try_into()?;`) and then use `old_len`/`new_len` in the error! call and the subtraction; ensure you handle the conversion error (map to a sensible error return or log and early-return) so the function’s error type or control flow remains correct; update references to `self.config.consensus.block_migration_depth` comparisons to use the converted values or convert that config value to the same type before comparing.

…ministic iteration - block_tree_service: keep network partition recovery path over master's controlled-shutdown approach - chunk_migration_service: keep stale-message guard from our branch, adopt master's deterministic block.data_ledgers iteration order

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/chain-tests/src/multi_node/partition_recovery.rs`:
- Around line 185-191: The test discards the return from post_tx_with_chunks
when posting genesis_publish_chunks so the orphaned Publish/Submit data_root
from the genesis minority fork is never checked; capture and save the result of
post_tx_with_chunks(&genesis, &genesis_signer, &genesis_publish_chunks,
DataLedger::Publish).await? into a variable (e.g., genesis_publish_tx) and add
that variable to the genesis_orphaned_txs collection alongside
genesis_one_year_tx and genesis_thirty_day_tx so the data_root from the genesis
Publish path is asserted cleared during recovery (mirror the existing
winning-fork re-index checks).
- Around line 458-478: After the two mine_block().await? calls in the
continued-operation phase, wait for the block index migration to complete before
reading genesis.get_block_index_height(): call
genesis.wait_until_block_index_height(peer_height + 2). Then read
final_index_height and proceed with the supply checks (use the same
genesis.get_block_index_height(), final_supply via
genesis.node_ctx.supply_state_guard.get(), and the existing loop over block
heights) so the assertions won't race with asynchronous BlockMigrated
processing.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 82a53460-121a-4d23-b475-fad8f49ebe01

📥 Commits

Reviewing files that changed from the base of the PR and between 242833d and 725de6f.

📒 Files selected for processing (6)

crates/actors/src/block_tree_service.rs
crates/actors/src/chunk_migration_service.rs
crates/chain-tests/src/multi_node/mod.rs
crates/chain-tests/src/multi_node/partition_recovery.rs
crates/chain/src/chain.rs
crates/p2p/src/tests/util.rs

coderabbitai · 2026-05-07T02:39:31Z

+    post_tx_with_chunks(
+        &genesis,
+        &genesis_signer,
+        &genesis_publish_chunks,
+        DataLedger::Publish,
+    )
+    .await?;


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Orphaned genesis Publish tx data_root not verified for clearing — test coverage gap.

The post_tx_with_chunks result for genesis_publish_chunks (line 185) is discarded. Because the variable is absent, genesis_orphaned_txs (lines 409–411) only asserts that genesis_one_year_tx and genesis_thirty_day_tx data_roots were cleared after recovery. The orphaned Publish (and transitively Submit) data_root written by genesis's minority fork is never checked.

The comment at lines 422–424 explains why the interval check is skipped, but there is no analogous note here. Saving the return value and including it in genesis_orphaned_txs would close the gap and make the assertion symmetrical with the winning-fork re-index check (line 427).

🐛 Proposed fix

- post_tx_with_chunks( + let genesis_publish_tx = post_tx_with_chunks( &genesis, &genesis_signer, &genesis_publish_chunks, DataLedger::Publish, ) .await?;

let genesis_orphaned_txs: Vec<(DataLedger, &DataTransaction)> = vec![ + (DataLedger::Submit, &genesis_publish_tx), (DataLedger::OneYear, &genesis_one_year_tx), (DataLedger::ThirtyDay, &genesis_thirty_day_tx), ];

Also applies to: 408-420

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/chain-tests/src/multi_node/partition_recovery.rs` around lines 185 - 191, The test discards the return from post_tx_with_chunks when posting genesis_publish_chunks so the orphaned Publish/Submit data_root from the genesis minority fork is never checked; capture and save the result of post_tx_with_chunks(&genesis, &genesis_signer, &genesis_publish_chunks, DataLedger::Publish).await? into a variable (e.g., genesis_publish_tx) and add that variable to the genesis_orphaned_txs collection alongside genesis_one_year_tx and genesis_thirty_day_tx so the data_root from the genesis Publish path is asserted cleared during recovery (mirror the existing winning-fork re-index checks).

coderabbitai · 2026-05-07T02:39:31Z

+    let final_index_height = genesis.get_block_index_height();
+    assert!(
+        final_index_height > peer_height,
+        "Block index should continue advancing after recovery, got {final_index_height}"
+    );
+
+    let final_supply = genesis
+        .node_ctx
+        .supply_state_guard
+        .as_ref()
+        .expect("supply state not set")
+        .get();
+    let mut final_expected = irys_types::U256::zero();
+    for h in 1..=final_index_height {
+        let block = genesis.get_block_by_height_from_index(h, false)?;
+        final_expected = final_expected.saturating_add(block.reward_amount);
+    }
+    assert_eq!(
+        final_supply.cumulative_emitted, final_expected,
+        "Supply state should remain consistent after continued mining"
+    );


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing wait_until_block_index_height before reading final_index_height — potential spurious failure.

Every earlier stage in this test correctly calls wait_until_block_index_height before reading or asserting on the block index (lines 137–138, 261–263, 311–313). The continued-operation phase breaks this pattern: genesis.get_block_index_height() is read at line 458 immediately after two mine_block().await? calls, with no wait for migration.

Block migration fires asynchronously (delayed by block_migration_depth confirmations). After mining blocks 10 and 11, mine_block().await? returns as soon as the chain tip advances — the BlockMigrated event for block 10 (which would push the index from 8 → 9 → 10) may not have been processed yet. If the index is still at 8 or 9 when line 458 executes:

assert!(final_index_height > peer_height) (> 9) will fail spuriously.

The supply loop for h in 1..=final_index_height will compute a smaller sum than final_supply.cumulative_emitted (which was already updated), failing the assert_eq! at line 475.

🐛 Proposed fix

genesis.mine_block().await?; peer.wait_until_height(peer_height + 2, seconds_to_wait) .await?; + // Block migration is asynchronous; wait for the index to advance before reading. + let expected_final_index = peer_height + 1; // block 10 migrated after block 11 confirms it + genesis + .wait_until_block_index_height(expected_final_index, seconds_to_wait) + .await?; let final_index_height = genesis.get_block_index_height();

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/chain-tests/src/multi_node/partition_recovery.rs` around lines 458 - 478, After the two mine_block().await? calls in the continued-operation phase, wait for the block index migration to complete before reading genesis.get_block_index_height(): call genesis.wait_until_block_index_height(peer_height + 2). Then read final_index_height and proceed with the supply checks (use the same genesis.get_block_index_height(), final_supply via genesis.node_ctx.supply_state_guard.get(), and the existing loop over block heights) so the assertions won't race with asynchronous BlockMigrated processing.

Single conflict in block_status_provider.rs: - Kept our doc comment for PartOfAPrunedFork (describes deep reorg semantics) - Took master's granular ChainState match for non-indexed blocks (block pool race fix #1417) - Inlined is_block_in_the_tree() call for deep reorg check (replaces removed local variable) Merges: MDBX metrics, VDF progress/stall detection, preval perf, efficient-sampling, block pool race fix, Axiom removal, body-pull circuit-breaker fix, metrics followup.

…ion-recovery # Conflicts: # crates/actors/src/block_migration_service.rs

DanMacDonald added 7 commits April 30, 2026 13:07

fix(error): log when a reorg is passed the migration boundary

f1bf4f4

fix: Guard against stale BlockMigrated messages from orphaned forks.

addeb12

test: network partition recovery test

3dd37a5

DanMacDonald changed the title ~~Network partition recovery: deep reorg support + integration test~~ feat: Network partition recovery: deep reorg support + integration test May 1, 2026

coderabbitai Bot reviewed May 1, 2026

View reviewed changes

Comment thread crates/actors/src/block_migration_service.rs

Comment thread crates/actors/src/chunk_migration_service.rs

Comment thread crates/chain-tests/src/multi_node/partition_recovery.rs

Comment thread crates/chain-tests/src/multi_node/partition_recovery.rs

DanMacDonald added 3 commits May 1, 2026 11:56

merge: resolve conflict with master, keep stale migration guard

722fc3e

style: fix formatting after merge

caa5fe5

coderabbitai Bot reviewed May 1, 2026

View reviewed changes

Comment thread crates/chain-tests/src/multi_node/partition_recovery_epoch_boundary.rs

DanMacDonald added 4 commits May 1, 2026 12:28

fix: clarify pledge comment in epoch boundary test

3f5ae4b

coderabbitai Bot reviewed May 3, 2026

View reviewed changes

DanMacDonald changed the title ~~feat: Network partition recovery: deep reorg support + integration test~~ feat: Network partition recovery: deep reorg support + integration tests May 3, 2026

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

DanMacDonald added 2 commits May 19, 2026 11:32

Merge remote-tracking branch 'origin/master' into dmac/network-partit…

21c90e4

…ion-recovery # Conflicts: # crates/actors/src/block_migration_service.rs

DanMacDonald merged commit b099906 into master Jun 3, 2026
18 checks passed

DanMacDonald deleted the dmac/network-partition-recovery branch June 3, 2026 14:01

This was referenced Jun 12, 2026

Limit VDF run ahead to avoid seed poisening #1447

Open

fix(vdf): limit run-ahead via reset-seed confirmation gate #1449

Open

Conversation

DanMacDonald commented May 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Protocol fixes

Block pool fix

Integration tests

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 1, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

DanMacDonald commented May 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 1, 2026 •

edited

Loading