feat: Network partition recovery: deep reorg support + integration tests#1405
Conversation
…grate their chunks map_storage_modules_to_partition_assignments() only processed Publish, Submit, and Capacity partitions — OneYear and ThirtyDay assignments existed in the epoch snapshot but were never forwarded to the StorageModuleService. on_block_migrated() only extracted Submit and Publish ledger transactions during chunk migration, so OneYear and ThirtyDay chunks were never written to storage modules.
When a reorg exceeds block_migration_depth, blocks have already been migrated to the block index with chunks assigned to storage module partitions. recover_from_network_partition() handles this by: 1. Collecting orphaned block chunk ranges, tx data_roots, and rewards 2. Clearing ChunkPathHashesByOffset and DataRootInfosByDataRoot entries for orphaned offsets/transactions only (preserving common ancestor data) 3. Marking orphaned offsets as Uninitialized so repacking can proceed 4. Truncating the block index to the fork point 5. Rolling back supply state cumulative_emitted The normal migration path then re-processes the new canonical chain.
The block pool previously rejected all blocks at already-indexed heights with different hashes as PartOfAPrunedFork. This prevented gossip-based delivery of competing fork blocks during deep reorgs, forcing callers to bypass the block pool entirely via send_full_block. The block tree already handles deep reorgs correctly via recover_from_network_partition — the block pool was the only barrier. Change block_status() to classify blocks at indexed heights with different hashes based on proximity to the latest index: - In the tree already → ProcessedButCanBeReorganized (fork in progress) - Within block_tree_depth → NotProcessed (potential deep reorg candidate) - Beyond block_tree_depth → PartOfAPrunedFork (DoS protection) This resolves the TODO: "this needs to handle migrated block reorgs".
Replace send_full_block with gossip_block_to_peers for delivering the peer's fork blocks to genesis. This exercises the real P2P code path that production nodes use when reconnecting after a network partition: gossip header → pull block body → pull ETH payload → validate → reorg. Remove the manual data tx/chunk pre-ingestion since the gossip pull mechanism handles block body (including tx headers) automatically.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds network-partition recovery: actors accept a storage-modules guard, BlockTree triggers BlockMigrationService recovery on deep reorgs, chunk migration gains staleness guards, block-status logic refines reorg classification, storage/supply/index APIs expose rollback helpers, and multi-node tests validate surgical recovery. ChangesNetwork Partition Recovery
Sequence DiagramsequenceDiagram
actor Node as Node (Genesis)
participant BT as BlockTreeService
participant BM as BlockMigrationService
participant CM as ChunkMigrationService
participant SM as StorageModule
participant BI as BlockIndex
participant SS as SupplyState
Node->>BT: SetStorageModulesGuard(guard)
BT->>BM: set_storage_modules_guard(guard)
BM->>BM: store guard
Node->>BT: Process reorg (old_fork_len > migration_depth)
BT->>BM: recover_from_network_partition(fork_height)
BM->>SM: gather orphaned ledger ranges & orphaned data_roots
BM->>SM: clear offset/index mappings for orphaned txs
BM->>SM: cut affected chunk intervals -> mark Uninitialized
SM->>SM: write_intervals_to_submodules()
BM->>BI: truncate_to_height(fork_parent)
BI->>BI: delete_block_index_range(fork_parent+1..)
BM->>SS: rollback_reward(fork_parent, reward_sum)
SS->>SS: decrement cumulative_emitted
BM-->>BT: Return success
BT->>BT: Continue reorg metrics & broadcast
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/domain/src/models/storage_module.rs (1)
1631-1641:⚠️ Potential issue | 🟠 MajorCompute intersection before subtraction to prevent underflow on overlapping ranges.
make_range_partition_relativeperforms unchecked subtraction that can underflow when callers pass ranges that extend before the storage module's range. Althoughget_overlapped_storage_modulesfilters for overlapping modules, overlap does not guarantee the input range starts within the module's bounds (e.g., range [50, 150] overlaps module [100, 200], but 50 - 100 underflows). This causes a panic in debug builds and silent wrapping in release builds.The
index_transaction_datamethod already demonstrates the correct pattern: compute the intersection before conversion. Apply the same approach here.Proposed fix
pub fn make_range_partition_relative( &self, chunk_range: LedgerChunkRange, ) -> eyre::Result<PartitionChunkRange> { let storage_module_range = self.get_storage_module_ledger_offsets()?; - let start = chunk_range.start() - storage_module_range.start(); - let end = chunk_range.end() - storage_module_range.start(); + let overlap = storage_module_range + .intersection(&chunk_range) + .ok_or_else(|| eyre::eyre!("chunk range does not overlap storage module range"))?; + let start = overlap.start() - storage_module_range.start(); + let end = overlap.end() - storage_module_range.start(); Ok(PartitionChunkRange(ii( PartitionChunkOffset::from(start), PartitionChunkOffset::from(end), ))) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/domain/src/models/storage_module.rs` around lines 1631 - 1641, The function make_range_partition_relative currently subtracts storage offsets from chunk_range directly which can underflow; change it to first compute the intersection between chunk_range and the storage module range (using the same pattern as index_transaction_data, e.g. let intersection = chunk_range.intersection(&storage_module_range) or the project’s equivalent), return an error if there is no intersection, then compute relative offsets from that intersection (let start = intersection.start() - storage_module_range.start(); let end = intersection.end() - storage_module_range.start()) and construct the PartitionChunkRange from those relative values; keep references to get_storage_module_ledger_offsets and make_range_partition_relative to locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/actors/src/block_migration_service.rs`:
- Line 163: Replace the unchecked cast "rollback_count as usize" with a checked
conversion using TryFrom/TryInto: compute a capacity via "let capacity =
usize::try_from(rollback_count)?;" (or "let capacity =
rollback_count.try_into()?;" if using std::convert::TryInto) and then call
"Vec::with_capacity(capacity)"; if the surrounding function doesn't return
Result, propagate or handle the conversion error appropriately (e.g., map it
into the function's error type and return early). This change touches the
"rollback_count" conversion and the "rollback_infos" initialization
(Vec::with_capacity).
In `@crates/actors/src/chunk_migration_service.rs`:
- Around line 169-183: The preflight canonicality check using
block_index.get_item in the chunk migration handler is vulnerable to TOCTOU with
recover_from_network_partition; add a shared serialization point between
recovery and migration (either a migration generation/epoch or a shared lock)
and validate that generation/lock after the check and immediately before any
writes so stale BlockMigrated messages cannot re-create cleared state.
Concretely: introduce a migration_epoch (or use an existing RecoveryManager)
that is incremented by recover_from_network_partition, read the epoch in the
handler (alongside the block_index.get_item check) and re-check the epoch
immediately before performing index/chunk writes (or acquire the same
Mutex/RwLock used by recovery around validation+write). Update the chunk
migration code paths (the BlockMigrated handler and any functions that mutate
indexes/chunks) to abort when the epoch changed or lock cannot be held.
In `@crates/chain-tests/src/multi_node/partition_recovery.rs`:
- Around line 405-409: The verification vector uses the wrong ledger for
peer_publish_tx; replace DataLedger::Submit with DataLedger::Publish in the
peer_unique_txs definition so the tuple for peer_publish_tx matches the ledger
it was posted to (peer_publish_tx, DataLedger::Publish), keeping the other
entries (peer_one_year_tx -> DataLedger::OneYear, peer_thirty_day_tx ->
DataLedger::ThirtyDay) unchanged.
- Around line 342-346: The tuple list shared_txs uses the wrong ledger for
shared_publish_tx (it currently pairs shared_publish_tx with
DataLedger::Submit); update the tuple so shared_publish_tx is paired with
DataLedger::Publish (i.e., change the first tuple to (DataLedger::Publish,
&shared_publish_tx)) so the verification code (using shared_txs) queries the
same ledger that created the transaction.
---
Outside diff comments:
In `@crates/domain/src/models/storage_module.rs`:
- Around line 1631-1641: The function make_range_partition_relative currently
subtracts storage offsets from chunk_range directly which can underflow; change
it to first compute the intersection between chunk_range and the storage module
range (using the same pattern as index_transaction_data, e.g. let intersection =
chunk_range.intersection(&storage_module_range) or the project’s equivalent),
return an error if there is no intersection, then compute relative offsets from
that intersection (let start = intersection.start() -
storage_module_range.start(); let end = intersection.end() -
storage_module_range.start()) and construct the PartitionChunkRange from those
relative values; keep references to get_storage_module_ledger_offsets and
make_range_partition_relative to locate the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 921cee82-83fd-4bbb-b302-904d181b8b15
📒 Files selected for processing (13)
crates/actors/src/block_migration_service.rscrates/actors/src/block_tree_service.rscrates/actors/src/chunk_migration_service.rscrates/chain-tests/src/multi_node/mod.rscrates/chain-tests/src/multi_node/partition_recovery.rscrates/chain/src/chain.rscrates/domain/src/models/block_index.rscrates/domain/src/models/storage_module.rscrates/domain/src/models/supply_state.rscrates/domain/src/snapshots/epoch_snapshot/epoch_snapshot.rscrates/p2p/src/block_status_provider.rscrates/p2p/src/tests/block_pool/mod.rscrates/p2p/src/tests/util.rs
Adds heavy4_partition_recovery_epoch_boundary which verifies that partition assignments are correctly rolled back when a reorg crosses an epoch boundary where each fork processed different pledge commitments. Tracks the specific affected storage module through the reorg and asserts it gets reassigned to a canonical partition with Entropy intervals from re-packing.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/chain-tests/src/multi_node/partition_recovery_epoch_boundary.rs`:
- Around line 26-29: Change this test to start from the shared testing builders:
call NodeConfig::testing() to create genesis_config (instead of
NodeConfig::testing_with_epochs) and then set the epoch length on its consensus
config via consensus = ConsensusConfig::testing() with the epoch length override
(num_blocks_in_epoch) and set chunk_size explicitly on
genesis_config.consensus.get_mut().chunk_size = 32; in short, replace the
testing_with_epochs path by creating genesis_config = NodeConfig::testing(),
build or derive a ConsensusConfig::testing() with the desired epoch length,
assign/merge it into genesis_config.consensus, and then set chunk_size on
genesis_config.consensus.get_mut().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8a8c8ddf-50e3-4fdf-a723-0ec66fd70321
📒 Files selected for processing (2)
crates/chain-tests/src/multi_node/mod.rscrates/chain-tests/src/multi_node/partition_recovery_epoch_boundary.rs
Operators and monitoring agents need a clear signal when a node recovers from a network partition. This adds an ERROR log with structured fields (fork_depth, new_fork_depth, fork_height, current_height) and actionable guidance to investigate peer connectivity.
Two new tests based on Codex security review findings: - partition_recovery_multi_epoch: reorg crossing 2 epoch boundaries, verifies epoch snapshot correctness, partition assignment rollback across multiple epochs, and continued mining after reorg. - partition_recovery_deep: deep reorg triggering recover_from_network_partition with block_migration_depth=1, verifies block index consistency, supply state rollback/re-migration, chain linkage across recovery boundary, and continued block production.
Remove the redundant partition_recovery_deep test and add its unique checks (block index hash verification, chain linkage across recovery boundary, continued mining + supply state after recovery) to the existing heavy4_network_partition_recovery test.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/actors/src/block_tree_service.rs`:
- Around line 685-698: The code currently casts usize -> u32 with `as` for
old_fork_blocks.len() and new_fork_blocks.len(); replace those with checked
conversions using TryFrom/TryInto (e.g. `let old_len: u32 =
old_fork_blocks.len().try_into()?; let new_len: u32 =
new_fork_blocks.len().try_into()?;`) and then use `old_len`/`new_len` in the
error! call and the subtraction; ensure you handle the conversion error (map to
a sensible error return or log and early-return) so the function’s error type or
control flow remains correct; update references to
`self.config.consensus.block_migration_depth` comparisons to use the converted
values or convert that config value to the same type before comparing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0f69e9db-93a6-45a5-9420-e4a1842d8877
📒 Files selected for processing (5)
crates/actors/src/block_tree_service.rscrates/chain-tests/src/multi_node/mod.rscrates/chain-tests/src/multi_node/partition_recovery_deep.rscrates/chain-tests/src/multi_node/partition_recovery_epoch_boundary.rscrates/chain-tests/src/multi_node/partition_recovery_multi_epoch.rs
| if old_fork_blocks.len() as u32 > self.config.consensus.block_migration_depth { | ||
| error!( | ||
| fork_depth = old_fork_blocks.len(), | ||
| new_fork_depth = new_fork_blocks.len(), | ||
| fork_height, | ||
| current_height = arc_block.height, | ||
| "NETWORK PARTITION RECOVERY: this node was isolated from the network \ | ||
| and fell behind the canonical chain. Deep reorg rolling back {} \ | ||
| migrated blocks. Investigate peer connectivity — check firewall \ | ||
| rules, network routes, and gossip health between this node and \ | ||
| its peers.", | ||
| old_fork_blocks.len() as u32 | ||
| - self.config.consensus.block_migration_depth, | ||
| ); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Use checked conversions for runtime usize → u32 casts.
Per codebase conventions, production code should use try_from/try_into for non-literal numeric casts rather than as. While overflow is unlikely in practice (would require billions of fork blocks), using checked conversion is defensive and consistent with the established policy.
♻️ Proposed fix
- if old_fork_blocks.len() as u32 > self.config.consensus.block_migration_depth {
+ let old_fork_len = u32::try_from(old_fork_blocks.len()).unwrap_or(u32::MAX);
+ if old_fork_len > self.config.consensus.block_migration_depth {
error!(
- fork_depth = old_fork_blocks.len(),
+ fork_depth = old_fork_len,
new_fork_depth = new_fork_blocks.len(),
fork_height,
current_height = arc_block.height,
"NETWORK PARTITION RECOVERY: this node was isolated from the network \
and fell behind the canonical chain. Deep reorg rolling back {} \
migrated blocks. Investigate peer connectivity — check firewall \
rules, network routes, and gossip health between this node and \
its peers.",
- old_fork_blocks.len() as u32
- - self.config.consensus.block_migration_depth,
+ old_fork_len.saturating_sub(self.config.consensus.block_migration_depth),
);Based on learnings: "In this codebase's production Rust code, do not use as numeric casts... Prefer checked conversions."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if old_fork_blocks.len() as u32 > self.config.consensus.block_migration_depth { | |
| error!( | |
| fork_depth = old_fork_blocks.len(), | |
| new_fork_depth = new_fork_blocks.len(), | |
| fork_height, | |
| current_height = arc_block.height, | |
| "NETWORK PARTITION RECOVERY: this node was isolated from the network \ | |
| and fell behind the canonical chain. Deep reorg rolling back {} \ | |
| migrated blocks. Investigate peer connectivity — check firewall \ | |
| rules, network routes, and gossip health between this node and \ | |
| its peers.", | |
| old_fork_blocks.len() as u32 | |
| - self.config.consensus.block_migration_depth, | |
| ); | |
| let old_fork_len = u32::try_from(old_fork_blocks.len()).unwrap_or(u32::MAX); | |
| if old_fork_len > self.config.consensus.block_migration_depth { | |
| error!( | |
| fork_depth = old_fork_len, | |
| new_fork_depth = new_fork_blocks.len(), | |
| fork_height, | |
| current_height = arc_block.height, | |
| "NETWORK PARTITION RECOVERY: this node was isolated from the network \ | |
| and fell behind the canonical chain. Deep reorg rolling back {} \ | |
| migrated blocks. Investigate peer connectivity — check firewall \ | |
| rules, network routes, and gossip health between this node and \ | |
| its peers.", | |
| old_fork_len.saturating_sub(self.config.consensus.block_migration_depth), | |
| ); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/actors/src/block_tree_service.rs` around lines 685 - 698, The code
currently casts usize -> u32 with `as` for old_fork_blocks.len() and
new_fork_blocks.len(); replace those with checked conversions using
TryFrom/TryInto (e.g. `let old_len: u32 = old_fork_blocks.len().try_into()?; let
new_len: u32 = new_fork_blocks.len().try_into()?;`) and then use
`old_len`/`new_len` in the error! call and the subtraction; ensure you handle
the conversion error (map to a sensible error return or log and early-return) so
the function’s error type or control flow remains correct; update references to
`self.config.consensus.block_migration_depth` comparisons to use the converted
values or convert that config value to the same type before comparing.
…ministic iteration - block_tree_service: keep network partition recovery path over master's controlled-shutdown approach - chunk_migration_service: keep stale-message guard from our branch, adopt master's deterministic block.data_ledgers iteration order
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/chain-tests/src/multi_node/partition_recovery.rs`:
- Around line 185-191: The test discards the return from post_tx_with_chunks
when posting genesis_publish_chunks so the orphaned Publish/Submit data_root
from the genesis minority fork is never checked; capture and save the result of
post_tx_with_chunks(&genesis, &genesis_signer, &genesis_publish_chunks,
DataLedger::Publish).await? into a variable (e.g., genesis_publish_tx) and add
that variable to the genesis_orphaned_txs collection alongside
genesis_one_year_tx and genesis_thirty_day_tx so the data_root from the genesis
Publish path is asserted cleared during recovery (mirror the existing
winning-fork re-index checks).
- Around line 458-478: After the two mine_block().await? calls in the
continued-operation phase, wait for the block index migration to complete before
reading genesis.get_block_index_height(): call
genesis.wait_until_block_index_height(peer_height + 2). Then read
final_index_height and proceed with the supply checks (use the same
genesis.get_block_index_height(), final_supply via
genesis.node_ctx.supply_state_guard.get(), and the existing loop over block
heights) so the assertions won't race with asynchronous BlockMigrated
processing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 82a53460-121a-4d23-b475-fad8f49ebe01
📒 Files selected for processing (6)
crates/actors/src/block_tree_service.rscrates/actors/src/chunk_migration_service.rscrates/chain-tests/src/multi_node/mod.rscrates/chain-tests/src/multi_node/partition_recovery.rscrates/chain/src/chain.rscrates/p2p/src/tests/util.rs
| post_tx_with_chunks( | ||
| &genesis, | ||
| &genesis_signer, | ||
| &genesis_publish_chunks, | ||
| DataLedger::Publish, | ||
| ) | ||
| .await?; |
There was a problem hiding this comment.
Orphaned genesis Publish tx data_root not verified for clearing — test coverage gap.
The post_tx_with_chunks result for genesis_publish_chunks (line 185) is discarded. Because the variable is absent, genesis_orphaned_txs (lines 409–411) only asserts that genesis_one_year_tx and genesis_thirty_day_tx data_roots were cleared after recovery. The orphaned Publish (and transitively Submit) data_root written by genesis's minority fork is never checked.
The comment at lines 422–424 explains why the interval check is skipped, but there is no analogous note here. Saving the return value and including it in genesis_orphaned_txs would close the gap and make the assertion symmetrical with the winning-fork re-index check (line 427).
🐛 Proposed fix
- post_tx_with_chunks(
+ let genesis_publish_tx = post_tx_with_chunks(
&genesis,
&genesis_signer,
&genesis_publish_chunks,
DataLedger::Publish,
)
.await?; let genesis_orphaned_txs: Vec<(DataLedger, &DataTransaction)> = vec![
+ (DataLedger::Submit, &genesis_publish_tx),
(DataLedger::OneYear, &genesis_one_year_tx),
(DataLedger::ThirtyDay, &genesis_thirty_day_tx),
];Also applies to: 408-420
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/chain-tests/src/multi_node/partition_recovery.rs` around lines 185 -
191, The test discards the return from post_tx_with_chunks when posting
genesis_publish_chunks so the orphaned Publish/Submit data_root from the genesis
minority fork is never checked; capture and save the result of
post_tx_with_chunks(&genesis, &genesis_signer, &genesis_publish_chunks,
DataLedger::Publish).await? into a variable (e.g., genesis_publish_tx) and add
that variable to the genesis_orphaned_txs collection alongside
genesis_one_year_tx and genesis_thirty_day_tx so the data_root from the genesis
Publish path is asserted cleared during recovery (mirror the existing
winning-fork re-index checks).
| let final_index_height = genesis.get_block_index_height(); | ||
| assert!( | ||
| final_index_height > peer_height, | ||
| "Block index should continue advancing after recovery, got {final_index_height}" | ||
| ); | ||
|
|
||
| let final_supply = genesis | ||
| .node_ctx | ||
| .supply_state_guard | ||
| .as_ref() | ||
| .expect("supply state not set") | ||
| .get(); | ||
| let mut final_expected = irys_types::U256::zero(); | ||
| for h in 1..=final_index_height { | ||
| let block = genesis.get_block_by_height_from_index(h, false)?; | ||
| final_expected = final_expected.saturating_add(block.reward_amount); | ||
| } | ||
| assert_eq!( | ||
| final_supply.cumulative_emitted, final_expected, | ||
| "Supply state should remain consistent after continued mining" | ||
| ); |
There was a problem hiding this comment.
Missing wait_until_block_index_height before reading final_index_height — potential spurious failure.
Every earlier stage in this test correctly calls wait_until_block_index_height before reading or asserting on the block index (lines 137–138, 261–263, 311–313). The continued-operation phase breaks this pattern: genesis.get_block_index_height() is read at line 458 immediately after two mine_block().await? calls, with no wait for migration.
Block migration fires asynchronously (delayed by block_migration_depth confirmations). After mining blocks 10 and 11, mine_block().await? returns as soon as the chain tip advances — the BlockMigrated event for block 10 (which would push the index from 8 → 9 → 10) may not have been processed yet. If the index is still at 8 or 9 when line 458 executes:
assert!(final_index_height > peer_height)(> 9) will fail spuriously.- The supply loop
for h in 1..=final_index_heightwill compute a smaller sum thanfinal_supply.cumulative_emitted(which was already updated), failing theassert_eq!at line 475.
🐛 Proposed fix
genesis.mine_block().await?;
peer.wait_until_height(peer_height + 2, seconds_to_wait)
.await?;
+ // Block migration is asynchronous; wait for the index to advance before reading.
+ let expected_final_index = peer_height + 1; // block 10 migrated after block 11 confirms it
+ genesis
+ .wait_until_block_index_height(expected_final_index, seconds_to_wait)
+ .await?;
let final_index_height = genesis.get_block_index_height();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/chain-tests/src/multi_node/partition_recovery.rs` around lines 458 -
478, After the two mine_block().await? calls in the continued-operation phase,
wait for the block index migration to complete before reading
genesis.get_block_index_height(): call
genesis.wait_until_block_index_height(peer_height + 2). Then read
final_index_height and proceed with the supply checks (use the same
genesis.get_block_index_height(), final_supply via
genesis.node_ctx.supply_state_guard.get(), and the existing loop over block
heights) so the assertions won't race with asynchronous BlockMigrated
processing.
Single conflict in block_status_provider.rs: - Kept our doc comment for PartOfAPrunedFork (describes deep reorg semantics) - Took master's granular ChainState match for non-indexed blocks (block pool race fix #1417) - Inlined is_block_in_the_tree() call for deep reorg check (replaces removed local variable) Merges: MDBX metrics, VDF progress/stall detection, preval perf, efficient-sampling, block pool race fix, Axiom removal, body-pull circuit-breaker fix, metrics followup.
…ion-recovery # Conflicts: # crates/actors/src/block_migration_service.rs
Summary
~295 lines of protocol fixes, ~1,470 lines of test code (1,765 total)
Protocol fixes
recover_from_network_partition: When a reorg exceedsblock_migration_depth, rolls back migrated state — clears orphaned storage module indexes, truncates the block index to the fork point, rolls back supply state — then re-migrates the winning forkmap_storage_modules_to_partition_assignments()now forwards OneYear and ThirtyDay partition assignments to StorageModuleService, andon_block_migrated()extracts their transactions for chunk migrationon_block_migratedchecks block-index membership before processing, preventing fire-and-forgetBlockMigratedmessages from orphaned forks from re-writing cleared data after recoveryBlock pool fix
PartOfAPrunedFork, preventing gossip-based delivery during deep reorgs. Now blocks withinblock_tree_depthof the latest index are allowed through as reorg candidates, letting the block tree (which is authoritative) decide viarecover_from_network_partitionIntegration tests
heavy4_network_partition_recovery— verifies the full recovery flow using real gossip delivery: builds a shared base chain, creates divergent forks, gossips the longer fork, and asserts shared data is preserved, orphaned data is cleared, winning fork data is re-indexed, supply state is correct, block index contains winning fork blocks with valid chain linkage across the recovery boundary, and the node continues mining with consistent supply state after recoveryheavy4_partition_recovery_epoch_boundary— verifies partition assignment rollback when a reorg crosses an epoch boundary with different pledge commitments on each fork. Two nodes each post a unique pledge on their fork and mine through an epoch boundary. After reorg, verifies commitment state, assignment counts (6→5, 1→2), storage module reassignment, and Entropy intervals proving re-pack executedheavy4_partition_recovery_multi_epoch— exercises a reorg crossing 2 epoch boundaries (epochs at heights 4 and 6). Each node posts 2 pledges across both epochs. After reorg, verifies epoch_height==6 (final epoch on winning fork), both genesis fork pledges absent, both peer fork pledges present with partition_hashes, assignment counts revert (7→5, 1→3), and system continues mining through the next epoch boundaryEach test has a distinct focus with minimal overlap:
partition_recoveryowns the data layer and deep recovery path,epoch_boundaryowns storage module re-packing verification, andmulti_epochowns multi-epoch snapshot correctness.Test plan
cargo nextest run -p irys-chain-tests heavy4_network_partition_recovery— passes (~27s)cargo nextest run -p irys-chain-tests heavy4_partition_recovery_epoch_boundary— passes (~20s)cargo nextest run -p irys-chain-tests heavy4_partition_recovery_multi_epoch— passes (~30s)cargo fmtandcargo clippyclean🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Tests