Fix/assigned miners determinism#1423
Draft
JesseTheRobot wants to merge 30 commits into
Draft
Conversation
Adds find_canonical_ledger_range() for finding the Submit-ledger
chunk range a tx contributed to its confirming canonical block,
without consulting CachedDataRoots.
Two-stage lookup:
1. Migrated path: IrysDataTxMetadata.included_height +
MigratedBlockHashes (O(1) DB read).
2. Pre-migration fallback: walk block_tree within
block_migration_depth, filtered to ChainState::Onchain.
Foundation for treating CachedDataRoots.block_set as a non-trusted
hint rather than a canonical-inclusion oracle.
Replaces the CachedDataRoots.block_set walk in get_assigned_ingress_proofs with a single call to tx_inclusion::find_canonical_ledger_range. The function now takes parent_height and looks up the canonical Submit-ledger range for the tx directly, instead of iterating block_set and calling get_ledger_range on each entry. block_set is append-only across reorgs — orphaned hashes are retained even after the blocks they point to are purged from the block_tree and DB. That is the validation-side root cause of the 2026-05-15 devnet self-fork: get_ledger_range returned None on an orphaned hash, surfacing as BlockBoundsLookupError on an otherwise-valid Publish promotion. Validation must not depend on block_set. Removes the now-unused get_ledger_range and get_block_by_hash helpers and the cached_data_root_by_data_root import. Updates tx_selector to pass current_height to the renamed signature. Adds a regression test (assigned_ingress_proofs_uses_canonical_tx_metadata) that verifies validation succeeds for a canonically-confirmed tx without reading CachedDataRoots.
Updates try_generate_ingress_proof_for_root to verify canonical tx confirmation via tx_inclusion::find_canonical_ledger_range as the trust root. block_set.is_empty() remains as a cheap early-return hint that short-circuits the helper lookup when no BlockConfirmed event has ever touched the data_root. Documents block_set as a hint-only field on CachedDataRoot — it is append-only across reorgs and is NOT fork-tolerant, so consumers must verify canonical truth via the helper rather than trusting block_set as a canonical-inclusion oracle. Updates the comment in cache_data_root and the chunk-ingress notification in mempool lifecycle to match the new framing.
Replaces the CachedDataRoots.block_set walk in prune_data_root_cache with a lookup over txid_set against IrysDataTxMetadata.included_height. Migrated tx metadata gives a stable canonical inclusion height that survives reorgs, while block_set is append-only and retains orphan hashes. No block_tree fallback is needed: pre-migration entries are still gated by expiry_height (set at cache_data_root time) and by the local-proof exemption further below.
The early-return on empty `block_set` was treating absence as proof of non-canonicity, but `block_set` is only populated by `BlockConfirmed` events. After a node restart where the confirmation was missed, or before the `TryGenerateProofsForConfirmedRoots` notification fires, a data_root can be canonically confirmed with an empty `block_set` — and the short-circuit would permanently skip ingress proof generation. Drop the hint and always run `tx_inclusion::find_canonical_ledger_range` as the trust root. The extra DB read is cheap relative to the correctness guarantee, and matches the branch's stated direction of not trusting `block_set` for canonical claims.
The prior comment claimed pre-migration entries were "already covered by expiry_height", but `cache_data_root` *clears* expiry_height when a block_header is provided. So a confirmed-but-not-yet-migrated entry has neither expiry_height nor included_height and lands in the conservative `(None, None)` skip arm. No behavior change: the entry becomes prunable once migration writes included_height (within ~block_migration_depth blocks), and the local-proof exemption further below covers the most important case. Just makes the rationale match what the code actually does so future maintainers aren't misled.
…aRoot.block_set Previously block_set was populated only by mempool's `on_block_confirmed`. That left windows where the set could drift from the canonical chain — missed events, mid-loop crashes, cold-start replay, DB-write errors the handler logs and skips — and it never removed hashes for blocks that were subsequently reorg'd out. Move authoritative block_set maintenance into `BlockMigrationService::persist_metadata` so it commits atomically with the canonical metadata write (IrysDataTxMetadata.included_height + MigratedBlockHashes): - Phase 1 augmentation: scrub the orphaned block_hash from any CachedDataRoot.block_set referencing a block in `blocks_to_clear`. - Phase 3 (new): append the new tip's block_hash to any existing CDR row for Submit-ledger txs in `blocks_to_confirm`. Update-only — no fabricating cache entries for data_roots the node never tracked. Two new helpers in irys-database back these phases: `update_data_root_block_set` (idempotent append + clear expiry_height) and `remove_data_root_block_set_entry`. `cache_data_root` is unchanged but is now an idempotent follow-up rather than the source of truth. With block_set now authoritative, revert chunk_ingress to the cheap `cdr.block_set.is_empty()` gate (the previous per-txid `tx_inclusion::find_canonical_ledger_range` call was O(N x DB read) on every chunk arrival). `tx_inclusion` stays as the canonical lookup for validation and cache pruning where range/slot truth is needed. Doc updates across CachedDataRoot, cache_data_root, on_block_confirmed, and the assigned_ingress_proofs regression test reflect the new authoritative-writer + idempotent-follow-up model.
The 100s value (Duration::from_millis(100000)) is far longer than any realistic test wait and would mask timeout-dependent behaviour around the circuit breaker. Drop both stubs to 5s.
Adds three counters and enriches log call-sites with fields that were
already in scope but not being emitted. Motivated by the 2026-05-15
devnet fork postmortem: the underlying failure modes were all silent or
correlation-only at the time they happened.
New metrics:
- irys.cache.cached_data_root_evicted_total
- irys.block.pre_validation_failed_total{reason}
- irys.chain_sync.block_rejected_total{reason}
Enriched logs:
- cached_data_root.write / .write_failed (lifecycle.rs + data_txs.rs):
structured data_root, tx.id, source, incoming_block fields.
- cached_data_root.evict (cache_service.rs): block_set at eviction,
inclusion height, expiry height, has_local_proof (variable, so a
future refactor that breaks the early-return invariant shows up).
- tx_selector.assigned_proofs_lookup_failed: tx.id, data_root,
current_height, error.
- block validation failed: %reason from new PreValidationError /
ValidationError::metric_reason() helpers (bounded cardinality).
- block_tree.transition: mark_tip / validated-as-Fork / remove_block
state changes are now explicit instead of inferrable only from
mark_tip's absence.
- Failed to remove block from cache: enriched with height /
prev_state / reason that were already computed but discarded.
- chain_sync.pull_and_process_block_failed: rejection_reason tag
derived from existing err variant.
`compute_submit_range` short-circuited on `total == 0` before running the `total < prev_total` regression check, so a `prev_total > 0, total = 0` header pair (data corruption) silently returned `Ok(None)` instead of `Err`. The pre-consolidation `get_ledger_range` ordered these correctly; the refactor inverted them. Drop the early return — every reachable case is already covered: zero/zero falls into the `total == prev_total` arm, regression falls into the `total < prev_total` arm, and the existing `total - 1` subtraction stays safe because we only reach it when `total > prev_total >= 0`. Includes a regression test that constructs the exact corruption shape (`prev_total = 100, total = 0`) and asserts `Err`.
The existing `assigned_ingress_proofs_uses_canonical_tx_metadata` and sibling `corrupt_cdr_does_not_affect_assigned_ingress_proofs` both pass an empty proof slice, so they return early at the `find_canonical_ledger_range` lookup and never enter the per-proof loop where slot ranges are intersected against the canonical block range. That loop is where the branch's behavioral change actually lives (single canonical range vs. the historical block_set walk). Add `assigned_ingress_proofs_classifies_intersecting_proof`: reuses the same h0/h1 canonical setup so the helper resolves `block_range = [10, 24]`, registers one signer to Submit slot 1 (chunks `[10, 20)` under `ConsensusConfig::testing().num_chunks_in_partition = 10`, so the slot range intersects), feeds one `generate_ingress_proof`-signed proof for the canonical data_root, and asserts the proof is classified as assigned with `miners == 1`. Regression guard against future changes to the intersection logic and slot-address counting.
`find_canonical_ledger_range` had three sites that silently degraded on missing canonical header data: 1. Migrated path: `MigratedBlockHashes[h] = hash` followed by a missing `IrysBlockHeaders[hash]` returned `Ok(None)`. The two tables disagree — surface as Err. 2. Migrated path: a non-genesis block whose `previous_block_hash` missed in `IrysBlockHeaders` fell through to `compute_submit_range(&block, None)`, which uses `prev_total = 0` and silently produces the wrong range `[0, total-1]` instead of `[prev_total, total-1]`. Treat as corruption. 3. Block-tree path: same prev-lookup hole when neither tree nor DB has the parent for a non-genesis block. Same silent-wrong-range outcome; same fix. Also reworks the pre-migration block-tree walk to anchor its migration window to `max_height` rather than the chain tip. The old tip-anchored slice silently dropped canonical confirmations whenever `max_height` was far enough below the tip that the slice contained only blocks with `height > max_height` — which is exactly the situation reorg-replay and historical-height validation hit. Errors now propagate to `PreValidationError::BlockBoundsLookupError`, which already has a `reason` tag and counter wired up. Three new regression tests cover the inconsistency shapes and the max_height-anchored window.
The block-tree fallback uses `rposition` to anchor its scan window to max_height; if max_height precedes every canonical entry that short-circuits to Ok(None). In production this is reachable when the canonical cache has rolled past the requested height (tip thousands of blocks ahead, cache holds only the last block_tree_depth). Add a focused unit test that constructs the scenario by seating the genesis at height 5 (no tx_ids so the seal check passes) with the tx_id on h6, then queries max_height=3. Verifies the helper returns Ok(None) without panicking on the slice bounds. Guards against a future refactor that drops the rposition None short-circuit.
…tate Under the lifetime model, a CachedDataRoot is bounded by either expiry_height (pre-confirmation, set by cache_data_root_with_expiry) or the canonical included_height of its txid_set (post-confirmation, set atomically with the tip change in BlockMigrationService::persist_metadata). The local-proof exemption in prune_data_root_cache extends the post-confirmation bound while a local ingress proof exists. Reaching the (None, None) arm — no inclusion height for any tx in txid_set AND no expiry_height — means the entry has lost both bounds. Reachable only when a reorg cleared included_height for an orphaned tx, expiry_height had been cleared at the now-orphaned confirmation, and the mempool re-anchor path did not restore expiry (e.g. orphan resubmission failed). Such entries should not exist. Previously the prune walk skipped them indefinitely. Now they are treated as eviction candidates, with the local-proof exemption preserved so entries carrying a useful commitment survive regardless of bound state. Net change: anomalous CDRs are actively reaped instead of accumulating. Additionally, surface anomalous writes at the source: cache_data_root and remove_data_root_block_set_entry now warn! when the post-write CDR has block_set empty and expiry_height None, so operators can detect the producer paths instead of only seeing eventual prune evictions. Updates the existing does_not_prune_unconfirmed_data_roots fixture to set expiry_height (mirroring cache_data_root_with_expiry) so it tests the realistic pre-confirmation state instead of the anomalous one. Adds prunes_anomalous_none_none_state and does_not_prune_anomalous_none_none_state_with_local_proof covering the new behaviour.
ValidationError::metric_reason() returned a stable snake_case tag for
every variant, but block_tree_service only fed pre-validation reasons
into irys.block.pre_validation_failed_total — non-pre-validation tags
were computed and used as the %reason log field but never reached a
counter. Asymmetric and a footgun: a future maintainer adding a new
ValidationError variant would reasonably assume the counter records
their tag.
Add a parallel counter irys.block.validation_failed_total{reason}
for the post-pre-validation failure modes (VDF, EL, shadow tx,
commitment, etc.) and route every ValidationError to exactly one
counter via is_pre_validation(). Reason taxonomy is shared.
Doc updates on PreValidationError::metric_reason,
ValidationError::metric_reason, and is_pre_validation reflect the
two-counter routing.
Make the rationale explicit: the conversion is infallible in practice (usize is at least u32 on every supported target), but kept as try_from + expect rather than `as usize` so a future widening of block_migration_depth to u64 surfaces here at the call site instead of truncating silently.
Direct DB-level tests pin the idempotency / no-op / expiry-handling
semantics of the two helpers `BlockMigrationService::persist_metadata`
relies on:
- update_data_root_block_set:
- returns false when CDR missing (must not fabricate)
- appends hash + clears expiry when both apply
- is idempotent when hash already present and expiry already cleared
- still clears expiry when hash already present
- remove_data_root_block_set_entry:
- returns false when CDR missing
- returns false when hash absent from block_set
- removes hash and leaves expiry_height untouched per docblock
- leaves CDR in (block_set=[], expiry_height=None) when that's the
natural outcome — the warn! at the call site is the operator
signal; the helper still completes successfully
The corruption case (`total < prev_total`) was already covered. Add the three happy paths so the arithmetic regresses cheaply: - genesis (prev = None, prev_total = 0, total > 0) → [0, total - 1] - intermediate (total > prev_total, both > 0) → [prev_total, total - 1] - zero delta (total == prev_total) → None - zero/zero edge → None (regression check must let this through to the equality arm rather than erroring on it) Previously these were only exercised indirectly through the e2e `migrated_tx_returns_canonical_range` / `pre_migration_tx_returns_range_via_block_tree` tests. Direct cases isolate the arithmetic so a regression surfaces at the unit level.
Focused unit tests for the authoritative-writer entry point. The function only touches self.db, so the other service fields (block_index_guard, cache, chunk_migration_sender, supply_state) are satisfied with cheap placeholders. Covers: - Phase 3 normal advance: existing CDR for a Submit tx in blocks_to_confirm gets the tip's block_hash appended and expiry_height cleared. - Phase 3 update-only contract: a Submit tx in blocks_to_confirm with no corresponding CDR row produces no CDR write. - Phase 1 scrub on reorg: existing CDR carrying the orphaned block's hash has that hash removed; expiry_height left untouched per remove_data_root_block_set_entry's docblock. - Phase 1 before Phase 3 ordering: same Submit tx in both an orphaned block and a new canonical block — Phase 1 scrubs the orphan hash first, Phase 3 appends the new canonical hash second. Final state has only the new hash. Guards the ordering invariant. Previously the persist_metadata changes (`c8e79e48a`) were only covered indirectly via reorg system tests.
…ers only Phase 1 (orphan clear) previously called batch_clear_data_tx_metadata across all ledger tx_ids, destroying the Submit-block included_height when only the Publish-promotion was being orphaned. Now uses per-ledger semantics: - Term ledgers (Submit / OneYear / ThirtyDay): batch_clear_data_tx_included_height - Publish ledger: batch_clear_data_tx_promoted_height (existing) Phase 2 (persist_metadata) and persist_block both called batch_set_data_tx_included_height for every ledger, silently overwriting the Submit-block height when processing a Publish block. Now gated: - Publish: set promoted_height only - Term ledgers: set included_height only Adds clear_data_tx_included_height / batch_clear_data_tx_included_height helpers to irys-database, mirroring the existing promoted_height variants. Tests added: - submit_to_publish_promotion_preserves_included_height - oneyear_direct_inclusion_sets_included_height - thirtyday_direct_inclusion_sets_included_height - phase1_orphan_publish_only_preserves_submit_included_height - phase1_orphan_submit_clears_included_height_and_deletes_row - phase1_orphan_oneyear_clears_included_height - phase1_orphan_thirtyday_clears_included_height - clear_included_height_only_set_deletes_row (db_index) - clear_included_height_preserves_promoted_height (db_index) - clear_included_height_missing_row_is_noop (db_index) - batch_clear_included_height_happy_path (db_index)
…_block_confirmed_updates apply_block_confirmed_updates accepted `submit_and_publish_txids` and unconditionally set `included_height` on every tx found in that slice, including Publish-promoted txs. This overwrote the Submit-block height recorded at first confirmation with the Publish-block height. Fix: rename the parameter to `term_ledger_txids` and update the call-site in handle_block_confirmed_message to build the slice from Submit + OneYear + ThirtyDay only (excluding Publish). Publish txids continue to flow through the existing phase-3 path which sets `promoted_height` only. Also removes the misleading `overwrite = true` debug field from the log. Tests added: - publish_confirmation_does_not_overwrite_submit_included_height - oneyear_direct_confirmation_sets_included_height - thirtyday_direct_confirmation_sets_included_height - publish_only_tx_gets_promoted_height_not_included_height - success_path_applies_all_three_update_phases_atomically (updated)
…bug_asserts P1-1: Swap find_canonical_ledger_range to prefer block_tree over migrated DB. The tree is the live source of truth; DB is the long-term backstop. Tree-first eliminates timing edge cases at the migration boundary and is more defensive against future DB metadata bugs. P1-2: Add Submit-membership guard in lookup_via_migrated_metadata. After P0-1, included_height for OneYear/ThirtyDay txs points at their own term-ledger block. If a future caller passes such a tx_id, compute_submit_range would return a wrong Submit delta. Mirror the guard already present in lookup_via_block_tree. P3-1: Two debug_asserts for invariant documentation: - block_migration_depth > 0 in lookup_via_block_tree - prev.is_some() || height == 0 in compute_submit_range Tests: tree_wins_over_corrupt_db_entry (P1-1), migrated_oneyear_only_tx_returns_none, migrated_thirtyday_only_tx_returns_none, migrated_submit_tx_returns_correct_range_after_guard (P1-2).
…gress_proofs
CONSENSUS RULE CHANGE — must be coordinated with a network upgrade.
Old semantics: assigned_miners was set inside the per-proof loop, once per
intersecting slot found by iterating `slot_ranges: HashMap<usize, LedgerChunkRange>`
(non-deterministic RandomState). When a miner held multiple intersecting slots the
value written was whichever slot the HashMap yielded first. The value was also
re-overwritten on each proof iteration ("last write wins"), so the final result
depended on both proof ordering and HashMap randomness — a consensus-fork vector.
New semantics: assigned_miners = count of unique miner addresses across ALL
Submit-ledger slots whose chunk range intersects block_range. This is a pure
function of (block_range, epoch_snapshot) computed once, before the per-proof loop.
The per-proof loop now only classifies proofs as assigned-or-not.
Why the new semantics is correct:
1. Deterministic: same result on every node regardless of HashMap key order or
proof submission order.
2. Matches the variable name and the intent of the caller (the
expected_assigned_proofs clamp): "how many distinct miners are eligible to
contribute an assigned proof?" is a property of the epoch, not the proof set.
3. Robust: if the "same miner never assigned to the same slot twice" invariant
is ever violated, we still get a correct unique-address count.
When block_range spans multiple slots whose miner sets overlap, the new value is
>= the old value (strictly larger if any miner covers more than one intersecting
slot). This is a consensus rule change.
Also subsumes the pre-existing "last write wins" latent bug in the old proof loop
(the rewrite naturally eliminates it).
Tests added:
- assigned_miners_multi_slot_unique_address_count (regression: same miner in two
slots, must equal 1 unique address across 5 repeated calls)
- assigned_miners_single_slot_two_miners (single slot, two distinct miners,
assigned_miners == 2 regardless of how many proofs submitted)
- assigned_miners_zero_proofs_returns_correct_counts (zero proofs still counts
intersecting miners correctly; empty epoch yields 0)
HashSet iteration order varies across runs; when multiple orphans share the same parent, the order in which they are submitted to the block pool was non-deterministic. This makes transient-tip selection stable and prevents the ordering from being observable via RPC or gossip before the next tip change resolves it. No unit test added — the effect is only observable at integration/network level (which transient tip wins before the next canonical update). The fix is a single sort_by_key call; correctness follows from H256 implementing Ord deterministically.
…eSet Switch `solutions: HashMap<H256, HashSet<BlockHash>>` to `BTreeMap<H256, BTreeSet<BlockHash>>`. BTreeSet iterates ascending by BlockHash, so `get_by_solution_hash`'s "return first match" semantics become a "return smallest BlockHash" tie-break automatically — no explicit sort needed at the call site. Previously the field was iterated in HashMap/HashSet order (RandomState), so when multiple blocks shared a solution_hash and qualified under the same arm (exact-cumdiff match, double-signing, or fallback), which one was returned varied across nodes — a consensus- fork vector for any future consensus-path caller. No production callers exist today (the function is staged for double- signing detection), but the `solutions` field has active writers in `insert_block`/`remove_block` and a separate reader in `is_known_solution_hash`, so the structure is being maintained for a future reader. Locking in determinism now means that reader can be added without a hidden fork. Companion regression tests assert stable selection across 100 iterations under both the exact-cumdiff-match arm and the `best_block` fallback arm.
Collapse OneYear+ThirtyDay duplicates into rstest cases for both direct_inclusion_sets_included_height and phase1_orphan_term_ledger_clears_included_height.
…ty test Splits the per-tx Publish proof clamp-and-check into a pure helper so unit tests can exercise the production logic byte-for-byte instead of modelling it. Adds `assigned_miners_rule_change_invisible_under_current_config`, which uses that helper to prove the consensus rule change introduced in d6e0338 is invisible under every deployed `ConsensusConfig` (mainnet/testnet/testing all have `number_of_ingress_proofs_from_assignees = 0`): - old/new `assigned_miners` values diverge (1 vs. 2 in the Alice-in-slot-1, Bob-in-slot-2 scenario), confirming the rule change is real - under `from_assignees = 0` both values produce ACCEPT (clamp + count check are both no-ops), so today's network sees no observable behavior change - under `from_assignees = 2` the values diverge in outcome (old ACCEPTS, new REJECTS with `AssignedProofCountMismatch { expected: 2, actual: 1 }`) The test pins the "no separate hardfork gate required" conclusion: activation of the new semantics is implicitly tied to the future hardfork that bumps `from_assignees > 0`, which is itself a coordinated network upgrade.
Contributor
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: ASSERTIVE Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Contributor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Describe the changes
A clear and concise description of what the bug fix, feature, or improvement is.
Related Issue(s)
Please link to the issue(s) that will be closed with this PR.
Checklist
Additional Context
Add any other context about the pull request here.