Skip to content

Fix/assigned miners determinism#1423

Draft
JesseTheRobot wants to merge 30 commits into
masterfrom
fix/assigned-miners-determinism
Draft

Fix/assigned miners determinism#1423
JesseTheRobot wants to merge 30 commits into
masterfrom
fix/assigned-miners-determinism

Conversation

@JesseTheRobot

Copy link
Copy Markdown
Member

Describe the changes
A clear and concise description of what the bug fix, feature, or improvement is.

Related Issue(s)
Please link to the issue(s) that will be closed with this PR.

Checklist

  • Tests have been added/updated for the changes.
  • Documentation has been updated for the changes (if applicable).
  • The code follows Rust's style guidelines.

Additional Context
Add any other context about the pull request here.

Adds find_canonical_ledger_range() for finding the Submit-ledger
chunk range a tx contributed to its confirming canonical block,
without consulting CachedDataRoots.

Two-stage lookup:
  1. Migrated path: IrysDataTxMetadata.included_height +
     MigratedBlockHashes (O(1) DB read).
  2. Pre-migration fallback: walk block_tree within
     block_migration_depth, filtered to ChainState::Onchain.

Foundation for treating CachedDataRoots.block_set as a non-trusted
hint rather than a canonical-inclusion oracle.
Replaces the CachedDataRoots.block_set walk in get_assigned_ingress_proofs
with a single call to tx_inclusion::find_canonical_ledger_range.  The
function now takes parent_height and looks up the canonical Submit-ledger
range for the tx directly, instead of iterating block_set and calling
get_ledger_range on each entry.

block_set is append-only across reorgs — orphaned hashes are retained
even after the blocks they point to are purged from the block_tree and
DB.  That is the validation-side root cause of the 2026-05-15 devnet
self-fork: get_ledger_range returned None on an orphaned hash, surfacing
as BlockBoundsLookupError on an otherwise-valid Publish promotion.

Validation must not depend on block_set.  Removes the now-unused
get_ledger_range and get_block_by_hash helpers and the
cached_data_root_by_data_root import.  Updates tx_selector to pass
current_height to the renamed signature.

Adds a regression test (assigned_ingress_proofs_uses_canonical_tx_metadata)
that verifies validation succeeds for a canonically-confirmed tx without
reading CachedDataRoots.
Updates try_generate_ingress_proof_for_root to verify canonical tx
confirmation via tx_inclusion::find_canonical_ledger_range as the
trust root.  block_set.is_empty() remains as a cheap early-return
hint that short-circuits the helper lookup when no BlockConfirmed
event has ever touched the data_root.

Documents block_set as a hint-only field on CachedDataRoot — it is
append-only across reorgs and is NOT fork-tolerant, so consumers
must verify canonical truth via the helper rather than trusting
block_set as a canonical-inclusion oracle.  Updates the comment in
cache_data_root and the chunk-ingress notification in mempool
lifecycle to match the new framing.
Replaces the CachedDataRoots.block_set walk in prune_data_root_cache
with a lookup over txid_set against IrysDataTxMetadata.included_height.

Migrated tx metadata gives a stable canonical inclusion height that
survives reorgs, while block_set is append-only and retains orphan
hashes.  No block_tree fallback is needed: pre-migration entries are
still gated by expiry_height (set at cache_data_root time) and by the
local-proof exemption further below.
The early-return on empty `block_set` was treating absence as proof of
non-canonicity, but `block_set` is only populated by `BlockConfirmed`
events.  After a node restart where the confirmation was missed, or
before the `TryGenerateProofsForConfirmedRoots` notification fires, a
data_root can be canonically confirmed with an empty `block_set` — and
the short-circuit would permanently skip ingress proof generation.

Drop the hint and always run `tx_inclusion::find_canonical_ledger_range`
as the trust root.  The extra DB read is cheap relative to the
correctness guarantee, and matches the branch's stated direction of
not trusting `block_set` for canonical claims.
The prior comment claimed pre-migration entries were "already covered
by expiry_height", but `cache_data_root` *clears* expiry_height when a
block_header is provided.  So a confirmed-but-not-yet-migrated entry
has neither expiry_height nor included_height and lands in the
conservative `(None, None)` skip arm.

No behavior change: the entry becomes prunable once migration writes
included_height (within ~block_migration_depth blocks), and the
local-proof exemption further below covers the most important case.
Just makes the rationale match what the code actually does so future
maintainers aren't misled.
…aRoot.block_set

Previously block_set was populated only by mempool's `on_block_confirmed`.
That left windows where the set could drift from the canonical chain —
missed events, mid-loop crashes, cold-start replay, DB-write errors the
handler logs and skips — and it never removed hashes for blocks that
were subsequently reorg'd out.

Move authoritative block_set maintenance into
`BlockMigrationService::persist_metadata` so it commits atomically with
the canonical metadata write (IrysDataTxMetadata.included_height +
MigratedBlockHashes):

- Phase 1 augmentation: scrub the orphaned block_hash from any
  CachedDataRoot.block_set referencing a block in `blocks_to_clear`.
- Phase 3 (new): append the new tip's block_hash to any existing CDR
  row for Submit-ledger txs in `blocks_to_confirm`.  Update-only — no
  fabricating cache entries for data_roots the node never tracked.

Two new helpers in irys-database back these phases:
`update_data_root_block_set` (idempotent append + clear expiry_height)
and `remove_data_root_block_set_entry`.  `cache_data_root` is unchanged
but is now an idempotent follow-up rather than the source of truth.

With block_set now authoritative, revert chunk_ingress to the cheap
`cdr.block_set.is_empty()` gate (the previous per-txid
`tx_inclusion::find_canonical_ledger_range` call was O(N x DB read) on
every chunk arrival).  `tx_inclusion` stays as the canonical lookup
for validation and cache pruning where range/slot truth is needed.

Doc updates across CachedDataRoot, cache_data_root, on_block_confirmed,
and the assigned_ingress_proofs regression test reflect the new
authoritative-writer + idempotent-follow-up model.
The 100s value (Duration::from_millis(100000)) is far longer than any
realistic test wait and would mask timeout-dependent behaviour around
the circuit breaker. Drop both stubs to 5s.
Adds three counters and enriches log call-sites with fields that were
already in scope but not being emitted.  Motivated by the 2026-05-15
devnet fork postmortem: the underlying failure modes were all silent or
correlation-only at the time they happened.

New metrics:
  - irys.cache.cached_data_root_evicted_total
  - irys.block.pre_validation_failed_total{reason}
  - irys.chain_sync.block_rejected_total{reason}

Enriched logs:
  - cached_data_root.write / .write_failed (lifecycle.rs + data_txs.rs):
    structured data_root, tx.id, source, incoming_block fields.
  - cached_data_root.evict (cache_service.rs): block_set at eviction,
    inclusion height, expiry height, has_local_proof (variable, so a
    future refactor that breaks the early-return invariant shows up).
  - tx_selector.assigned_proofs_lookup_failed: tx.id, data_root,
    current_height, error.
  - block validation failed: %reason from new PreValidationError /
    ValidationError::metric_reason() helpers (bounded cardinality).
  - block_tree.transition: mark_tip / validated-as-Fork / remove_block
    state changes are now explicit instead of inferrable only from
    mark_tip's absence.
  - Failed to remove block from cache: enriched with height /
    prev_state / reason that were already computed but discarded.
  - chain_sync.pull_and_process_block_failed: rejection_reason tag
    derived from existing err variant.
`compute_submit_range` short-circuited on `total == 0` before running
the `total < prev_total` regression check, so a `prev_total > 0,
total = 0` header pair (data corruption) silently returned `Ok(None)`
instead of `Err`.  The pre-consolidation `get_ledger_range` ordered
these correctly; the refactor inverted them.

Drop the early return — every reachable case is already covered:
zero/zero falls into the `total == prev_total` arm, regression falls
into the `total < prev_total` arm, and the existing `total - 1`
subtraction stays safe because we only reach it when `total >
prev_total >= 0`.

Includes a regression test that constructs the exact corruption shape
(`prev_total = 100, total = 0`) and asserts `Err`.
The existing `assigned_ingress_proofs_uses_canonical_tx_metadata` and
sibling `corrupt_cdr_does_not_affect_assigned_ingress_proofs` both pass
an empty proof slice, so they return early at the
`find_canonical_ledger_range` lookup and never enter the per-proof loop
where slot ranges are intersected against the canonical block range.
That loop is where the branch's behavioral change actually lives
(single canonical range vs. the historical block_set walk).

Add `assigned_ingress_proofs_classifies_intersecting_proof`: reuses the
same h0/h1 canonical setup so the helper resolves `block_range = [10,
24]`, registers one signer to Submit slot 1 (chunks `[10, 20)` under
`ConsensusConfig::testing().num_chunks_in_partition = 10`, so the slot
range intersects), feeds one `generate_ingress_proof`-signed proof for
the canonical data_root, and asserts the proof is classified as
assigned with `miners == 1`.

Regression guard against future changes to the intersection logic and
slot-address counting.
`find_canonical_ledger_range` had three sites that silently degraded
on missing canonical header data:

1. Migrated path: `MigratedBlockHashes[h] = hash` followed by a missing
   `IrysBlockHeaders[hash]` returned `Ok(None)`.  The two tables
   disagree — surface as Err.
2. Migrated path: a non-genesis block whose `previous_block_hash`
   missed in `IrysBlockHeaders` fell through to
   `compute_submit_range(&block, None)`, which uses `prev_total = 0`
   and silently produces the wrong range `[0, total-1]` instead of
   `[prev_total, total-1]`.  Treat as corruption.
3. Block-tree path: same prev-lookup hole when neither tree nor DB has
   the parent for a non-genesis block.  Same silent-wrong-range
   outcome; same fix.

Also reworks the pre-migration block-tree walk to anchor its migration
window to `max_height` rather than the chain tip.  The old tip-anchored
slice silently dropped canonical confirmations whenever `max_height`
was far enough below the tip that the slice contained only blocks with
`height > max_height` — which is exactly the situation reorg-replay
and historical-height validation hit.

Errors now propagate to `PreValidationError::BlockBoundsLookupError`,
which already has a `reason` tag and counter wired up.  Three new
regression tests cover the inconsistency shapes and the
max_height-anchored window.
The block-tree fallback uses `rposition` to anchor its scan window to
max_height; if max_height precedes every canonical entry that
short-circuits to Ok(None).  In production this is reachable when the
canonical cache has rolled past the requested height (tip thousands of
blocks ahead, cache holds only the last block_tree_depth).

Add a focused unit test that constructs the scenario by seating the
genesis at height 5 (no tx_ids so the seal check passes) with the
tx_id on h6, then queries max_height=3.  Verifies the helper returns
Ok(None) without panicking on the slice bounds.  Guards against a
future refactor that drops the rposition None short-circuit.
…tate

Under the lifetime model, a CachedDataRoot is bounded by either
expiry_height (pre-confirmation, set by cache_data_root_with_expiry)
or the canonical included_height of its txid_set (post-confirmation,
set atomically with the tip change in
BlockMigrationService::persist_metadata).  The local-proof exemption
in prune_data_root_cache extends the post-confirmation bound while a
local ingress proof exists.

Reaching the (None, None) arm — no inclusion height for any tx in
txid_set AND no expiry_height — means the entry has lost both bounds.
Reachable only when a reorg cleared included_height for an orphaned tx,
expiry_height had been cleared at the now-orphaned confirmation, and
the mempool re-anchor path did not restore expiry (e.g. orphan
resubmission failed).  Such entries should not exist.

Previously the prune walk skipped them indefinitely.  Now they are
treated as eviction candidates, with the local-proof exemption
preserved so entries carrying a useful commitment survive regardless
of bound state.  Net change: anomalous CDRs are actively reaped
instead of accumulating.

Additionally, surface anomalous writes at the source: cache_data_root
and remove_data_root_block_set_entry now warn! when the post-write CDR
has block_set empty and expiry_height None, so operators can detect
the producer paths instead of only seeing eventual prune evictions.

Updates the existing does_not_prune_unconfirmed_data_roots fixture to
set expiry_height (mirroring cache_data_root_with_expiry) so it tests
the realistic pre-confirmation state instead of the anomalous one.
Adds prunes_anomalous_none_none_state and
does_not_prune_anomalous_none_none_state_with_local_proof covering
the new behaviour.
ValidationError::metric_reason() returned a stable snake_case tag for
every variant, but block_tree_service only fed pre-validation reasons
into irys.block.pre_validation_failed_total — non-pre-validation tags
were computed and used as the %reason log field but never reached a
counter.  Asymmetric and a footgun: a future maintainer adding a new
ValidationError variant would reasonably assume the counter records
their tag.

Add a parallel counter irys.block.validation_failed_total{reason}
for the post-pre-validation failure modes (VDF, EL, shadow tx,
commitment, etc.) and route every ValidationError to exactly one
counter via is_pre_validation().  Reason taxonomy is shared.

Doc updates on PreValidationError::metric_reason,
ValidationError::metric_reason, and is_pre_validation reflect the
two-counter routing.
Make the rationale explicit: the conversion is infallible in practice
(usize is at least u32 on every supported target), but kept as
try_from + expect rather than `as usize` so a future widening of
block_migration_depth to u64 surfaces here at the call site instead
of truncating silently.
Direct DB-level tests pin the idempotency / no-op / expiry-handling
semantics of the two helpers `BlockMigrationService::persist_metadata`
relies on:

- update_data_root_block_set:
  - returns false when CDR missing (must not fabricate)
  - appends hash + clears expiry when both apply
  - is idempotent when hash already present and expiry already cleared
  - still clears expiry when hash already present

- remove_data_root_block_set_entry:
  - returns false when CDR missing
  - returns false when hash absent from block_set
  - removes hash and leaves expiry_height untouched per docblock
  - leaves CDR in (block_set=[], expiry_height=None) when that's the
    natural outcome — the warn! at the call site is the operator
    signal; the helper still completes successfully
The corruption case (`total < prev_total`) was already covered.  Add
the three happy paths so the arithmetic regresses cheaply:

- genesis (prev = None, prev_total = 0, total > 0) → [0, total - 1]
- intermediate (total > prev_total, both > 0) → [prev_total, total - 1]
- zero delta (total == prev_total) → None
- zero/zero edge → None (regression check must let this through to
  the equality arm rather than erroring on it)

Previously these were only exercised indirectly through the e2e
`migrated_tx_returns_canonical_range` /
`pre_migration_tx_returns_range_via_block_tree` tests.  Direct cases
isolate the arithmetic so a regression surfaces at the unit level.
Focused unit tests for the authoritative-writer entry point.  The
function only touches self.db, so the other service fields
(block_index_guard, cache, chunk_migration_sender, supply_state) are
satisfied with cheap placeholders.

Covers:

- Phase 3 normal advance: existing CDR for a Submit tx in
  blocks_to_confirm gets the tip's block_hash appended and
  expiry_height cleared.
- Phase 3 update-only contract: a Submit tx in blocks_to_confirm with
  no corresponding CDR row produces no CDR write.
- Phase 1 scrub on reorg: existing CDR carrying the orphaned block's
  hash has that hash removed; expiry_height left untouched per
  remove_data_root_block_set_entry's docblock.
- Phase 1 before Phase 3 ordering: same Submit tx in both an orphaned
  block and a new canonical block — Phase 1 scrubs the orphan hash
  first, Phase 3 appends the new canonical hash second.  Final state
  has only the new hash.  Guards the ordering invariant.

Previously the persist_metadata changes (`c8e79e48a`) were only
covered indirectly via reorg system tests.
…ers only

Phase 1 (orphan clear) previously called batch_clear_data_tx_metadata across
all ledger tx_ids, destroying the Submit-block included_height when only the
Publish-promotion was being orphaned. Now uses per-ledger semantics:
  - Term ledgers (Submit / OneYear / ThirtyDay): batch_clear_data_tx_included_height
  - Publish ledger: batch_clear_data_tx_promoted_height (existing)

Phase 2 (persist_metadata) and persist_block both called
batch_set_data_tx_included_height for every ledger, silently overwriting the
Submit-block height when processing a Publish block. Now gated:
  - Publish: set promoted_height only
  - Term ledgers: set included_height only

Adds clear_data_tx_included_height / batch_clear_data_tx_included_height
helpers to irys-database, mirroring the existing promoted_height variants.

Tests added:
- submit_to_publish_promotion_preserves_included_height
- oneyear_direct_inclusion_sets_included_height
- thirtyday_direct_inclusion_sets_included_height
- phase1_orphan_publish_only_preserves_submit_included_height
- phase1_orphan_submit_clears_included_height_and_deletes_row
- phase1_orphan_oneyear_clears_included_height
- phase1_orphan_thirtyday_clears_included_height
- clear_included_height_only_set_deletes_row (db_index)
- clear_included_height_preserves_promoted_height (db_index)
- clear_included_height_missing_row_is_noop (db_index)
- batch_clear_included_height_happy_path (db_index)
…_block_confirmed_updates

apply_block_confirmed_updates accepted `submit_and_publish_txids` and
unconditionally set `included_height` on every tx found in that slice,
including Publish-promoted txs. This overwrote the Submit-block height
recorded at first confirmation with the Publish-block height.

Fix: rename the parameter to `term_ledger_txids` and update the call-site
in handle_block_confirmed_message to build the slice from Submit +
OneYear + ThirtyDay only (excluding Publish). Publish txids continue to
flow through the existing phase-3 path which sets `promoted_height` only.

Also removes the misleading `overwrite = true` debug field from the log.

Tests added:
- publish_confirmation_does_not_overwrite_submit_included_height
- oneyear_direct_confirmation_sets_included_height
- thirtyday_direct_confirmation_sets_included_height
- publish_only_tx_gets_promoted_height_not_included_height
- success_path_applies_all_three_update_phases_atomically (updated)
…bug_asserts

P1-1: Swap find_canonical_ledger_range to prefer block_tree over migrated DB.
The tree is the live source of truth; DB is the long-term backstop.  Tree-first
eliminates timing edge cases at the migration boundary and is more defensive
against future DB metadata bugs.

P1-2: Add Submit-membership guard in lookup_via_migrated_metadata.  After P0-1,
included_height for OneYear/ThirtyDay txs points at their own term-ledger block.
If a future caller passes such a tx_id, compute_submit_range would return a
wrong Submit delta.  Mirror the guard already present in lookup_via_block_tree.

P3-1: Two debug_asserts for invariant documentation:
- block_migration_depth > 0 in lookup_via_block_tree
- prev.is_some() || height == 0 in compute_submit_range

Tests: tree_wins_over_corrupt_db_entry (P1-1), migrated_oneyear_only_tx_returns_none,
migrated_thirtyday_only_tx_returns_none, migrated_submit_tx_returns_correct_range_after_guard (P1-2).
…gress_proofs

CONSENSUS RULE CHANGE — must be coordinated with a network upgrade.

Old semantics: assigned_miners was set inside the per-proof loop, once per
intersecting slot found by iterating `slot_ranges: HashMap<usize, LedgerChunkRange>`
(non-deterministic RandomState). When a miner held multiple intersecting slots the
value written was whichever slot the HashMap yielded first. The value was also
re-overwritten on each proof iteration ("last write wins"), so the final result
depended on both proof ordering and HashMap randomness — a consensus-fork vector.

New semantics: assigned_miners = count of unique miner addresses across ALL
Submit-ledger slots whose chunk range intersects block_range. This is a pure
function of (block_range, epoch_snapshot) computed once, before the per-proof loop.
The per-proof loop now only classifies proofs as assigned-or-not.

Why the new semantics is correct:
  1. Deterministic: same result on every node regardless of HashMap key order or
     proof submission order.
  2. Matches the variable name and the intent of the caller (the
     expected_assigned_proofs clamp): "how many distinct miners are eligible to
     contribute an assigned proof?" is a property of the epoch, not the proof set.
  3. Robust: if the "same miner never assigned to the same slot twice" invariant
     is ever violated, we still get a correct unique-address count.

When block_range spans multiple slots whose miner sets overlap, the new value is
>= the old value (strictly larger if any miner covers more than one intersecting
slot). This is a consensus rule change.

Also subsumes the pre-existing "last write wins" latent bug in the old proof loop
(the rewrite naturally eliminates it).

Tests added:
- assigned_miners_multi_slot_unique_address_count (regression: same miner in two
  slots, must equal 1 unique address across 5 repeated calls)
- assigned_miners_single_slot_two_miners (single slot, two distinct miners,
  assigned_miners == 2 regardless of how many proofs submitted)
- assigned_miners_zero_proofs_returns_correct_counts (zero proofs still counts
  intersecting miners correctly; empty epoch yields 0)
HashSet iteration order varies across runs; when multiple orphans share the
same parent, the order in which they are submitted to the block pool was
non-deterministic. This makes transient-tip selection stable and prevents
the ordering from being observable via RPC or gossip before the next tip
change resolves it.

No unit test added — the effect is only observable at integration/network
level (which transient tip wins before the next canonical update). The fix
is a single sort_by_key call; correctness follows from H256 implementing
Ord deterministically.
…eSet

Switch `solutions: HashMap<H256, HashSet<BlockHash>>` to
`BTreeMap<H256, BTreeSet<BlockHash>>`.  BTreeSet iterates ascending by
BlockHash, so `get_by_solution_hash`'s "return first match" semantics
become a "return smallest BlockHash" tie-break automatically — no
explicit sort needed at the call site.

Previously the field was iterated in HashMap/HashSet order
(RandomState), so when multiple blocks shared a solution_hash and
qualified under the same arm (exact-cumdiff match, double-signing, or
fallback), which one was returned varied across nodes — a consensus-
fork vector for any future consensus-path caller.

No production callers exist today (the function is staged for double-
signing detection), but the `solutions` field has active writers in
`insert_block`/`remove_block` and a separate reader in
`is_known_solution_hash`, so the structure is being maintained for a
future reader.  Locking in determinism now means that reader can be
added without a hidden fork.

Companion regression tests assert stable selection across 100
iterations under both the exact-cumdiff-match arm and the
`best_block` fallback arm.
Collapse OneYear+ThirtyDay duplicates into rstest cases for both
direct_inclusion_sets_included_height and
phase1_orphan_term_ledger_clears_included_height.
…ty test

Splits the per-tx Publish proof clamp-and-check into a pure helper so unit
tests can exercise the production logic byte-for-byte instead of modelling it.
Adds `assigned_miners_rule_change_invisible_under_current_config`, which uses
that helper to prove the consensus rule change introduced in d6e0338 is
invisible under every deployed `ConsensusConfig` (mainnet/testnet/testing all
have `number_of_ingress_proofs_from_assignees = 0`):

- old/new `assigned_miners` values diverge (1 vs. 2 in the Alice-in-slot-1,
  Bob-in-slot-2 scenario), confirming the rule change is real
- under `from_assignees = 0` both values produce ACCEPT (clamp + count check
  are both no-ops), so today's network sees no observable behavior change
- under `from_assignees = 2` the values diverge in outcome (old ACCEPTS,
  new REJECTS with `AssignedProofCountMismatch { expected: 2, actual: 1 }`)

The test pins the "no separate hardfork gate required" conclusion: activation
of the new semantics is implicitly tied to the future hardfork that bumps
`from_assignees > 0`, which is itself a coordinated network upgrade.
@coderabbitai

coderabbitai Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fc6a65fa-4dfe-458f-a5ee-0b0694a72223

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/assigned-miners-determinism

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant