Skip to content

fix: size cache poisoning and tx offset validation#1045

Merged
glottologist merged 15 commits into
masterfrom
jason/data_size_cache_poisoning_rework
Dec 9, 2025
Merged

fix: size cache poisoning and tx offset validation#1045
glottologist merged 15 commits into
masterfrom
jason/data_size_cache_poisoning_rework

Conversation

@glottologist

@glottologist glottologist commented Dec 4, 2025

Copy link
Copy Markdown
Contributor

Describe the changes
Before:
Chunk ingress rejected chunks when the cached data_size was smaller than the chunk's claimed size, even if the cached value was unverified. Attackers could poison the cache with small data_size transactions, causing legitimate chunks to be rejected. Pre-header cache limiting used tx_offset directly, allowing attackers to evict legitimate entries by submitting high-offset chunks.

After:
Tracks data_size_confirmed flag to distinguish verified vs unverified cache entries. Unconfirmed size mismatches park chunks rather than rejecting them. Pre-header cache uses count-based limiting instead of offset-based. Adds Merkle proof verification of rightmost chunks to confirm data_size.

Changes

Chunk Validation (crates/types/src/chunk.rs)

  • Added max_valid_offset() to compute maximum valid tx_offset for a data_size
  • Added is_valid_offset() for bounds checking tx_offset against data_size
  • Added end_byte_offset_checked() with overflow protection
  • Added rstest parameterised tests for offset validation edge cases

Mempool Chunk Ingress (crates/actors/src/mempool_service/chunks.rs)

  • Track data_size_confirmed from cached data root
  • Fall back to storage modules for data_size lookup with publish/submit ledger distinction - publish is trusted, submit is attempted to be verified.
  • Park chunks when unconfirmed data_size is smaller than the chunk's claimed size
  • Added verify_data_size_from_storage_modules() for merkle proof-based verification
  • Validate tx_offset bounds before merkle path validation
  • Change pre-header limiting from offset-based to count-based
  • Added InvalidOffset(String) error variant with context

Mempool State (crates/actors/src/mempool_service.rs, pending_chunks.rs, mempool_guard.rs)

  • Added pending_chunk_count_for_data_root() method
  • Added get() method to PriorityPendingChunks for read-only access

Tests (crates/chain/tests/)

  • Updated test_overlapping_data_sizes to expect parking instead of rejection
  • Renamed preheader_rejects_out_of_cap_tx_offset to preheader_rejects_when_cache_full
  • Test now verifies count-based limiting behaviour

Related Issue(s)
Please link to the issue(s) that will be closed with this PR.

Checklist

  • Tests have been added/updated for the changes.
  • Documentation has been updated for the changes (if applicable).
  • The code follows Rust's style guidelines.

Additional Context
Add any other context about the pull request here.

@glottologist glottologist changed the title Jason/data size cache poisoning rework fix: size cache poisoning and tx offset validation Dec 4, 2025
@glottologist glottologist marked this pull request as ready for review December 4, 2025 12:04
@glottologist glottologist requested review from DanMacDonald, JesseTheRobot and antouhou and removed request for JesseTheRobot December 4, 2025 12:04
@glottologist glottologist force-pushed the jason/data_size_cache_poisoning_rework branch from 9f3eaee to f1dbfa4 Compare December 9, 2025 09:13

@JesseTheRobot JesseTheRobot left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good, but I do have a couple blocking change requests

for info in &infos.0 {
let current_max = sm_data_size.unwrap_or(0);
if info.data_size > current_max {
sm_data_size = Some(info.data_size);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a flaw here: if a submit slot has a larger data_size for a data_root, and then we find a data_size on a publish ledger SM, we will trust the larger invalid size

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in a39be0a

let mut sm_data_size: Option<u64> = None;
let mut from_publish_ledger = false;

for sm in storage_modules.iter() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably run these operations in parallel - on the larger nodes (with 40+ drives) doing these checks drive-by-drive might cause notable latency, and each drive is it's own I/O domain so there shouldn't be any issues with doing this. we should also parallelise verify_data_size_from_storage_modules for the same reason.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I am fine with this being a followup if required, but I would request we instrument this iteration & collect_data_root_infos in this PR so we can see the timing info in our telemetry so we can see if it start becoming an issue)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 26025d6

}

let num_chunks = data_size.div_ceil(chunk_size);
let max_valid_offset = num_chunks.saturating_sub(1);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this calculation is correct, but it might be worth double-checking/extracting this to a single source-of-truth function

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh you did move it to a function - can you extract it from UnpackedChunk and then change this code to use it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in da52a52

_ => continue,
};

for info in &infos.0 {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this inner loop we can leave as-is, the outer loop is the only one that would massively benefit from parallelisation


let partition_offset =
irys_types::PartitionChunkOffset::from(relative_offset as u32);
let chunk = match sm.generate_full_chunk(partition_offset) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit(perf): we just need the data_path, which we can resolve without generating the full chunk (or at least without having to read in the bytes)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feel free to extract that metadata logic from generate_full_chunk into a seperate helper function so that you can use it here (can be a followup)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in e5ff113

)
.into()),
CriticalChunkIngressError::InvalidOffset(ref msg) => {
Ok(HttpResponse::build(StatusCode::BAD_REQUEST)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to use the ApiStatusResponse / ApiError as it gives canonical JSON body formatting for errors (applies for all the HttpResponse's in this file)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 4e28ddd

@JesseTheRobot JesseTheRobot left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@glottologist glottologist merged commit 420b137 into master Dec 9, 2025
17 checks passed
@glottologist glottologist deleted the jason/data_size_cache_poisoning_rework branch December 9, 2025 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants