fix: size cache poisoning and tx offset validation#1045
Conversation
9f3eaee to
f1dbfa4
Compare
JesseTheRobot
left a comment
There was a problem hiding this comment.
overall looks good, but I do have a couple blocking change requests
| for info in &infos.0 { | ||
| let current_max = sm_data_size.unwrap_or(0); | ||
| if info.data_size > current_max { | ||
| sm_data_size = Some(info.data_size); |
There was a problem hiding this comment.
there is a flaw here: if a submit slot has a larger data_size for a data_root, and then we find a data_size on a publish ledger SM, we will trust the larger invalid size
| let mut sm_data_size: Option<u64> = None; | ||
| let mut from_publish_ledger = false; | ||
|
|
||
| for sm in storage_modules.iter() { |
There was a problem hiding this comment.
we should probably run these operations in parallel - on the larger nodes (with 40+ drives) doing these checks drive-by-drive might cause notable latency, and each drive is it's own I/O domain so there shouldn't be any issues with doing this. we should also parallelise verify_data_size_from_storage_modules for the same reason.
There was a problem hiding this comment.
(I am fine with this being a followup if required, but I would request we instrument this iteration & collect_data_root_infos in this PR so we can see the timing info in our telemetry so we can see if it start becoming an issue)
| } | ||
|
|
||
| let num_chunks = data_size.div_ceil(chunk_size); | ||
| let max_valid_offset = num_chunks.saturating_sub(1); |
There was a problem hiding this comment.
I think this calculation is correct, but it might be worth double-checking/extracting this to a single source-of-truth function
There was a problem hiding this comment.
oh you did move it to a function - can you extract it from UnpackedChunk and then change this code to use it?
| _ => continue, | ||
| }; | ||
|
|
||
| for info in &infos.0 { |
There was a problem hiding this comment.
this inner loop we can leave as-is, the outer loop is the only one that would massively benefit from parallelisation
|
|
||
| let partition_offset = | ||
| irys_types::PartitionChunkOffset::from(relative_offset as u32); | ||
| let chunk = match sm.generate_full_chunk(partition_offset) { |
There was a problem hiding this comment.
nit(perf): we just need the data_path, which we can resolve without generating the full chunk (or at least without having to read in the bytes)
There was a problem hiding this comment.
feel free to extract that metadata logic from generate_full_chunk into a seperate helper function so that you can use it here (can be a followup)
| ) | ||
| .into()), | ||
| CriticalChunkIngressError::InvalidOffset(ref msg) => { | ||
| Ok(HttpResponse::build(StatusCode::BAD_REQUEST) |
There was a problem hiding this comment.
we need to use the ApiStatusResponse / ApiError as it gives canonical JSON body formatting for errors (applies for all the HttpResponse's in this file)
Describe the changes
Before:
Chunk ingress rejected chunks when the cached
data_sizewas smaller than the chunk's claimed size, even if the cached value was unverified. Attackers could poison the cache with smalldata_sizetransactions, causing legitimate chunks to be rejected. Pre-header cache limiting usedtx_offsetdirectly, allowing attackers to evict legitimate entries by submitting high-offset chunks.After:
Tracks
data_size_confirmedflag to distinguish verified vs unverified cache entries. Unconfirmed size mismatches park chunks rather than rejecting them. Pre-header cache uses count-based limiting instead of offset-based. Adds Merkle proof verification of rightmost chunks to confirmdata_size.Changes
Chunk Validation (
crates/types/src/chunk.rs)max_valid_offset()to compute maximum valid tx_offset for a data_sizeis_valid_offset()for bounds checking tx_offset against data_sizeend_byte_offset_checked()with overflow protectionMempool Chunk Ingress (
crates/actors/src/mempool_service/chunks.rs)data_size_confirmedfrom cached data rootverify_data_size_from_storage_modules()for merkle proof-based verificationInvalidOffset(String)error variant with contextMempool State (
crates/actors/src/mempool_service.rs,pending_chunks.rs,mempool_guard.rs)pending_chunk_count_for_data_root()methodget()method toPriorityPendingChunksfor read-only accessTests (
crates/chain/tests/)test_overlapping_data_sizesto expect parking instead of rejectionpreheader_rejects_out_of_cap_tx_offsettopreheader_rejects_when_cache_fullRelated Issue(s)
Please link to the issue(s) that will be closed with this PR.
Checklist
Additional Context
Add any other context about the pull request here.