feat(p2p): evict peers on chain_id (network) mismatch handshake rejection#1437
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughEvicts peers on outbound NetworkMismatch: adds a DB deletion helper, in-memory eviction by API address, stages deletions in the peer service and applies them in flush(), with tests and a design document. ChangesPeer eviction on network mismatch
Sequence DiagramsequenceDiagram
participant Initiator
participant announce_yourself_to_address
participant Responder
participant PeerList
participant Database
Initiator->>announce_yourself_to_address: send handshake
announce_yourself_to_address->>Responder: forward handshake
Responder->>announce_yourself_to_address: PeerResponse::Rejected(NetworkMismatch)
rect rgba(255, 100, 100, 0.5)
announce_yourself_to_address->>PeerList: remove_peer_by_api_address(api_address)
PeerList->>PeerList: clear lookup and index maps
PeerList->>announce_yourself_to_address: return Option<IrysPeerId>
announce_yourself_to_address->>announce_yourself_to_address: add id to pending_db_removals
announce_yourself_to_address->>Database: flush applies delete_peer_list_item(id)
Database->>announce_yourself_to_address: Result<bool>
end
announce_yourself_to_address->>Initiator: return PeerHandshakeRejected
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/p2p/src/peer_network_service.rs`:
- Around line 1105-1134: evict_peer_on_network_mismatch currently performs an
independent DB delete (db.update_scoped + delete_peer_list_item) which races
with the periodic flush (persistable_peers_with_mining_addr / flush), allowing
the flushed snapshot to resurrect the peer; change this so eviction is
serialized with the flush writer: do not run a standalone delete in
evict_peer_on_network_mismatch; instead mark/remove the peer in PeerList and
either (a) enqueue the peer_id for removal to the same flush/DB writer (or call
a new PeerList method that acquires the same lock used by flush and performs the
delete inside the flush's DB transaction), or (b) have flush consult a removal
set (updated by evict_peer_on_network_mismatch) and perform
delete_peer_list_item within its transaction; reference
evict_peer_on_network_mismatch, PeerList,
persistable_peers_with_mining_addr/flush, delete_peer_list_item and
db.update_scoped when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 06c910cf-cf5d-45bf-830d-63992fd7b58c
📒 Files selected for processing (4)
crates/database/src/database.rscrates/domain/src/models/peer_list.rscrates/p2p/src/peer_network_service.rsdocs/superpowers/specs/2026-06-02-gossip-handshake-enforcement-design.md
| ~50–70 LOC (eviction primitive + DB delete + one wiring arm) + tests. No wire | ||
| change, no persistence. |
There was a problem hiding this comment.
Clarify “no persistence” wording to avoid contradiction with DB eviction.
This line reads as if persistence is untouched, but the near-term fix explicitly deletes peers from PeerListItems in DB. Consider rewording to “no wire change, no new persisted state/schema” for precision.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.
| Benchmark suite | Current: 9dd1ecf | Previous: 12614c1 | Ratio |
|---|---|---|---|
get_recall_range/100 |
0.015291 ms/iter (± 0.00114) |
0.012592 ms/iter (± 0.000486) |
1.21 |
get_recall_range/1000 |
0.159634 ms/iter (± 0.006973) |
0.120043 ms/iter (± 0.002043) |
1.33 |
This comment was automatically generated by workflow using github-action-benchmark.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/p2p/src/peer_network_service.rs`:
- Around line 237-261: The code drains pending_db_removals (std::mem::take)
while holding the lock before calling db.update_scoped, which loses staged
removals if the transaction fails; instead, while holding the lock capture a
clone/copy of pending_db_removals (do not std::mem::take) and release the lock,
run update_scoped using that cloned set (removals_clone), then if update_scoped
succeeds reacquire self.state.lock().await and remove those successfully-applied
IDs from pending_db_removals (or clear them) so removals are only dropped on
successful transaction; refer to pending_db_removals, state.lock().await,
update_scoped, and the local removals variable to locate where to change the
logic.
In `@docs/superpowers/specs/2026-06-02-gossip-handshake-enforcement-design.md`:
- Around line 97-103: Update the docs to reflect that on outbound-announce
rejection (the PeerResponse::Rejected arm in announce_yourself_to_address) you
should still evict the peer from the in-memory cache using
remove_peer_by_api_address, but do NOT perform immediate DB deletion; instead
record the peer in pending_db_removals and let the DatabaseProvider-only writer
apply deletes during flush via db.update_scoped, preserving the existing error
return when rejected_response.reason ==
version::RejectionReason::NetworkMismatch.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: ad594948-d365-46ba-ac93-f7bc02bd9fd0
📒 Files selected for processing (4)
crates/database/src/database.rscrates/domain/src/models/peer_list.rscrates/p2p/src/peer_network_service.rsdocs/superpowers/specs/2026-06-02-gossip-handshake-enforcement-design.md
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/p2p/src/peer_network_service.rs (2)
246-270:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftClose the remaining snapshot/staging race.
flush()still has a crash window: it takespending_db_removals, then snapshotspeer_list, but a concurrentNetworkMismatcheviction can land after that take and before this transaction commits. In that case this flush can write the stale snapshot back toPeerListItems, and if the node exits before the next flush the peer is reloaded on restart.This needs one atomic view of both the peer snapshot and the staged removals. The simplest shape here is to serialize
peer_list.remove_peer_by_api_address(...)+pending_db_removals.insert(...)with the samestatemutex that guards the flush snapshot/take path.Possible fix shape
async fn flush(&self) -> Result<(), PeerListServiceError> { - let (db, removals) = { + let (db, removals, persistable_peers) = { let mut state = self.state.lock().await; ( state.db.clone(), std::mem::take(&mut state.pending_db_removals), + self.peer_list.persistable_peers_with_mining_addr(), ) }; - - let persistable_peers = self.peer_list.persistable_peers_with_mining_addr(); let result = db .update_scoped(|tx| { for (peer_id, peer) in persistable_peers.iter() { if removals.contains(peer_id) { continue; @@ - if let Some(peer_id) = - evict_peer_on_network_mismatch(&peer_list, api_address, &rejected_response) - { - inner.state.lock().await.pending_db_removals.insert(peer_id); + let mut state = inner.state.lock().await; + if let Some(peer_id) = + evict_peer_on_network_mismatch(&peer_list, api_address, &rejected_response) + { + state.pending_db_removals.insert(peer_id); }Also applies to: 1119-1123
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/p2p/src/peer_network_service.rs` around lines 246 - 270, The flush() path can race with concurrent NetworkMismatch evictions because the peer snapshot (persistable_peers) and staged removals (pending_db_removals / removals) are taken separately; to fix, make the snapshot and the staging of evictions atomic under the same state mutex so you cannot re-persist a peer that was evicted after the snapshot: acquire self.state.lock() around the sequence that calls peer_list.remove_peer_by_api_address(...) / pending_db_removals.insert(...) and the code that creates persistable_peers and removals, then release the lock before calling db.update_scoped(...) (leaving transaction logic unchanged); apply the same locking pattern where peer removals are staged (the peer_list.remove_peer_by_api_address and pending_db_removals.insert sites) so the update_scoped(tx -> insert_peer_list_item / delete_peer_list_item) always operates on a consistent snapshot.
1236-1236:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse
#[tokio::test]on these async tests instead of importingtokio::testand using#[test].In
crates/p2p/src/peer_network_service.rs, theasync fntests are currently annotated with#[test]afteruse tokio::test;, but the repo guideline requires using the explicit#[tokio::test]attribute for async tests:
network_mismatch_rejection_evicts_from_cache_and_returns_idnon_network_rejection_retains_peerflush_deletes_pending_removals_without_resurrectingAs per coding guidelines: “Use
#[tokio::test]for async tests”.Also applies to: 1327-1328, 1363-1364, 1396-1397
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/p2p/src/peer_network_service.rs` at line 1236, The async tests network_mismatch_rejection_evicts_from_cache_and_returns_id, non_network_rejection_retains_peer, and flush_deletes_pending_removals_without_resurrecting (and the other occurrences around 1327-1328, 1363-1364, 1396-1397) should use the #[tokio::test] attribute instead of importing tokio::test and annotating with #[test]; update each test function to replace #[test] with #[tokio::test], remove the standalone use tokio::test; import (or any unused import) if present, and ensure the async fn signatures remain unchanged so the tests run under the Tokio runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/p2p/src/peer_network_service.rs`:
- Around line 246-270: The flush() path can race with concurrent NetworkMismatch
evictions because the peer snapshot (persistable_peers) and staged removals
(pending_db_removals / removals) are taken separately; to fix, make the snapshot
and the staging of evictions atomic under the same state mutex so you cannot
re-persist a peer that was evicted after the snapshot: acquire self.state.lock()
around the sequence that calls peer_list.remove_peer_by_api_address(...) /
pending_db_removals.insert(...) and the code that creates persistable_peers and
removals, then release the lock before calling db.update_scoped(...) (leaving
transaction logic unchanged); apply the same locking pattern where peer removals
are staged (the peer_list.remove_peer_by_api_address and
pending_db_removals.insert sites) so the update_scoped(tx ->
insert_peer_list_item / delete_peer_list_item) always operates on a consistent
snapshot.
- Line 1236: The async tests
network_mismatch_rejection_evicts_from_cache_and_returns_id,
non_network_rejection_retains_peer, and
flush_deletes_pending_removals_without_resurrecting (and the other occurrences
around 1327-1328, 1363-1364, 1396-1397) should use the #[tokio::test] attribute
instead of importing tokio::test and annotating with #[test]; update each test
function to replace #[test] with #[tokio::test], remove the standalone use
tokio::test; import (or any unused import) if present, and ensure the async fn
signatures remain unchanged so the tests run under the Tokio runtime.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cad070b9-92b2-49e9-9c8b-d26c081ba44b
📒 Files selected for processing (1)
crates/p2p/src/peer_network_service.rs
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/p2p/src/peer_network_service.rs (1)
1120-1138:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't mark rejected handshakes as successful announcements.
By this branch, Line 1050 has already sent
AnnouncementFinished { success: true }, so a rejected handshake still lands insuccessful_announcements. Line 526 then suppresses fresh non-forced handshakes to the same API address forSUCCESSFUL_ANNOUNCEMENT_CACHE_TTL, even though this path just evicted it. Only emit the success notification afterPeerResponse::Accepted; send a non-retry failure forPeerResponse::Rejectedinstead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/p2p/src/peer_network_service.rs` around lines 1120 - 1138, The code currently emits AnnouncementFinished { success: true } before handling PeerResponse::Rejected, causing rejected handshakes to be recorded in successful_announcements; instead, only emit the success announcement after matching PeerResponse::Accepted and for PeerResponse::Rejected emit a failure (non-retry) announcement so the successful_announcements cache is not populated for evicted peers. Concretely, move the AnnouncementFinished { success: true } emission out of the common pre-response path and into the PeerResponse::Accepted branch, and in the PeerResponse::Rejected branch send a failure announcement (with success: false or a distinct non-retry event) before calling evict_peer_on_network_mismatch and inserting into state.pending_db_removals; update any handlers that expect the announcement shape accordingly (references: AnnouncementFinished, successful_announcements, PeerResponse::Accepted, PeerResponse::Rejected, evict_peer_on_network_mismatch, pending_db_removals).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/p2p/src/peer_network_service.rs`:
- Around line 1120-1138: The code currently emits AnnouncementFinished {
success: true } before handling PeerResponse::Rejected, causing rejected
handshakes to be recorded in successful_announcements; instead, only emit the
success announcement after matching PeerResponse::Accepted and for
PeerResponse::Rejected emit a failure (non-retry) announcement so the
successful_announcements cache is not populated for evicted peers. Concretely,
move the AnnouncementFinished { success: true } emission out of the common
pre-response path and into the PeerResponse::Accepted branch, and in the
PeerResponse::Rejected branch send a failure announcement (with success: false
or a distinct non-retry event) before calling evict_peer_on_network_mismatch and
inserting into state.pending_db_removals; update any handlers that expect the
announcement shape accordingly (references: AnnouncementFinished,
successful_announcements, PeerResponse::Accepted, PeerResponse::Rejected,
evict_peer_on_network_mismatch, pending_db_removals).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: b943b519-7745-4db1-9080-4c8ea1e6c072
📒 Files selected for processing (1)
crates/p2p/src/peer_network_service.rs
281ebbb to
b5c80d6
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/domain/src/models/peer_list.rs`:
- Around line 1440-1480: Update the test
remove_peer_by_api_address_clears_all_lookups to also verify the gossip lookup
and emitted removal event: after calling
peer_list.remove_peer_by_api_address(&api), assert that
peer_list.peer_by_gossip_address(&peer.address.gossip) (or the appropriate
gossip address field) returns None, and assert that the PeerRemoved event was
emitted (check the event stream/queue or peer_list.recent_events()/last_event()
for a PeerRemoved containing the removed peer_id); reference the
functions/methods peer_by_gossip_address, remove_peer_by_api_address and the
PeerRemoved event type to locate where to add these assertions.
In `@crates/p2p/src/peer_network_service.rs`:
- Around line 1120-1142: The code currently treats any HTTP-level
Ok(peer_response) as an announcement success, which lets NetworkMismatch
rejections remain recorded in successful_announcements; instead, only emit the
AnnouncementFinished { success: true, retry: false } notification from the
PeerResponse::Accepted arm and for PeerResponse::Rejected return/report an
announcement with success: false, retry: false (so handle_handshake_request()
won't cache rejected peers in successful_announcements). Update the
PeerResponse::Rejected branch (where evict_peer_on_network_mismatch(...) and
pending_db_removals.insert(...) are done) to report failure (success:false,
retry:false) and keep returning
PeerListServiceError::PeerHandshakeRejected(rejected_response), and move the
success notification logic into the PeerResponse::Accepted branch; ensure
references to successful_announcements handling and any AnnouncementFinished
emission are adjusted accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 58c48eb6-1246-4a22-bfda-c8397c8d3670
📒 Files selected for processing (4)
crates/database/src/database.rscrates/domain/src/models/peer_list.rscrates/p2p/src/peer_network_service.rsdocs/superpowers/specs/2026-06-02-gossip-handshake-enforcement-design.md
…tion A handshake rejected for a chain_id mismatch (mapped to NetworkMismatch) only recorded a failed announcement and left the peer in the cache. The gossip data plane (check_peer_v*) trusts cache membership rather than handshake outcome, so the rejected peer stayed fully trusted and the node kept exchanging gossip with a different network. Now a NetworkMismatch handshake rejection evicts the peer from the in-memory cache (all lookup maps) and deletes it from the persistent peer DB, so a node re-announcing to its cached peers while on the wrong chain is isolated after the startup announce round. - domain: PeerList::remove_peer_by_api_address removes a peer from the cache and every index map, emitting PeerRemoved. - database: delete_peer_list_item removes a peer from the PeerListItems table. - p2p: evict_peer_on_network_mismatch wired into the outbound-announce rejection path; no-op for any non-network rejection reason. Includes a design doc covering the diagnosis and the larger follow-up (session-scoped handshake + authenticated handshake response).
… resurrect race evict_peer_on_network_mismatch deleted the peer in its own db.update_scoped transaction, which raced with the periodic flush: if flush had already snapshotted persistable_peers before the eviction removed the peer from the cache, flush's insert re-inserted the just-deleted peer, resurrecting it in the DB (reloaded on the next restart). Make flush the sole peer-DB writer. Eviction now removes the peer from the in-memory cache and stages its peer_id in PeerNetworkServiceState.pending_db_removals; flush drains the set and, within its single transaction, skips re-inserting any staged peer and deletes them. The in-memory eviction stays immediate, so the gossip data plane stops trusting the peer at once.
flush() discarded the inner transaction Result via `let _ =`, silently dropping insert/delete failures (it could even report success after a failed delete), and it `mem::take`d pending_db_removals before the write so a failed flush lost the staged chain-mismatch eviction deletes entirely. Flatten the nested update_scoped result so inner PeerListServiceError values propagate, and re-stage removals (merge, to preserve evictions staged during the lock-free write window) when the flush fails so the next flush retries. Deletes are idempotent, so the retry is safe.
flush() snapshotted persistable_peers (peer-list lock) and drained pending_db_removals (state lock) separately, so a concurrent NetworkMismatch eviction could be observed half-applied and a just-evicted peer re-persisted. Take both under the same state lock, and perform the eviction's cache removal and removal staging under that same lock, so flush always sees a consistent (snapshot, removals) pair. Lock ordering is state -> peer_list on both paths (peer-list methods never acquire the state lock), so no deadlock. The lock is released before the DB write; an eviction during that lock-free write can still re-persist a peer for one flush cycle, which is benign (the in-memory eviction is already in effect and the next flush deletes it).
The near-term design's step 3 described an inline DB delete in the eviction path. The implementation defers the delete: eviction removes the peer from the cache and stages its id in pending_db_removals (under the same state lock flush takes), and flush — the sole peer-DB writer — applies delete_peer_list_item in its transaction. Update step 3 to match.
…uncements
announce_yourself_to_address sent AnnouncementFinished{success:true} for any
transport-level Ok response, including PeerResponse::Rejected, so a rejection
(e.g. NetworkMismatch) was cached in successful_announcements and suppressed
future announce attempts as if it had succeeded.
Move the success notification into the Accepted arm; the Rejected arm now reports
success:false, retry:false (terminal — re-announcing won't change the peer's
mind) before evicting on NetworkMismatch and returning PeerHandshakeRejected.
…viction remove_peer_by_api_address_clears_all_lookups now also asserts the gossip-address index returns None after eviction and that a PeerRemoved event carrying the evicted peer_id is emitted (subscribed before the removal).
b5c80d6 to
9a2f23a
Compare
Problem
PR #1435 hard-rejects handshakes from a different
chain_id, but a node configured for a different chain still kept communicating on devnet. Root cause: the chain_id check lives only in the handshake, while the gossip data plane (check_peer_v*) authorizes purely on cache membership + source IP — it never consults handshake outcome.When our outbound announce is rejected for a chain mismatch (
ChainIdMismatch→NetworkMismatch), the rejection was only recorded infailed_announcements— a map that is written but never read. The peer stayed in the cache and fully trusted, so gossip kept flowing.Fix
A
NetworkMismatchhandshake rejection now evicts the peer from the in-memory cache (all lookup maps) and deletes it from the persistent peer DB. A node re-announcing to its cached peers while on the wrong chain is therefore isolated after the startup announce round (every upgraded peer rejects it → each is evicted → empty peer set).This relies on peers enforcing #1435, which is acceptable since the target devnet is fully upgraded. Self-enforcing variants and the larger "always require a session handshake + authenticated handshake response" work are scoped in the included design doc for a follow-up.
Changes
peer_list.rs):PeerList::remove_peer_by_api_address— removes a peer from the persistent cache/purgatory and every index map, emittingPeerRemoved(mirrors the existing purgatory-eviction cleanup).database.rs):delete_peer_list_item— removes a peer from thePeerListItemstable (mirror ofinsert_peer_list_item).peer_network_service.rs):evict_peer_on_network_mismatchwired into the outbound-announceRejectedarm; no-op for any non-network rejection reason.Test plan
Added (TDD, red→green):
remove_peer_by_api_address_clears_all_lookups(domain)delete_peer_list_item_removes_it(database)network_mismatch_rejection_evicts_peer_from_cache_and_db(p2p)non_network_rejection_retains_peer(p2p)cargo fmt --allclean;cargo clippy -p irys-p2p -p irys-domain -p irys-database --tests --all-targetsclean. Existing rejection-coverage tests still pass. Full-workspacecargo xtask testnot yet run.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests