feat: obs follow-ups and Reth metrics bind hardening#1426
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThis PR establishes comprehensive observability infrastructure: block-tree lifecycle timestamps, block removal categorization, mempool anchor-resolution monitoring, block-pool rejection metrics, metrics configuration, a ChangesBlock lifecycle and validation observability
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/p2p/src/block_pool.rs (1)
769-773:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMove orphan-retry metric increment after successful dispatch.
Line 769 increments
record_block_pool_orphan_retry()before attemptingsend(...). If the send fails, the metric still counts a retry that was never dispatched. Increment only after a successful send to keep this counter accurate.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/p2p/src/block_pool.rs` around lines 769 - 773, The orphan-retry metric is incremented before the dispatch, which can falsely count failed sends; move the call to crate::metrics::record_block_pool_orphan_retry() so it runs only after self.sync_service_sender.send(...) succeeds. Concretely, in the block handling SyncChainServiceMessage::AttemptReprocessingBlock(prev_block_hash), perform the send via self.sync_service_sender.send(...), check the Result (the Err branch should log the error as before), and call crate::metrics::record_block_pool_orphan_retry() only in the Ok/success path (after a successful send) to ensure the metric reflects dispatched retries.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/actors/benches/anchor_resolution.md`:
- Line 8: The metric name in this document uses the underscored form
irys_mempool_pending_submit_unresolvable_anchors but should use the canonical
dotted form irys.mempool.pending_submit_unresolvable_anchors; update every
occurrence (including the other occurrence referenced) to the dotted form so
queries and dashboards remain consistent, e.g., replace
irys_mempool_pending_submit_unresolvable_anchors with
irys.mempool.pending_submit_unresolvable_anchors wherever it appears.
In `@crates/actors/src/block_tree_service.rs`:
- Around line 99-103: The now_ms function uses an unchecked cast d.as_millis()
as i64; replace it with a checked conversion using i64::try_from (or TryInto) to
avoid truncation/overflow: after obtaining the Duration from
SystemTime::now().duration_since(UNIX_EPOCH) convert d.as_millis() with
i64::try_from(...).ok() (or .and_then(...)) and fall back to 0 on error, keeping
the same unwrap_or(0) semantics; update the conversion in the now_ms function to
use these checked conversions instead of the as cast.
In `@crates/actors/src/mempool_service.rs`:
- Around line 2515-2516: The monitoring interval in mempool_service.rs uses
tokio::time::interval (variable "interval") with the default Burst behavior;
change it to use a safer missed-tick behavior (e.g., MissedTickBehavior::Skip or
::Delay) to avoid rapid catch-up executions — create the interval as before then
call interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip)
(or ::Delay) before the initial interval.tick().await; reference the "interval"
variable and the tokio::time::MissedTickBehavior enum when applying the change.
- Line 2520: The cast `count as i64` is unsafe for runtime `usize` values;
replace it with a checked conversion using `i64::try_from(count)` and handle the
`Result` (for example, on `Err` cap to `i64::MAX` and emit a warning) before
calling `crate::metrics::set_anchor_unresolvable`; update the usage around
`set_anchor_unresolvable` and the `count` variable to perform `try_from` and a
clear error/overflow handling path instead of the `as` cast.
- Around line 2514-2522: The anchor-resolution sweep task currently spawned via
handle_for_inner.spawn (assigned to sweep_handle) lacks logging; add structured
logs: emit an info/debug at task startup indicating the sweep loop has started
(include interval duration), and emit periodic debug/info logs inside the loop
(e.g., after calling sweep_inner.count_unresolvable_submit_anchors()) that
report the retrieved count and any unexpected errors; reference the spawn block
(sweep_handle), the sweep_inner.count_unresolvable_submit_anchors() call, and
the crate::metrics::set_anchor_unresolvable() call so logs surround the metric
update and surface failures or anomalous counts.
In `@crates/reth-node-bridge/src/node.rs`:
- Around line 267-279: The tests are using NodeConfig::testnet() but should use
the testing constructor; update the test setup to call
irys_types::NodeConfig::testing() (and any ConsensusConfig::testing() if
present) instead of NodeConfig::testnet() so compute_metrics_addr(&cfg, ...)
gets a test-config; locate the occurrences in the tests around
compute_metrics_addr and replace NodeConfig::testnet() with
NodeConfig::testing().
---
Outside diff comments:
In `@crates/p2p/src/block_pool.rs`:
- Around line 769-773: The orphan-retry metric is incremented before the
dispatch, which can falsely count failed sends; move the call to
crate::metrics::record_block_pool_orphan_retry() so it runs only after
self.sync_service_sender.send(...) succeeds. Concretely, in the block handling
SyncChainServiceMessage::AttemptReprocessingBlock(prev_block_hash), perform the
send via self.sync_service_sender.send(...), check the Result (the Err branch
should log the error as before), and call
crate::metrics::record_block_pool_orphan_retry() only in the Ok/success path
(after a successful send) to ensure the metric reflects dispatched retries.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 220de0ce-086c-4a47-bd2f-5c4d9c972251
📒 Files selected for processing (19)
crates/actors/benches/anchor_resolution.mdcrates/actors/src/block_tree_service.rscrates/actors/src/block_validation.rscrates/actors/src/mempool_service.rscrates/actors/src/mempool_service/lifecycle.rscrates/actors/src/metrics.rscrates/api-server/src/lib.rscrates/api-server/src/routes/mod.rscrates/api-server/src/routes/tip.rscrates/chain/src/chain.rscrates/config/templates/mainnet_config.tomlcrates/config/templates/testnet_config.tomlcrates/domain/src/models/block_tree.rscrates/p2p/src/block_pool.rscrates/p2p/src/metrics.rscrates/reth-node-bridge/src/node.rscrates/types/src/config/mod.rscrates/types/src/config/node.rsdocs/99-reference/05-tracing-guidelines.md
95ef690 to
b8cafa8
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/actors/src/metrics.rs`:
- Around line 133-139: The helpers record_block_tree_last_block_at_ms and
record_block_tree_last_reorg_at_ms (and the other similar helpers around lines
271-273) use runtime `as u64` casts; replace those casts with a checked
conversion using u64::try_from (or TryInto) on the non-negative i64 value and
handle the Result explicitly (e.g., map Err to 0 or log and return) before
calling .record(...), so you avoid unchecked `as` casts in production metric
helpers and update the same pattern in the other functions mentioned.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c031b9aa-9223-45cb-a342-f81a91ea0658
📒 Files selected for processing (19)
crates/actors/benches/anchor_resolution.mdcrates/actors/src/block_tree_service.rscrates/actors/src/mempool_service.rscrates/actors/src/mempool_service/lifecycle.rscrates/actors/src/metrics.rscrates/api-server/src/lib.rscrates/api-server/src/routes/mod.rscrates/api-server/src/routes/tip.rscrates/chain/src/chain.rscrates/config/templates/mainnet_config.tomlcrates/config/templates/testnet_config.tomlcrates/domain/src/models/block_tree.rscrates/p2p/src/block_pool.rscrates/p2p/src/metrics.rscrates/reth-node-bridge/Cargo.tomlcrates/reth-node-bridge/src/node.rscrates/types/src/config/mod.rscrates/types/src/config/node.rsdocs/99-reference/05-tracing-guidelines.md
97aefeb to
36373d0
Compare
|
Benchmark results: https://irys-xyz.github.io/irys/dev/bench/1426%2Fmerge/index.html |
Describe the changes
Before: Observability gaps in block-tree/mempool/block-pool surfaced by the 2026-05-23 fork investigation; Reth Prometheus endpoint bound 0.0.0.0 by default.
After: New OTEL counters/gauges, /v1/tip endpoint, and a [metrics] bind_ip config section defaulting to 127.0.0.1.
Related Issue(s)
Please link to the issue(s) that will be closed with this PR.
Checklist
Additional Context
Add any other context about the pull request here.
Summary by CodeRabbit
Release Notes
New Features
/v1/tipAPI endpoint returning current block information with lifecycle metrics.Chores
Documentation