feat: obs follow-ups and Reth metrics bind hardening by glottologist · Pull Request #1426 · Irys-xyz/irys

glottologist · 2026-05-26T14:35:05Z

Describe the changes
Before: Observability gaps in block-tree/mempool/block-pool surfaced by the 2026-05-23 fork investigation; Reth Prometheus endpoint bound 0.0.0.0 by default.

After: New OTEL counters/gauges, /v1/tip endpoint, and a [metrics] bind_ip config section defaulting to 127.0.0.1.

Related Issue(s)
Please link to the issue(s) that will be closed with this PR.

Checklist

Tests have been added/updated for the changes.
Documentation has been updated for the changes (if applicable).
The code follows Rust's style guidelines.

Additional Context
Add any other context about the pull request here.

Summary by CodeRabbit

Release Notes

New Features
- Added block lifecycle tracking to record canonical block advances and reorganizations with timestamps.
- Introduced /v1/tip API endpoint returning current block information with lifecycle metrics.
- Added monitoring for unresolvable submit anchors in the mempool.
Chores
- Implemented OpenTelemetry metrics for block lifecycle timestamps and mempool monitoring.
- Added block pool rejection and retry tracking metrics.
- Updated configuration templates with metrics endpoint settings.
Documentation
- Added debugging guidelines for error tracing patterns.

coderabbitai · 2026-05-26T14:35:26Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6d03bdb6-a3ea-47cf-86d1-9a4174d3fdfe

📥 Commits

Reviewing files that changed from the base of the PR and between 97aefeb and 36373d0.

📒 Files selected for processing (6)

crates/actors/src/block_tree_service.rs
crates/actors/src/mempool_service.rs
crates/actors/src/metrics.rs
crates/p2p/src/block_pool.rs
crates/reth-node-bridge/Cargo.toml
crates/reth-node-bridge/src/node.rs

📝 Walkthrough

Walkthrough

This PR establishes comprehensive observability infrastructure: block-tree lifecycle timestamps, block removal categorization, mempool anchor-resolution monitoring, block-pool rejection metrics, metrics configuration, a /v1/tip API endpoint, and required service wiring across the node.

Changes

Block lifecycle and validation observability

Layer / File(s)	Summary
Metrics configuration types and defaults `crates/types/src/config/node.rs`, `crates/types/src/config/mod.ts`, `crates/config/templates/*`	New `MetricsConfig` struct with loopback default `127.0.0.1`; added to `NodeConfig` with serde defaulting; fixtures and integration tests verify defaults and overrides; config templates updated for mainnet and testnet.
Metrics declarations and recording helpers `crates/actors/src/metrics.rs`, `crates/p2p/src/metrics.rs`	Block-tree lifecycle gauges (`last_block_at_ms`, `last_reorg_at_ms`), mempool anchor-unresolvable gauge, and block-pool rejection/orphan-retry counters declared; `pub(crate)` recording helpers with i64-to-u64 clamping and unit tests.
Block-tree lifecycle timestamp tracking `crates/actors/src/block_tree_service.rs`	New `BlockTreeLifecycleTimestamps` struct with atomic unix-ms fields and public accessors; wired into `BlockTreeServiceInner` and `spawn_service`; `note_reorg()` and `note_canonical_advance()` called at respective events; unit tests verify timestamp updates from zero initial state.
Block cache removal reason tracking `crates/domain/src/models/block_tree.rs`	New `RemovalReason` enum with variants for validation invalid, prune depth, and recursive descendants; `as_log_value()` provides stable structured-log output including parent-hash for recursive removals; `remove_block` signature updated to require reason; prune and validation paths updated; unit tests validate reason strings and parent-hash propagation.
Mempool anchor-resolution monitoring `crates/actors/src/mempool_service.rs`, `crates/actors/src/mempool_service/lifecycle.rs`	New `count_unresolvable_submit_anchors()` method counts pending Submit txs with anchors resolving to `Ok(None)`, treating resolution errors as resolvable; 30-second tokio interval sweep spawned in `spawn_service`, skipping first tick, updating gauge; sweep aborted on startup failure.
Block pool rejection metrics `crates/p2p/src/block_pool.rs`	Labeled rejection metrics emitted for `already_processed`, `part_of_pruned_fork`, `reth_payload_unavailable`, `fatal_cache_corruption`, `parent_not_in_cache`, `internal_prevalidation_failure`, `block_validation_error`, and `try_reprocess_finalized` paths in `process_block` and `check_block_status`; orphan-retry metric recorded on successful reprocess dispatch.
GET /v1/tip API endpoint `crates/api-server/src/routes/tip.rs`, `crates/api-server/src/routes/mod.rs`, `crates/api-server/src/lib.rs`	New `TipResponse` struct with hash, height, cumulative diff, parent hash, and two lifecycle timestamps; `tip_route` handler reads canonical tip and lifecycle data, returns appropriate `ApiError` responses or JSON; route registered at `GET /v1/tip`; serialization and round-trip JSON tests included.
Node context and service initialization `crates/chain/src/chain.rs`, `crates/reth-node-bridge/src/node.rs`, `crates/reth-node-bridge/Cargo.toml`	`IrysNodeCtx` adds shared `block_tree_lifecycle: Arc<BlockTreeLifecycleTimestamps>`; initialized in `init_services()` and passed to `BlockTreeService::spawn_service()`, then propagated into `ApiState`; Reth bridge refactored to compute metrics address from `node_config.metrics.bind_ip` with port selection logic; unit tests validate bind-address isolation and random-ports behavior; `irys-types` with `test-utils` added to dev dependencies.
Observability documentation `docs/99-reference/05-tracing-guidelines.md`	New section documenting `#[instrument(err)]` source-location footgun: OTEL records function-entry location, not error-construction site; recommends explicit `tracing::error!()` at error point for fatal block/validation failures with code example.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Irys-xyz/irys#1421: Both PRs modify on_block_validation_finished invalid-validation handling in block-tree service (metric routing, cache removal, logging behavior).
Irys-xyz/irys#1425: Both PRs update validation-finalization and discard handling in on_block_validation_finished (metrics classification, rejection behavior).
Irys-xyz/irys#1416: Both PRs modify on_block_validation_finished flow for validation-error classification and metric labeling.

Suggested reviewers

JesseTheRobot

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: obs follow-ups and Reth metrics bind hardening' directly addresses the two main objectives: observability improvements (obs follow-ups) and metrics endpoint hardening (Reth metrics bind hardening).
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch jason/obs_improvements

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/p2p/src/block_pool.rs (1)
769-773: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Move orphan-retry metric increment after successful dispatch.

Line 769 increments record_block_pool_orphan_retry() before attempting send(...). If the send fails, the metric still counts a retry that was never dispatched. Increment only after a successful send to keep this counter accurate.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/p2p/src/block_pool.rs` around lines 769 - 773, The orphan-retry metric
is incremented before the dispatch, which can falsely count failed sends; move
the call to crate::metrics::record_block_pool_orphan_retry() so it runs only
after self.sync_service_sender.send(...) succeeds. Concretely, in the block
handling SyncChainServiceMessage::AttemptReprocessingBlock(prev_block_hash),
perform the send via self.sync_service_sender.send(...), check the Result (the
Err branch should log the error as before), and call
crate::metrics::record_block_pool_orphan_retry() only in the Ok/success path
(after a successful send) to ensure the metric reflects dispatched retries.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/actors/benches/anchor_resolution.md`:
- Line 8: The metric name in this document uses the underscored form
irys_mempool_pending_submit_unresolvable_anchors but should use the canonical
dotted form irys.mempool.pending_submit_unresolvable_anchors; update every
occurrence (including the other occurrence referenced) to the dotted form so
queries and dashboards remain consistent, e.g., replace
irys_mempool_pending_submit_unresolvable_anchors with
irys.mempool.pending_submit_unresolvable_anchors wherever it appears.

In `@crates/actors/src/block_tree_service.rs`:
- Around line 99-103: The now_ms function uses an unchecked cast d.as_millis()
as i64; replace it with a checked conversion using i64::try_from (or TryInto) to
avoid truncation/overflow: after obtaining the Duration from
SystemTime::now().duration_since(UNIX_EPOCH) convert d.as_millis() with
i64::try_from(...).ok() (or .and_then(...)) and fall back to 0 on error, keeping
the same unwrap_or(0) semantics; update the conversion in the now_ms function to
use these checked conversions instead of the as cast.

In `@crates/actors/src/mempool_service.rs`:
- Around line 2515-2516: The monitoring interval in mempool_service.rs uses
tokio::time::interval (variable "interval") with the default Burst behavior;
change it to use a safer missed-tick behavior (e.g., MissedTickBehavior::Skip or
::Delay) to avoid rapid catch-up executions — create the interval as before then
call interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip)
(or ::Delay) before the initial interval.tick().await; reference the "interval"
variable and the tokio::time::MissedTickBehavior enum when applying the change.
- Line 2520: The cast `count as i64` is unsafe for runtime `usize` values;
replace it with a checked conversion using `i64::try_from(count)` and handle the
`Result` (for example, on `Err` cap to `i64::MAX` and emit a warning) before
calling `crate::metrics::set_anchor_unresolvable`; update the usage around
`set_anchor_unresolvable` and the `count` variable to perform `try_from` and a
clear error/overflow handling path instead of the `as` cast.
- Around line 2514-2522: The anchor-resolution sweep task currently spawned via
handle_for_inner.spawn (assigned to sweep_handle) lacks logging; add structured
logs: emit an info/debug at task startup indicating the sweep loop has started
(include interval duration), and emit periodic debug/info logs inside the loop
(e.g., after calling sweep_inner.count_unresolvable_submit_anchors()) that
report the retrieved count and any unexpected errors; reference the spawn block
(sweep_handle), the sweep_inner.count_unresolvable_submit_anchors() call, and
the crate::metrics::set_anchor_unresolvable() call so logs surround the metric
update and surface failures or anomalous counts.

In `@crates/reth-node-bridge/src/node.rs`:
- Around line 267-279: The tests are using NodeConfig::testnet() but should use
the testing constructor; update the test setup to call
irys_types::NodeConfig::testing() (and any ConsensusConfig::testing() if
present) instead of NodeConfig::testnet() so compute_metrics_addr(&cfg, ...)
gets a test-config; locate the occurrences in the tests around
compute_metrics_addr and replace NodeConfig::testnet() with
NodeConfig::testing().

---

Outside diff comments:
In `@crates/p2p/src/block_pool.rs`:
- Around line 769-773: The orphan-retry metric is incremented before the
dispatch, which can falsely count failed sends; move the call to
crate::metrics::record_block_pool_orphan_retry() so it runs only after
self.sync_service_sender.send(...) succeeds. Concretely, in the block handling
SyncChainServiceMessage::AttemptReprocessingBlock(prev_block_hash), perform the
send via self.sync_service_sender.send(...), check the Result (the Err branch
should log the error as before), and call
crate::metrics::record_block_pool_orphan_retry() only in the Ok/success path
(after a successful send) to ensure the metric reflects dispatched retries.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 220de0ce-086c-4a47-bd2f-5c4d9c972251

📥 Commits

Reviewing files that changed from the base of the PR and between 0aa2edf and 95ef690.

📒 Files selected for processing (19)

crates/actors/benches/anchor_resolution.md
crates/actors/src/block_tree_service.rs
crates/actors/src/block_validation.rs
crates/actors/src/mempool_service.rs
crates/actors/src/mempool_service/lifecycle.rs
crates/actors/src/metrics.rs
crates/api-server/src/lib.rs
crates/api-server/src/routes/mod.rs
crates/api-server/src/routes/tip.rs
crates/chain/src/chain.rs
crates/config/templates/mainnet_config.toml
crates/config/templates/testnet_config.toml
crates/domain/src/models/block_tree.rs
crates/p2p/src/block_pool.rs
crates/p2p/src/metrics.rs
crates/reth-node-bridge/src/node.rs
crates/types/src/config/mod.rs
crates/types/src/config/node.rs
docs/99-reference/05-tracing-guidelines.md

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/actors/src/metrics.rs`:
- Around line 133-139: The helpers record_block_tree_last_block_at_ms and
record_block_tree_last_reorg_at_ms (and the other similar helpers around lines
271-273) use runtime `as u64` casts; replace those casts with a checked
conversion using u64::try_from (or TryInto) on the non-negative i64 value and
handle the Result explicitly (e.g., map Err to 0 or log and return) before
calling .record(...), so you avoid unchecked `as` casts in production metric
helpers and update the same pattern in the other functions mentioned.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c031b9aa-9223-45cb-a342-f81a91ea0658

📥 Commits

Reviewing files that changed from the base of the PR and between 95ef690 and cead573.

📒 Files selected for processing (19)

crates/actors/benches/anchor_resolution.md
crates/actors/src/block_tree_service.rs
crates/actors/src/mempool_service.rs
crates/actors/src/mempool_service/lifecycle.rs
crates/actors/src/metrics.rs
crates/api-server/src/lib.rs
crates/api-server/src/routes/mod.rs
crates/api-server/src/routes/tip.rs
crates/chain/src/chain.rs
crates/config/templates/mainnet_config.toml
crates/config/templates/testnet_config.toml
crates/domain/src/models/block_tree.rs
crates/p2p/src/block_pool.rs
crates/p2p/src/metrics.rs
crates/reth-node-bridge/Cargo.toml
crates/reth-node-bridge/src/node.rs
crates/types/src/config/mod.rs
crates/types/src/config/node.rs
docs/99-reference/05-tracing-guidelines.md

JesseTheRobot

LGTM

github-actions · 2026-05-26T20:35:05Z

Benchmark results: https://irys-xyz.github.io/irys/dev/bench/1426%2Fmerge/index.html

glottologist requested a review from JesseTheRobot May 26, 2026 14:35

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

glottologist added 2 commits May 26, 2026 15:51

feat(obs): land block-tree, mempool, and block-pool primitives

f63b6ce

feat(metrics): default Reth Prometheus endpoint to 127.0.0.1

2c57260

glottologist force-pushed the jason/obs_improvements branch from 95ef690 to b8cafa8 Compare May 26, 2026 15:09

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

Comment thread crates/actors/src/metrics.rs

fix(metrics): restore pub on record_chain_sync_block_rejected

36373d0

glottologist force-pushed the jason/obs_improvements branch 2 times, most recently from 97aefeb to 36373d0 Compare May 26, 2026 17:38

JesseTheRobot approved these changes May 26, 2026

View reviewed changes

glottologist merged commit 30a6ac6 into master May 27, 2026
22 of 31 checks passed

glottologist deleted the jason/obs_improvements branch May 27, 2026 07:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: obs follow-ups and Reth metrics bind hardening#1426

feat: obs follow-ups and Reth metrics bind hardening#1426
glottologist merged 3 commits into
masterfrom
jason/obs_improvements

glottologist commented May 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 26, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

JesseTheRobot left a comment

Uh oh!

github-actions Bot commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

glottologist commented May 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JesseTheRobot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

glottologist commented May 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 26, 2026 •

edited

Loading