Skip to content

feat: obs follow-ups and Reth metrics bind hardening#1426

Merged
glottologist merged 3 commits into
masterfrom
jason/obs_improvements
May 27, 2026
Merged

feat: obs follow-ups and Reth metrics bind hardening#1426
glottologist merged 3 commits into
masterfrom
jason/obs_improvements

Conversation

@glottologist

@glottologist glottologist commented May 26, 2026

Copy link
Copy Markdown
Contributor

Describe the changes
Before: Observability gaps in block-tree/mempool/block-pool surfaced by the 2026-05-23 fork investigation; Reth Prometheus endpoint bound 0.0.0.0 by default.

After: New OTEL counters/gauges, /v1/tip endpoint, and a [metrics] bind_ip config section defaulting to 127.0.0.1.

Related Issue(s)
Please link to the issue(s) that will be closed with this PR.

Checklist

  • Tests have been added/updated for the changes.
  • Documentation has been updated for the changes (if applicable).
  • The code follows Rust's style guidelines.

Additional Context
Add any other context about the pull request here.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added block lifecycle tracking to record canonical block advances and reorganizations with timestamps.
    • Introduced /v1/tip API endpoint returning current block information with lifecycle metrics.
    • Added monitoring for unresolvable submit anchors in the mempool.
  • Chores

    • Implemented OpenTelemetry metrics for block lifecycle timestamps and mempool monitoring.
    • Added block pool rejection and retry tracking metrics.
    • Updated configuration templates with metrics endpoint settings.
  • Documentation

    • Added debugging guidelines for error tracing patterns.

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6d03bdb6-a3ea-47cf-86d1-9a4174d3fdfe

📥 Commits

Reviewing files that changed from the base of the PR and between 97aefeb and 36373d0.

📒 Files selected for processing (6)
  • crates/actors/src/block_tree_service.rs
  • crates/actors/src/mempool_service.rs
  • crates/actors/src/metrics.rs
  • crates/p2p/src/block_pool.rs
  • crates/reth-node-bridge/Cargo.toml
  • crates/reth-node-bridge/src/node.rs

📝 Walkthrough

Walkthrough

This PR establishes comprehensive observability infrastructure: block-tree lifecycle timestamps, block removal categorization, mempool anchor-resolution monitoring, block-pool rejection metrics, metrics configuration, a /v1/tip API endpoint, and required service wiring across the node.

Changes

Block lifecycle and validation observability

Layer / File(s) Summary
Metrics configuration types and defaults
crates/types/src/config/node.rs, crates/types/src/config/mod.ts, crates/config/templates/*
New MetricsConfig struct with loopback default 127.0.0.1; added to NodeConfig with serde defaulting; fixtures and integration tests verify defaults and overrides; config templates updated for mainnet and testnet.
Metrics declarations and recording helpers
crates/actors/src/metrics.rs, crates/p2p/src/metrics.rs
Block-tree lifecycle gauges (last_block_at_ms, last_reorg_at_ms), mempool anchor-unresolvable gauge, and block-pool rejection/orphan-retry counters declared; pub(crate) recording helpers with i64-to-u64 clamping and unit tests.
Block-tree lifecycle timestamp tracking
crates/actors/src/block_tree_service.rs
New BlockTreeLifecycleTimestamps struct with atomic unix-ms fields and public accessors; wired into BlockTreeServiceInner and spawn_service; note_reorg() and note_canonical_advance() called at respective events; unit tests verify timestamp updates from zero initial state.
Block cache removal reason tracking
crates/domain/src/models/block_tree.rs
New RemovalReason enum with variants for validation invalid, prune depth, and recursive descendants; as_log_value() provides stable structured-log output including parent-hash for recursive removals; remove_block signature updated to require reason; prune and validation paths updated; unit tests validate reason strings and parent-hash propagation.
Mempool anchor-resolution monitoring
crates/actors/src/mempool_service.rs, crates/actors/src/mempool_service/lifecycle.rs
New count_unresolvable_submit_anchors() method counts pending Submit txs with anchors resolving to Ok(None), treating resolution errors as resolvable; 30-second tokio interval sweep spawned in spawn_service, skipping first tick, updating gauge; sweep aborted on startup failure.
Block pool rejection metrics
crates/p2p/src/block_pool.rs
Labeled rejection metrics emitted for already_processed, part_of_pruned_fork, reth_payload_unavailable, fatal_cache_corruption, parent_not_in_cache, internal_prevalidation_failure, block_validation_error, and try_reprocess_finalized paths in process_block and check_block_status; orphan-retry metric recorded on successful reprocess dispatch.
GET /v1/tip API endpoint
crates/api-server/src/routes/tip.rs, crates/api-server/src/routes/mod.rs, crates/api-server/src/lib.rs
New TipResponse struct with hash, height, cumulative diff, parent hash, and two lifecycle timestamps; tip_route handler reads canonical tip and lifecycle data, returns appropriate ApiError responses or JSON; route registered at GET /v1/tip; serialization and round-trip JSON tests included.
Node context and service initialization
crates/chain/src/chain.rs, crates/reth-node-bridge/src/node.rs, crates/reth-node-bridge/Cargo.toml
IrysNodeCtx adds shared block_tree_lifecycle: Arc<BlockTreeLifecycleTimestamps>; initialized in init_services() and passed to BlockTreeService::spawn_service(), then propagated into ApiState; Reth bridge refactored to compute metrics address from node_config.metrics.bind_ip with port selection logic; unit tests validate bind-address isolation and random-ports behavior; irys-types with test-utils added to dev dependencies.
Observability documentation
docs/99-reference/05-tracing-guidelines.md
New section documenting #[instrument(err)] source-location footgun: OTEL records function-entry location, not error-construction site; recommends explicit tracing::error!() at error point for fatal block/validation failures with code example.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • Irys-xyz/irys#1421: Both PRs modify on_block_validation_finished invalid-validation handling in block-tree service (metric routing, cache removal, logging behavior).
  • Irys-xyz/irys#1425: Both PRs update validation-finalization and discard handling in on_block_validation_finished (metrics classification, rejection behavior).
  • Irys-xyz/irys#1416: Both PRs modify on_block_validation_finished flow for validation-error classification and metric labeling.

Suggested reviewers

  • JesseTheRobot
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: obs follow-ups and Reth metrics bind hardening' directly addresses the two main objectives: observability improvements (obs follow-ups) and metrics endpoint hardening (Reth metrics bind hardening).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jason/obs_improvements

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/p2p/src/block_pool.rs (1)

769-773: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Move orphan-retry metric increment after successful dispatch.

Line 769 increments record_block_pool_orphan_retry() before attempting send(...). If the send fails, the metric still counts a retry that was never dispatched. Increment only after a successful send to keep this counter accurate.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/p2p/src/block_pool.rs` around lines 769 - 773, The orphan-retry metric
is incremented before the dispatch, which can falsely count failed sends; move
the call to crate::metrics::record_block_pool_orphan_retry() so it runs only
after self.sync_service_sender.send(...) succeeds. Concretely, in the block
handling SyncChainServiceMessage::AttemptReprocessingBlock(prev_block_hash),
perform the send via self.sync_service_sender.send(...), check the Result (the
Err branch should log the error as before), and call
crate::metrics::record_block_pool_orphan_retry() only in the Ok/success path
(after a successful send) to ensure the metric reflects dispatched retries.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/actors/benches/anchor_resolution.md`:
- Line 8: The metric name in this document uses the underscored form
irys_mempool_pending_submit_unresolvable_anchors but should use the canonical
dotted form irys.mempool.pending_submit_unresolvable_anchors; update every
occurrence (including the other occurrence referenced) to the dotted form so
queries and dashboards remain consistent, e.g., replace
irys_mempool_pending_submit_unresolvable_anchors with
irys.mempool.pending_submit_unresolvable_anchors wherever it appears.

In `@crates/actors/src/block_tree_service.rs`:
- Around line 99-103: The now_ms function uses an unchecked cast d.as_millis()
as i64; replace it with a checked conversion using i64::try_from (or TryInto) to
avoid truncation/overflow: after obtaining the Duration from
SystemTime::now().duration_since(UNIX_EPOCH) convert d.as_millis() with
i64::try_from(...).ok() (or .and_then(...)) and fall back to 0 on error, keeping
the same unwrap_or(0) semantics; update the conversion in the now_ms function to
use these checked conversions instead of the as cast.

In `@crates/actors/src/mempool_service.rs`:
- Around line 2515-2516: The monitoring interval in mempool_service.rs uses
tokio::time::interval (variable "interval") with the default Burst behavior;
change it to use a safer missed-tick behavior (e.g., MissedTickBehavior::Skip or
::Delay) to avoid rapid catch-up executions — create the interval as before then
call interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip)
(or ::Delay) before the initial interval.tick().await; reference the "interval"
variable and the tokio::time::MissedTickBehavior enum when applying the change.
- Line 2520: The cast `count as i64` is unsafe for runtime `usize` values;
replace it with a checked conversion using `i64::try_from(count)` and handle the
`Result` (for example, on `Err` cap to `i64::MAX` and emit a warning) before
calling `crate::metrics::set_anchor_unresolvable`; update the usage around
`set_anchor_unresolvable` and the `count` variable to perform `try_from` and a
clear error/overflow handling path instead of the `as` cast.
- Around line 2514-2522: The anchor-resolution sweep task currently spawned via
handle_for_inner.spawn (assigned to sweep_handle) lacks logging; add structured
logs: emit an info/debug at task startup indicating the sweep loop has started
(include interval duration), and emit periodic debug/info logs inside the loop
(e.g., after calling sweep_inner.count_unresolvable_submit_anchors()) that
report the retrieved count and any unexpected errors; reference the spawn block
(sweep_handle), the sweep_inner.count_unresolvable_submit_anchors() call, and
the crate::metrics::set_anchor_unresolvable() call so logs surround the metric
update and surface failures or anomalous counts.

In `@crates/reth-node-bridge/src/node.rs`:
- Around line 267-279: The tests are using NodeConfig::testnet() but should use
the testing constructor; update the test setup to call
irys_types::NodeConfig::testing() (and any ConsensusConfig::testing() if
present) instead of NodeConfig::testnet() so compute_metrics_addr(&cfg, ...)
gets a test-config; locate the occurrences in the tests around
compute_metrics_addr and replace NodeConfig::testnet() with
NodeConfig::testing().

---

Outside diff comments:
In `@crates/p2p/src/block_pool.rs`:
- Around line 769-773: The orphan-retry metric is incremented before the
dispatch, which can falsely count failed sends; move the call to
crate::metrics::record_block_pool_orphan_retry() so it runs only after
self.sync_service_sender.send(...) succeeds. Concretely, in the block handling
SyncChainServiceMessage::AttemptReprocessingBlock(prev_block_hash), perform the
send via self.sync_service_sender.send(...), check the Result (the Err branch
should log the error as before), and call
crate::metrics::record_block_pool_orphan_retry() only in the Ok/success path
(after a successful send) to ensure the metric reflects dispatched retries.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 220de0ce-086c-4a47-bd2f-5c4d9c972251

📥 Commits

Reviewing files that changed from the base of the PR and between 0aa2edf and 95ef690.

📒 Files selected for processing (19)
  • crates/actors/benches/anchor_resolution.md
  • crates/actors/src/block_tree_service.rs
  • crates/actors/src/block_validation.rs
  • crates/actors/src/mempool_service.rs
  • crates/actors/src/mempool_service/lifecycle.rs
  • crates/actors/src/metrics.rs
  • crates/api-server/src/lib.rs
  • crates/api-server/src/routes/mod.rs
  • crates/api-server/src/routes/tip.rs
  • crates/chain/src/chain.rs
  • crates/config/templates/mainnet_config.toml
  • crates/config/templates/testnet_config.toml
  • crates/domain/src/models/block_tree.rs
  • crates/p2p/src/block_pool.rs
  • crates/p2p/src/metrics.rs
  • crates/reth-node-bridge/src/node.rs
  • crates/types/src/config/mod.rs
  • crates/types/src/config/node.rs
  • docs/99-reference/05-tracing-guidelines.md

Comment thread crates/actors/benches/anchor_resolution.md Outdated
Comment thread crates/actors/src/block_tree_service.rs
Comment thread crates/actors/src/mempool_service.rs
Comment thread crates/actors/src/mempool_service.rs
Comment thread crates/actors/src/mempool_service.rs Outdated
Comment thread crates/reth-node-bridge/src/node.rs Outdated
@glottologist glottologist force-pushed the jason/obs_improvements branch from 95ef690 to b8cafa8 Compare May 26, 2026 15:09

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/actors/src/metrics.rs`:
- Around line 133-139: The helpers record_block_tree_last_block_at_ms and
record_block_tree_last_reorg_at_ms (and the other similar helpers around lines
271-273) use runtime `as u64` casts; replace those casts with a checked
conversion using u64::try_from (or TryInto) on the non-negative i64 value and
handle the Result explicitly (e.g., map Err to 0 or log and return) before
calling .record(...), so you avoid unchecked `as` casts in production metric
helpers and update the same pattern in the other functions mentioned.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c031b9aa-9223-45cb-a342-f81a91ea0658

📥 Commits

Reviewing files that changed from the base of the PR and between 95ef690 and cead573.

📒 Files selected for processing (19)
  • crates/actors/benches/anchor_resolution.md
  • crates/actors/src/block_tree_service.rs
  • crates/actors/src/mempool_service.rs
  • crates/actors/src/mempool_service/lifecycle.rs
  • crates/actors/src/metrics.rs
  • crates/api-server/src/lib.rs
  • crates/api-server/src/routes/mod.rs
  • crates/api-server/src/routes/tip.rs
  • crates/chain/src/chain.rs
  • crates/config/templates/mainnet_config.toml
  • crates/config/templates/testnet_config.toml
  • crates/domain/src/models/block_tree.rs
  • crates/p2p/src/block_pool.rs
  • crates/p2p/src/metrics.rs
  • crates/reth-node-bridge/Cargo.toml
  • crates/reth-node-bridge/src/node.rs
  • crates/types/src/config/mod.rs
  • crates/types/src/config/node.rs
  • docs/99-reference/05-tracing-guidelines.md

Comment thread crates/actors/src/metrics.rs
@glottologist glottologist force-pushed the jason/obs_improvements branch 2 times, most recently from 97aefeb to 36373d0 Compare May 26, 2026 17:38

@JesseTheRobot JesseTheRobot left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions

Copy link
Copy Markdown
Contributor

@glottologist glottologist merged commit 30a6ac6 into master May 27, 2026
22 of 31 checks passed
@glottologist glottologist deleted the jason/obs_improvements branch May 27, 2026 07:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants