feat: add cross-node request lifecycle tracing and observability improvements by glottologist · Pull Request #1156 · Irys-xyz/irys

glottologist · 2026-02-20T14:49:13Z

Describe the changes
Before:
Each node was traced independently with no correlation across node boundaries. Gossip HTTP requests carried no trace context, making it impossible to follow a request from ingestion through validation. JSON structured logging was commented out and unavailable.

After:
Implements trace context propagation on all gossip HTTP requests, explicit RequestId (UUID v7) on GossipRequestV2 messages, tracing::Span propagation through all inter-service message enums, and runtime-selectable JSON structured logging via IRYS_LOG_FORMAT=json.

Changes

Cross-Node Trace Propagation (`crates/p2p`)

Inject traceparent/tracestate headers on all outbound gossip HTTP requests via inject_trace_context() using OpenTelemetry's TextMapPropagator
Register TraceContextPropagator globally in telemetry.rs
Add HeaderInjector adapter for reqwest::header::HeaderMap
Apply trace injection to all gossip client methods: get_info, request_peer_list, handshake_v1/v2, request_block_index, get_protocol_version, health_check, send_preserialized, send_data_internal

Request ID (`crates/types`)

Add RequestId type —UUID v7
Add Option<RequestId> field to GossipRequestV2 with #[serde(default)] for backward compatibility with older peers
Server-side gossip handlers generate a RequestId when not present, then create named spans (gossip.handle_chunk, gossip.handle_block_header, etc.) with the ID as a span field

Intra-Node Span Propagation (`crates/actors`)

Add span: tracing::Span and request_id: Option<RequestId> fields to:
- BlockDiscoveryMessage::BlockDiscovered
- BlockTreeServiceMessage::BlockPreValidated
- BlockTreeServiceMessage::BlockValidationFinished
- ValidationServiceMessage::ValidateBlock
- MempoolServiceMessage::IngestDataTxFromApi/Gossip
- MempoolServiceMessage::IngestCommitmentTxFromApi/Gossip
Receivers create child spans with parent: &parent_span linking the full request lifecycle
BlockValidationTask carries parent_span and request_id through VDF scheduling and concurrent validation

JSON Structured Logging (`crates/chain`, `crates/utils`)

Add IRYS_LOG_FORMAT=json runtime toggle in both init_tracing() (non-telemetry path) and setup_tracing_subscriber() (telemetry path)
JSON output includes with_current_span(true) and with_span_list(true) for full span context in each log line

Design Decision: Why `RequestId` Instead of Tracing Span IDs

This PR uses both an explicit RequestId and tracing::Span propagation. They serve different purposes and are not interchangeable. Tracing span IDs have properties that make them unsuitable as cross-node request correlation keys.

Property	Tracing Span ID	RequestId
Globally unique	No (slab index, reused)	Yes (UUID v7, 128-bit)
Serialisable across nodes	No (`Id` lacks `Serialise`)	Yes
Appears in log output	No (not rendered by fmt/json)	Yes (recorded as span field)
Time-ordered / sortable	No	Yes (UUID v7 embeds ms timestamp)
Survives channel sends	No (implicit context lost)	Yes (explicit field)

Why not reuse span IDs as request IDs?

Not globally unique: Span IDs are sharded_slab::Pool indices (small sequential integers starting at 1). Slots are reused after spans are dropped. Two nodes produce overlapping ID spaces, and the same node reuses IDs over time.
Not serialisable: tracing_core::span::Id derives Clone, Debug, PartialEq, Eq, Hash only — no Serialise/Deserialise/Display. The inner u64 is accessible via into_u64() but is a raw slab index with no semantic meaning outside the originating subscriber.
Not rendered in logs: The tracing-subscriber JSON formatter's SerializableSpan serialises span name and fields but not the span ID. There is no built-in way to grep for a span ID in log output.
No embedded timestamp: Span IDs are slab indices — monotonically assigned but recycled. RequestId uses UUID v7 with a 48-bit millisecond timestamp, making IDs naturally time-ordered and sortable.
Implicit propagation only: The current span is thread-local state — tokio::spawn does not propagate it automatically (requires .instrument()), and channel sends do not carry it.

Why not derive RequestId from the span ID (e.g. embed it in the constructor)?

Embedding the span ID (a small integer like 1, 2, 47) in place of the random bytes would sacrifice UUID v7's global uniqueness guarantee. Two nodes creating a RequestId at the same millisecond with the same slab index would collide.
The span that created the RequestId may have been dropped and its slab slot reused by the time someone reads the ID on a remote node. Extracting the span ID from the RequestId would point at the wrong (or no) span.
The correlation already exists in the other direction: RequestId is recorded as a span field (request.id = %request_id), so every span already carries its RequestId. You can search from span → RequestId (via the field) and from RequestId → spans (via grep/trace search). There is no additional benefit to embedding the span ID in the RequestId.

Related Issue(s)
Please link to the issue(s) that will be closed with this PR.

Checklist

Tests have been added/updated for the changes.
Documentation has been updated for the changes (if applicable).
The code follows Rust's style guidelines.

Additional Context
Add any other context about the pull request here.

Summary by CodeRabbit

New Features
- System-wide propagation of per-request tracing context (spans) for end-to-end observability.
- HTTP request metrics middleware capturing latency and request counts.
Improvements
- Broader trace context propagation across validation, mempool, p2p/gossip, and block processing flows, plus outbound trace headers.
- Dashboard linking: traces→logs integration and tracing/telemetry initialization refinements.
- Added operational metrics (including request duration, request counts, and a new head-height metric).
Other
- Minor docs, tests, and internal cleanup.

…macro

coderabbitai · 2026-02-20T14:49:19Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Propagates tracing::Span across actor messages, validation tasks, mempool ingress, API routes, and P2P gossip; adds HTTP request metrics middleware, OpenTelemetry trace-context propagation, UUIDv7 trace ID generation, metric adjustments, and related test updates.

Changes

Cohort / File(s)	Summary
Actor span propagation `crates/actors/src/block_discovery.rs`, `crates/actors/src/block_tree_service.rs`, `crates/actors/src/validation_service.rs`, `crates/actors/src/validation_service/block_validation_task.rs`	Added `tracing::Span` to message enums and task structs (BlockDiscovered, BlockPreValidated, BlockValidationFinished, ValidateBlock, BlockValidationTask.parent_span); threaded spans through handlers and instrumented key operations with parent spans.
Mempool actor & facade `crates/actors/src/mempool_service.rs`, `crates/actors/src/mempool_service/facade.rs`	Extended four ingest message variants to include `tracing::Span`; updated variant matching, handlers, and facade to supply `Span::current()`.
API server metrics & routes `crates/api-server/src/metrics.rs`, `crates/api-server/src/lib.rs`, `crates/api-server/src/routes/commitment.rs`, `crates/api-server/src/routes/tx.rs`	Added `RequestMetrics` Actix middleware, `REQUEST_DURATION_MS` and `REQUESTS_TOTAL` metrics; wired middleware into server startup; API routes now attach `tracing::Span::current()` to mempool messages.
P2P trace propagation & request conversion `crates/p2p/src/gossip_client.rs`, `crates/p2p/src/server.rs`, `crates/types/src/gossip.rs`, `crates/p2p/Cargo.toml`	Introduced OpenTelemetry header injection helpers (`HeaderInjector`, `inject_trace_context`, `traced_headers`) and applied traced headers to outbound HTTP calls; centralized V1→V2 request conversion via `GossipRequestV1::into_v2`, and added workspace flag for `tracing-opentelemetry`.
Telemetry infra & ID generation `crates/utils/utils/src/lib.rs`, `crates/utils/utils/src/telemetry.rs`, `crates/chain/src/main.rs`	Added `make_fmt_layer()`, registered TraceContext propagator, introduced private `UuidV7IdGenerator` for trace IDs, and changed tracing subscriber initialization to use the unified formatting layer.
Metrics, macros & small metric changes `crates/chain/src/metrics.rs`, `crates/utils/utils/src/metric_macros.rs`, `crates/p2p/src/metrics.rs`	Added `RETH_FCU_HEAD_HEIGHT` and `record_peer_fetch_error`; removed public `MetricDescriptor` and emitted `METRICS` descriptors from macro output; changed advisory metric label to boolean in some metrics.
Validation & block-tree wiring `crates/actors/src/block_tree_service.rs`, `crates/actors/src/block_discovery.rs`, `crates/actors/src/validation_service.rs`	Block pre-validation and validation finished messaging now carry spans; handlers instrumented to propagate spans into validation and reporting paths.
Tests & test helpers `crates/chain/tests/*`, `crates/chain/tests/utils.rs`	Updated many tests and helpers to pass `tracing::Span::current()` to new mempool/validation message variants and helpers.
Chunk ingress & storage ordering `crates/actors/src/chunk_ingress_service/chunks.rs`, `crates/actors/src/chunk_ingress_service/chunk_data_writer.rs`	Moved to write-behind queue for chunk writes, adjusted flush timing, and removed/updated explanatory comments; no behavioral API changes.
Minor efficiency and cleanup `crates/p2p/src/gossip_data_handler.rs`, `crates/database/src/database.rs`	Optimized Arc handling with `Arc::unwrap_or_clone`, updated comments, and guidance for cache DB recovery; no logic changes.
Manifests & deps `crates/utils/utils/Cargo.toml`, `Cargo.toml`, `crates/p2p/Cargo.toml`	Adjusted workspace dependency flags, added optional `rand` workspace dep, and updated `rand` version/features in root Cargo.toml.
Observability configs `docker/observation/configs/grafana/...`, `docker/observation/configs/otel-collector/config.yaml`	Added Tempo tracesToLogs linking for Elasticsearch; removed spanmetrics exporter/processor from otel-collector pipelines.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API_Server
    participant Mempool
    participant Validation
    participant BlockTree
    participant P2P
    rect rgba(0,128,255,0.5)
    Client->>API_Server: HTTP request (creates tracing span)
    API_Server->>Mempool: Ingest*Tx(..., span=Span::current())
    end
    rect rgba(0,200,100,0.5)
    Mempool->>Validation: ValidateBlock(block, skip_vdf, span)
    Validation->>BlockTree: BlockValidationFinished(hash, result, span)
    end
    rect rgba(200,100,0,0.5)
    BlockTree->>P2P: Gossip outbound (attach traced headers)
    P2P->>External: HTTP call with traced headers (inject_trace_context)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

refactor: extract ChunkIngressService from mempool #1155 — touches the actors/mempool messaging surface and may overlap with the message-variant changes here.

Suggested reviewers

antouhou
DanMacDonald
glottologist

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the main changes: adding cross-node request lifecycle tracing and observability enhancements, which aligns with the substantial refactoring of span propagation, request ID threading, and JSON logging throughout the codebase.
Docstring Coverage	✅ Passed	Docstring coverage is 97.18% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch jason/enhanced_tracing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)

docker/observation/configs/otel-collector/config.yaml (2)
62-63: 🧹 Nitpick | 🔵 Trivial

Unused debug exporter — remove or comment out.

The debug exporter is declared but never added to any pipeline, so it is silently disabled. Configuring an exporter does not enable it; exporters are enabled by adding them to the appropriate pipelines within the service section. Either wire it into a pipeline when troubleshooting is needed, or remove it to keep the configuration free of dead declarations.
🧹 Proposed cleanup
-  # Debug exporter for troubleshooting (optional)
-  debug:
-    verbosity: basic
-
 service:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/observation/configs/otel-collector/config.yaml` around lines 62 - 63,
The config declares a debug exporter block named "debug" but never uses it;
either remove or comment out the "debug:" exporter stanza or add "debug" to the
appropriate pipeline(s) under the service.pipelines section so it becomes active
during troubleshooting; locate the "debug" exporter key in the config and either
delete/comment that block or add "debug" to the desired pipeline's exporters
list (e.g., service.pipelines.<your_pipeline>.exporters).
19-26: ⚠️ Potential issue | 🟠 Major

Remove the now-orphaned spanmetrics connector definition.

The spanmetrics connector was removed from both its pipeline roles — traces.exporters (line 75) and metrics.receivers (line 77) — but its full definition still lives in the connectors: block. Configuring a connector doesn't enable it; connectors are enabled through pipelines within the service section. So the collector won't error on startup today. However, leaving the definition in place is dangerous: each connector MUST be used as both, in separate pipelines, so if a future edit re-wires spanmetrics into only one pipeline role (e.g., only back into traces.exporters), the collector will fail at runtime with "failed to build pipelines: connector spanmetrics used as exporter … but not used in any supported receiver pipeline". This class of error is confirmed by real-world collector failures documented in the upstream tracker.

Delete the entire connectors: section since no other connector is defined.
🗑️ Proposed fix: remove orphaned connector block
-connectors:
-  spanmetrics:
-    histogram:
-      explicit:
-        buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s, 30s]
-    dimensions:
-      - name: service.name
-    metrics_expiration: 5m
-
 processors:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/observation/configs/otel-collector/config.yaml` around lines 19 - 26,
Remove the orphaned connectors block that defines the spanmetrics connector:
delete the entire connectors: section (which contains spanmetrics with
histogram, dimensions, and metrics_expiration) since spanmetrics is no longer
referenced by traces.exporters or metrics.receivers; this prevents a future
runtime failure where a connector is used only as an exporter or receiver.
Ensure no other connectors remain before removing the connectors: block
entirely.
crates/utils/utils/src/metric_macros.rs (1)
1-20: 🧹 Nitpick | 🔵 Trivial

Remove orphaned @type_str macro arms — they are no longer invoked.

The three @type_str arms (counter, gauge, histogram) were previously used when the macro generated a METRICS descriptor array. After removal of that code path, these arms became dead code. Search of the entire codebase confirms no invocations of @type_str exist; all define_metrics! calls use only the top-level macro syntax.

The OpenTelemetry 0.31.0 builder API is correctly implemented: .u64_counter(), .u64_gauge(), .f64_histogram(), .with_description(), .with_boundaries(), and .build() all match the public API for that version.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/utils/utils/src/metric_macros.rs` around lines 1 - 20, Remove the dead
macro arms named `@type_str` from the `define_metrics!` macro in
metric_macros.rs: delete the three orphaned `@type_str` variants that handled
`counter`, `gauge`, and `histogram` since they are no longer referenced by any
`define_metrics!` invocation; keep the top-level macro arms that build
`.u64_counter()`, `.u64_gauge()`, `.f64_histogram()`, `.with_description()`,
`.with_boundaries()`, and `.build()` intact and run tests to ensure nothing else
relied on the removed arms.
crates/actors/src/validation_service.rs (1)
248-278: 🧹 Nitpick | 🔵 Trivial

Consider propagating the original request's parent_span through to BlockValidationFinished.

All three BlockValidationFinished construction sites use span: tracing::Span::current(), which captures the validation service's event-loop span rather than the original parent_span passed via ValidateBlock. As a result, BlockTreeService operations triggered by validation results appear as children of the validation service's loop span in traces — not of the original API/gossip request that initiated the validation.

The request_id correctly provides end-to-end correlation (per the PR's design rationale), so this doesn't affect functional correctness. But the span hierarchy is broken at this boundary: the full chain original_request_span → validation_stages → block_tree_handling is not represented.

If richer span hierarchies are desired, the fix is to propagate parent_span alongside request_id through process_vdf() and ConcurrentValidationResult:
♻️ Optional: carry parent_span through to BlockValidationFinished

In active_validations.rs, extend process_vdf() return type and ConcurrentValidationResult:
-) -> Option<(BlockHash, VdfValidationResult, Option<irys_types::RequestId>)>
+) -> Option<(BlockHash, VdfValidationResult, Option<irys_types::RequestId>, tracing::Span)>
 pub struct ConcurrentValidationResult {
     pub block_hash: BlockHash,
     pub validation_result: ValidationResult,
     pub request_id: Option<irys_types::RequestId>,
+    pub parent_span: tracing::Span,
 }
Then in validation_service.rs, replace span: tracing::Span::current() with the propagated span in all BlockValidationFinished sites.
Also applies to: 282-295
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/actors/src/validation_service.rs` around lines 248 - 278, The current
VDF handling in process_vdf()/ConcurrentValidationResult drops the original
request span so BlockValidationFinished uses tracing::Span::current() (the
validation loop span) instead of the originating parent_span from ValidateBlock;
update active_validations::process_vdf() and the ConcurrentValidationResult to
carry an Option<tracing::Span> (e.g., parent_span) alongside request_id and
validation result, modify the callers that build ConcurrentValidationResult to
attach the original parent_span from ValidateBlock, and in validation_service
when constructing BlockValidationFinished replace span: tracing::Span::current()
with the propagated span (or fallback to current() if None) so the
BlockTreeService sees the original request's span as the parent.
crates/p2p/src/block_pool.rs (2)
633-643: 🧹 Nitpick | 🔵 Trivial

request_id missing from #[instrument] span fields.

The skip_all attribute excludes request_id from the auto-generated span, so the request ID won't be visible in traces for the block processing span. Given the PR's explicit goal of per-request tracing correlation, adding it to fields(...) would be valuable.
🔭 Add `request_id` to instrument fields
 #[instrument(
     skip_all,
     target = "BlockPool",
-    fields(block.hash = ?block.header().block_hash, block.height = block.header().height),
+    fields(block.hash = ?block.header().block_hash, block.height = block.header().height, request_id = ?request_id),
 )]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/p2p/src/block_pool.rs` around lines 633 - 643, Add request_id to the
tracing span for process_block by including it in the #[instrument(...)] fields:
update the attribute on pub(crate) async fn process_block(...) to include
request_id = ?request_id alongside block.hash and block.height (e.g.,
fields(block.hash = ?block.header().block_hash, block.height =
block.header().height, request_id = ?request_id)); keep skip_all if you need to
avoid auto-capturing other args, but ensure request_id is explicitly listed so
traces include it.
1058-1109: 🧹 Nitpick | 🔵 Trivial

Background task drops span context — breaks the intra-node trace chain.

tokio::spawn at Line 1071 doesn't call .in_current_span(), so the spawned task detaches from the active OpenTelemetry span. Any traceparent propagated via the PR's W3C tracecontext machinery won't extend into this background payload fetch. This directly undermines the PR's intra-node span propagation goal.
🔭 Inherit span context in the background task

Add the import at the top of the file (if not already present):
+use tracing::Instrument as _;
Then instrument the spawned task:
-        tokio::spawn(async move {
+        tokio::spawn(async move {
             match Self::pull_and_seal_execution_payload(
                 &execution_payload_provider,
                 &chain_sync_sender,
                 evm_block_hash,
                 use_trusted_peers_only,
                 None,
                 request_id,
             )
             .await
             {
                 // ...
             }
-        });
+        }.in_current_span());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/p2p/src/block_pool.rs` around lines 1058 - 1109, The background task
spawned in pull_and_seal_execution_payload_in_background detaches from the
active tracing span because tokio::spawn is called without inheriting the
current span; fix by instrumenting the spawned future so it inherits the current
span (e.g., import tracing::Instrument if missing) and call .in_current_span()
(or .instrument(tracing::Span::current())) on the async block passed to
tokio::spawn for the task that uses execution_payload_provider,
gossip_broadcast_sender, and chain_sync_sender so the span context is preserved
into the background payload fetch.
crates/p2p/src/chain_sync.rs (1)
547-620: 🧹 Nitpick | 🔵 Trivial

Three updated task spawns in handle_message drop the span context.

RequestBlockFromTheNetwork (Line 557), PullPayloadFromTheNetwork (Line 579), and AttemptReprocessingBlock (Line 599) all spawn tasks that now propagate request_id but don't call .in_current_span(). The active OpenTelemetry span (and any injected traceparent) is lost at each spawn boundary. Instrument as _ is already imported (Line 23); the fix is identical for all three.
🔭 Propagate span context into all three message-handler spawns
 // RequestBlockFromTheNetwork handler
-                tokio::spawn(async move {
+                tokio::spawn(async move {
                     let result = inner
                         .request_parent_block(parent_block_hash, request_id)
                         .await;
                     if let Some(sender) = response {
                         if let Err(e) = sender.send(result) {
                             tracing::error!("Failed to send response: {:?}", e);
                         }
                     }
-                });
+                }.in_current_span());

 // PullPayloadFromTheNetwork handler
-                tokio::spawn(async move {
+                tokio::spawn(async move {
                     let result = inner
                         .gossip_data_handler
                         .pull_and_add_execution_payload_to_cache(
                             evm_block_hash,
                             use_trusted_peers_only,
                             request_id,
                         )
                         .await;
                     if let Err(e) = response.send(result) {
                         tracing::error!("Failed to send response: {:?}", e);
                     }
-                });
+                }.in_current_span());

 // AttemptReprocessingBlock handler
-                tokio::spawn(async move {
+                tokio::spawn(async move {
                     if let Some(cached_block) = inner.block_pool.get_cached_block(&block_hash).await
                     {
                         if let Err(e) = inner
                             .block_pool
                             .process_block(
                                 Arc::clone(&cached_block.block),
                                 cached_block.is_fast_tracking,
                                 cached_block.request_id,
                             )
                             .await
                         {
                             error!("Failed to reprocess block {:?}: {:?}", block_hash, e);
                         }
                     } else {
                         error!(
                             "Cannot reprocess block {:?} - not found in cache",
                             block_hash
                         );
                     }
-                });
+                }.in_current_span());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/p2p/src/chain_sync.rs` around lines 547 - 620, The spawned tasks for
SyncChainServiceMessage::RequestBlockFromTheNetwork,
::PullPayloadFromTheNetwork, and ::AttemptReprocessingBlock drop the current
tracing/OpenTelemetry span because they call tokio::spawn without
instrumentation; wrap each tokio::spawn(...) future with .in_current_span()
(Instrument as _ is already imported) so the request trace is propagated.
Concretely, apply .in_current_span() to the spawned futures that call
inner.request_parent_block(...),
inner.gossip_data_handler.pull_and_add_execution_payload_to_cache(...), and the
block reprocessing future that invokes inner.block_pool.process_block(...) so
the active span/traceparent is preserved across the spawn boundary.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/actors/src/validation_service/active_validations.rs`:
- Around line 354-359: The code creates two identical bindings request_id and
task_request_id from task.request_id inside the VdfValidationResult::Valid arm;
remove the duplicate by capturing request_id once (e.g., before the match or use
task.request_id directly in the Valid arm) and reference that single binding
wherever task_request_id was used (including the return near where request_id is
expected), ensuring all uses of task_request_id are updated to the single
request_id binding.

In `@crates/chain/src/main.rs`:
- Around line 90-112: The init_tracing code duplicates log-format selection
logic from telemetry.rs::setup_tracing_subscriber (reading IRYS_LOG_FORMAT and
building JSON vs plain fmt layers) and also omits
.with_span_events(FmtSpan::NONE) present in telemetry.rs; refactor by extracting
the shared formatter construction into a single helper (e.g., make_fmt_layer()
-> Box<dyn Layer<...> + Send + Sync>) placed in irys-utils, update
telemetry.rs::setup_tracing_subscriber and crates/chain::init_tracing to call
make_fmt_layer() and only handle selection/initialization there so both JSON and
plain-text options reuse the exact same configuration (including
.with_span_events(FmtSpan::NONE)) to remove duplication and keep behavior
consistent.

In `@crates/types/src/request_id.rs`:
- Around line 44-48: The Default impl for RequestId currently returns a fresh
random ID (impl Default for RequestId -> fn default() -> Self { Self::new() }),
which is surprising; add a brief doc comment directly above the Default impl (or
augment the RequestId struct docs) stating that Default::default() produces a
new unique/random RequestId each call and is not a sentinel/zero value so
callers should not rely on a stable deterministic default; reference RequestId
and Default::default() in the comment so future maintainers and users know the
intended behavior.

In `@crates/utils/utils/src/metric_macros.rs`:
- Around line 80-83: Delete the dead internal macro arms for `@type_str` in
metric_macros.rs: remove the three orphaned arms `(`@type_str` counter) => {
"counter" };`, `(`@type_str` gauge) => { "gauge" };`, and `(`@type_str` histogram)
=> { "histogram" };` because no macro in the codebase invokes `@type_str`; after
removal, run a build/check to ensure no remaining macro references to
`@type_str` exist and adjust any tests if they relied on the removed helper.

In `@docker/observation/configs/grafana/dashboards/block-production.json`:
- Line 419: The panels currently use static legendFormat values ("blocks/s",
"retries", "rebuilds", "reorgs") which collapses all per-job series into
identical legends when the $node template is set with includeAll:true and
allValue:".*"; update each panel's legendFormat to include the job variable so
each series is distinguishable (e.g. replace "blocks/s" with "{{job}} blocks/s",
"retries" with "{{job}} retries", "rebuilds" with "{{job}} rebuilds", and
"reorgs" with "{{job}} reorgs"); make the same replacements for the other
occurrences noted (the entries at the other lines) so all per-job series keep
their {{job}} label in the legend.

In
`@docker/observation/configs/grafana/provisioning/datasources/datasources.yaml`:
- Around line 24-25: The current Grafana datasource settings use a very narrow
spanStartTimeShift and spanEndTimeShift of -1m / 1m which can miss logs in a
multi-node P2P system; update the values for spanStartTimeShift and
spanEndTimeShift (in the datasources.yaml snippet) to a wider window such as -5m
and 5m (or larger if needed) so cross-node trace log correlation tolerates OTLP
batching and clock skew.
- Around line 22-28: Replace the legacy tracesToLogs block with the newer
tracesToLogsV2 format: change the key from tracesToLogs to tracesToLogsV2,
switch to a customQuery that explicitly targets the Elasticsearch field
containing the trace ID (e.g., traceId or trace.id) instead of relying on tags,
and escape Grafana template variables by doubling the $ (e.g.,
$${__span.traceId}, $${__span.tags.request.id}) so provisioning does not expand
them; update the tags mapping or remove it if you fully target the trace field
in customQuery and verify the exact ES field name used by your OTLP exporter
before deploying.

---

Outside diff comments:
In `@crates/actors/src/validation_service.rs`:
- Around line 248-278: The current VDF handling in
process_vdf()/ConcurrentValidationResult drops the original request span so
BlockValidationFinished uses tracing::Span::current() (the validation loop span)
instead of the originating parent_span from ValidateBlock; update
active_validations::process_vdf() and the ConcurrentValidationResult to carry an
Option<tracing::Span> (e.g., parent_span) alongside request_id and validation
result, modify the callers that build ConcurrentValidationResult to attach the
original parent_span from ValidateBlock, and in validation_service when
constructing BlockValidationFinished replace span: tracing::Span::current() with
the propagated span (or fallback to current() if None) so the BlockTreeService
sees the original request's span as the parent.

In `@crates/p2p/src/block_pool.rs`:
- Around line 633-643: Add request_id to the tracing span for process_block by
including it in the #[instrument(...)] fields: update the attribute on
pub(crate) async fn process_block(...) to include request_id = ?request_id
alongside block.hash and block.height (e.g., fields(block.hash =
?block.header().block_hash, block.height = block.header().height, request_id =
?request_id)); keep skip_all if you need to avoid auto-capturing other args, but
ensure request_id is explicitly listed so traces include it.
- Around line 1058-1109: The background task spawned in
pull_and_seal_execution_payload_in_background detaches from the active tracing
span because tokio::spawn is called without inheriting the current span; fix by
instrumenting the spawned future so it inherits the current span (e.g., import
tracing::Instrument if missing) and call .in_current_span() (or
.instrument(tracing::Span::current())) on the async block passed to tokio::spawn
for the task that uses execution_payload_provider, gossip_broadcast_sender, and
chain_sync_sender so the span context is preserved into the background payload
fetch.

In `@crates/p2p/src/chain_sync.rs`:
- Around line 547-620: The spawned tasks for
SyncChainServiceMessage::RequestBlockFromTheNetwork,
::PullPayloadFromTheNetwork, and ::AttemptReprocessingBlock drop the current
tracing/OpenTelemetry span because they call tokio::spawn without
instrumentation; wrap each tokio::spawn(...) future with .in_current_span()
(Instrument as _ is already imported) so the request trace is propagated.
Concretely, apply .in_current_span() to the spawned futures that call
inner.request_parent_block(...),
inner.gossip_data_handler.pull_and_add_execution_payload_to_cache(...), and the
block reprocessing future that invokes inner.block_pool.process_block(...) so
the active span/traceparent is preserved across the spawn boundary.

In `@crates/utils/utils/src/metric_macros.rs`:
- Around line 1-20: Remove the dead macro arms named `@type_str` from the
`define_metrics!` macro in metric_macros.rs: delete the three orphaned
`@type_str` variants that handled `counter`, `gauge`, and `histogram` since they
are no longer referenced by any `define_metrics!` invocation; keep the top-level
macro arms that build `.u64_counter()`, `.u64_gauge()`, `.f64_histogram()`,
`.with_description()`, `.with_boundaries()`, and `.build()` intact and run tests
to ensure nothing else relied on the removed arms.

In `@docker/observation/configs/otel-collector/config.yaml`:
- Around line 62-63: The config declares a debug exporter block named "debug"
but never uses it; either remove or comment out the "debug:" exporter stanza or
add "debug" to the appropriate pipeline(s) under the service.pipelines section
so it becomes active during troubleshooting; locate the "debug" exporter key in
the config and either delete/comment that block or add "debug" to the desired
pipeline's exporters list (e.g., service.pipelines.<your_pipeline>.exporters).
- Around line 19-26: Remove the orphaned connectors block that defines the
spanmetrics connector: delete the entire connectors: section (which contains
spanmetrics with histogram, dimensions, and metrics_expiration) since
spanmetrics is no longer referenced by traces.exporters or metrics.receivers;
this prevents a future runtime failure where a connector is used only as an
exporter or receiver. Ensure no other connectors remain before removing the
connectors: block entirely.

coderabbitai

> [!CAUTION]

Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/p2p/src/server.rs (1)
647-665: 🧹 Nitpick | 🔵 Trivial

V2 block header/body handlers are missing the success-path info log present in their V1 counterparts.

handle_block_header_v1 emits info!("Node {:?}: Server handler handled block {} height {}", ...) in the spawn's else branch, and handle_block_body_v1 does the same. Neither handle_block_header_v2 (lines 647–665) nor handle_block_body_v2 (lines 712–729) has an equivalent else { info!(...) } branch, leaving successful V2 block processing silent in the logs.
♻️ Proposed fix — add success-path info logs to V2 block handlers

For handle_block_header_v2, inside the spawned task:
         if let Err(error) = server
             .data_handler
             .handle_block_header(v2_request, source_socket_addr)
             .await
         {
             Self::handle_invalid_data(&source_miner_address, &error, &server.peer_list);
             if !error.is_advisory() {
                 error!(
                     "Node {:?}: Failed to process the block {} height {}: {:?}",
                     this_node_id, block_hash, block_height, error
                 );
             }
+        } else {
+            info!(
+                "Node {:?}: Server handler handled block {} height {}",
+                this_node_id, block_hash, block_height
+            );
         }
Apply the same pattern to handle_block_body_v2.
Also applies to: 712-729
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/p2p/src/server.rs` around lines 647 - 665, The V2 handlers lack the
success-path info log present in V1: update the spawned tasks in
handle_block_header_v2 and handle_block_body_v2 so that after awaiting
server.data_handler.handle_block_header(...) / handle_block_body(...) you mirror
the V1 pattern — when the call returns Ok(...) emit an info! log (using the same
message shape as V1: include this_node_id, block_hash and block_height) in the
else branch; keep the existing Err handling intact so only add the info! call in
the success branch of those spawned async blocks.

♻️ Duplicate comments (1)

crates/utils/utils/src/telemetry.rs (1)

325-325: Composite propagator consideration (open item).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/utils/utils/src/telemetry.rs` at line 325, The current call sets only
TraceContextPropagator via
opentelemetry::global::set_text_map_propagator(TraceContextPropagator::new()); —
change this to use a CompositeTextMapPropagator that includes
TraceContextPropagator and BaggagePropagator (or make the set configurable) so
both trace context and baggage are propagated; update the call to construct
CompositeTextMapPropagator::new(vec![Box::new(TraceContextPropagator::new()),
Box::new(BaggagePropagator::new())]) (or equivalent) and add the necessary
imports to replace the single-propagator setup in telemetry.rs.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@crates/p2p/src/server.rs`:
- Around line 647-665: The V2 handlers lack the success-path info log present in
V1: update the spawned tasks in handle_block_header_v2 and handle_block_body_v2
so that after awaiting server.data_handler.handle_block_header(...) /
handle_block_body(...) you mirror the V1 pattern — when the call returns Ok(...)
emit an info! log (using the same message shape as V1: include this_node_id,
block_hash and block_height) in the else branch; keep the existing Err handling
intact so only add the info! call in the success branch of those spawned async
blocks.

---

Duplicate comments:
In `@crates/utils/utils/src/telemetry.rs`:
- Line 325: The current call sets only TraceContextPropagator via
opentelemetry::global::set_text_map_propagator(TraceContextPropagator::new()); —
change this to use a CompositeTextMapPropagator that includes
TraceContextPropagator and BaggagePropagator (or make the set configurable) so
both trace context and baggage are propagated; update the call to construct
CompositeTextMapPropagator::new(vec![Box::new(TraceContextPropagator::new()),
Box::new(BaggagePropagator::new())]) (or equivalent) and add the necessary
imports to replace the single-propagator setup in telemetry.rs.

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c608d2 and 58563e0.

📒 Files selected for processing (2)

crates/p2p/src/server.rs
crates/utils/utils/src/telemetry.rs

JesseTheRobot · 2026-02-23T18:54:17Z

+        let method = req.method().to_string();
+        let path = req
+            .match_pattern()
+            .unwrap_or_else(|| "unmatched".to_string());


we should probably be ignoring unmatched HTTP routes - but I'm not sure - thoughts?

I think it is better to err on the side of more information in the first instance. We can remove later

JesseTheRobot · 2026-02-23T18:54:44Z

+        Box::pin(async move {
+            let res = fut.await?;
+            let status = res.status().as_u16().to_string();
+            let duration_ms = start.elapsed().as_secs_f64() * 1000.0;


this is a very common pattern - maybe make it an extension trait method?

JesseTheRobot · 2026-02-23T18:59:20Z

+        let mut rng = rand::thread_rng();
+        loop {
+            rng.fill(&mut bytes);
+            if bytes != [0_u8; 8] {


interesting guard condition - what's the story?

The guard rejects all-zero span IDs because the OpenTelemetry spec reserves 0x0000000000000000 as "invalid / no span". If we emitted a zero span ID, backends (Tempo, Jaeger, etc.) would treat the span as nonexistent and drop it.

The probability of rng.gen::() == 0 is 1/2^64 (~5.4 × 10^-20), so the loop virtually never repeats — but the spec requires it.

JesseTheRobot · 2026-02-23T19:53:37Z

+            // Cache data is non-authoritative and can be rebuilt from chain state,
+            // so trade durability for write throughput by skipping fsync operations.
+            // On crash/corruption: delete the cache DB directory; it will be rebuilt
+            // on next startup from authoritative chain data.


nit: I don't think you need to delete the directory - I believe this sync mode will safely lose data without compromising the data that already exists (see the comment for SafeNoSync)

have changed the comment

JesseTheRobot

LGTM!

…d<T> wrapper

JesseTheRobot · 2026-02-24T17:11:44Z

+    pub span: tracing::Span,
+}
+
+impl<T> Traced<T> {


nit: maybe add an Into for all T into Traced? so we can just do .into()

JesseTheRobot · 2026-02-24T17:12:28Z

+    }
+
+    /// Destructure into (inner, span).
+    pub fn into_parts(self) -> (T, tracing::Span) {


this can also be am Into impl

we can also have a into_message fn that takes the span, activates it in the current thread, and then passes back T to the caller (this should work as spans are built on top of TLS)

JesseTheRobot · 2026-02-24T17:16:16Z

                    match cmd {
-                        Some(cmd) => {
+                        Some(traced) => {
+                            let (cmd, _parent_span) = traced.into_parts();


is not using the span intentional?

JesseTheRobot · 2026-02-24T17:18:22Z

        let process_remaining = async {
-            while let Ok(msg) = self.msg_rx.try_recv() {
+            while let Ok(traced) = self.msg_rx.try_recv() {
+                let (msg, _parent_span) = traced.into_parts();


another parent span not being used

JesseTheRobot · 2026-02-24T17:19:25Z

+    fn elapsed_ms(&self) -> f64;
+}
+
+impl ElapsedMs for std::time::Instant {


JesseTheRobot

LGTM! just some nonblocking feedback

glottologist added 25 commits February 20, 2026 11:58

fix(logging): use a self hosted stack for monitoring instead of Axiom

1c82b62

fix(obs): address review comments

05dc572

fix(obs): use actix-web metric support rather than hand-rolled

34a8a57

fix(obs): address review comments

0748882

fix(obs): use spanmetrics connector to instrument methods automatically

04f39e0

fix(obs): add additional useful metrics

eabfaae

fix(obs): split out dashboards and add gossip error metrics

f8f0a27

fix(obs): moved dashboard location

64af4b5

fix(obs): review comments

259f2dd

fix(obs): review comments

71f7afd

fix(obs): review comments

83a4f30

chore: review comments

5955999

fix(logging): use a self hosted stack for monitoring instead of Axiom

22eef01

fix(obs): address review comments

7a2aaeb

fix(obs): add additional useful metrics

f8fb7f3

fix(obs): split out dashboards and add gossip error metrics

ad615be

feat(database): add SafeNoSync for cache DB and cache_chunk_verified

74ca136

feat(mempool): batch block confirmation to reduce lock contention

76cadaf

fix: resolve rebase conflicts

c971a2f

refactor(metrics): replace OnceLock boilerplate with define_metrics! …

ae2b360

…macro

chore: review comments

803a8eb

chore: review comments

be30c9c

feat(tracing): add cross-node request lifecycle tracing

dc78244

feat(tracing): add trace to logs correlation to grafana

d8c1128

chore(tracing): review comments

6eca8b3

chore(tracing): review comments

c949747

glottologist marked this pull request as ready for review February 20, 2026 15:24

coderabbitai Bot reviewed Feb 20, 2026

View reviewed changes

chore(tracing): review comments

76b5d31

coderabbitai Bot reviewed Feb 23, 2026

View reviewed changes

chore: review comments

0cc567a

glottologist requested a review from JesseTheRobot February 23, 2026 18:20

JesseTheRobot reviewed Feb 23, 2026

View reviewed changes

Comment thread crates/actors/src/mempool_service.rs

JesseTheRobot reviewed Feb 23, 2026

View reviewed changes

Comment thread crates/utils/utils/src/telemetry.rs

JesseTheRobot reviewed Feb 23, 2026

View reviewed changes

feat: Optimize trace and span ID generation

257b161

JesseTheRobot reviewed Feb 23, 2026

View reviewed changes

Comment thread crates/p2p/src/gossip_data_handler.rs

JesseTheRobot approved these changes Feb 23, 2026

View reviewed changes

glottologist added 4 commits February 24, 2026 13:18

feat(tracing): add span click through and fix indexing and dashboards

0060376

refactor(tracing): replace per-variant span fields with generic Trace…

252ebf9

…d<T> wrapper

refactor(tracing): extract common patterns

b4389e9

docs(tracing): review comment

ae59d64

JesseTheRobot reviewed Feb 24, 2026

View reviewed changes

JesseTheRobot approved these changes Feb 24, 2026

View reviewed changes

refactor(tracing): review comments

5bc339b

glottologist merged commit 8e430bf into master Feb 24, 2026
25 of 26 checks passed

glottologist deleted the jason/enhanced_tracing branch February 24, 2026 18:26

This was referenced Feb 25, 2026

fix(tracing): add missing spans #1161

Merged

test: stability #1142

Merged

coderabbitai Bot mentioned this pull request Mar 9, 2026

feat: gossip wire types #1184

Merged

3 tasks

This was referenced May 10, 2026

feat(metrics): expose MDBX metrics via Reth's /metrics endpoint #1409

Merged

fix: validation error disambiguation #1416

Merged

Conversation

glottologist commented Feb 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Cross-Node Trace Propagation (crates/p2p)

Request ID (crates/types)

Intra-Node Span Propagation (crates/actors)

JSON Structured Logging (crates/chain, crates/utils)

Design Decision: Why RequestId Instead of Tracing Span IDs

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JesseTheRobot left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JesseTheRobot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

glottologist commented Feb 20, 2026 •

edited by coderabbitai Bot

Loading

Cross-Node Trace Propagation (`crates/p2p`)

Request ID (`crates/types`)

Intra-Node Span Propagation (`crates/actors`)

JSON Structured Logging (`crates/chain`, `crates/utils`)

Design Decision: Why `RequestId` Instead of Tracing Span IDs

coderabbitai Bot commented Feb 20, 2026 •

edited

Loading