feat: add cross-node request lifecycle tracing and observability improvements#1156
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughPropagates tracing::Span across actor messages, validation tasks, mempool ingress, API routes, and P2P gossip; adds HTTP request metrics middleware, OpenTelemetry trace-context propagation, UUIDv7 trace ID generation, metric adjustments, and related test updates. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant API_Server
participant Mempool
participant Validation
participant BlockTree
participant P2P
rect rgba(0,128,255,0.5)
Client->>API_Server: HTTP request (creates tracing span)
API_Server->>Mempool: Ingest*Tx(..., span=Span::current())
end
rect rgba(0,200,100,0.5)
Mempool->>Validation: ValidateBlock(block, skip_vdf, span)
Validation->>BlockTree: BlockValidationFinished(hash, result, span)
end
rect rgba(200,100,0,0.5)
BlockTree->>P2P: Gossip outbound (attach traced headers)
P2P->>External: HTTP call with traced headers (inject_trace_context)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
docker/observation/configs/otel-collector/config.yaml (2)
62-63: 🧹 Nitpick | 🔵 TrivialUnused
debugexporter — remove or comment out.The
debugexporter is declared but never added to any pipeline, so it is silently disabled. Configuring an exporter does not enable it; exporters are enabled by adding them to the appropriate pipelines within the service section. Either wire it into a pipeline when troubleshooting is needed, or remove it to keep the configuration free of dead declarations.🧹 Proposed cleanup
- # Debug exporter for troubleshooting (optional) - debug: - verbosity: basic - service:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/observation/configs/otel-collector/config.yaml` around lines 62 - 63, The config declares a debug exporter block named "debug" but never uses it; either remove or comment out the "debug:" exporter stanza or add "debug" to the appropriate pipeline(s) under the service.pipelines section so it becomes active during troubleshooting; locate the "debug" exporter key in the config and either delete/comment that block or add "debug" to the desired pipeline's exporters list (e.g., service.pipelines.<your_pipeline>.exporters).
19-26:⚠️ Potential issue | 🟠 MajorRemove the now-orphaned
spanmetricsconnector definition.The
spanmetricsconnector was removed from both its pipeline roles —traces.exporters(line 75) andmetrics.receivers(line 77) — but its full definition still lives in theconnectors:block. Configuring a connector doesn't enable it; connectors are enabled through pipelines within the service section. So the collector won't error on startup today. However, leaving the definition in place is dangerous: each connector MUST be used as both, in separate pipelines, so if a future edit re-wiresspanmetricsinto only one pipeline role (e.g., only back intotraces.exporters), the collector will fail at runtime with"failed to build pipelines: connector spanmetrics used as exporter … but not used in any supported receiver pipeline". This class of error is confirmed by real-world collector failures documented in the upstream tracker.Delete the entire
connectors:section since no other connector is defined.🗑️ Proposed fix: remove orphaned connector block
-connectors: - spanmetrics: - histogram: - explicit: - buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s, 30s] - dimensions: - - name: service.name - metrics_expiration: 5m - processors:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/observation/configs/otel-collector/config.yaml` around lines 19 - 26, Remove the orphaned connectors block that defines the spanmetrics connector: delete the entire connectors: section (which contains spanmetrics with histogram, dimensions, and metrics_expiration) since spanmetrics is no longer referenced by traces.exporters or metrics.receivers; this prevents a future runtime failure where a connector is used only as an exporter or receiver. Ensure no other connectors remain before removing the connectors: block entirely.crates/utils/utils/src/metric_macros.rs (1)
1-20: 🧹 Nitpick | 🔵 TrivialRemove orphaned
@type_strmacro arms — they are no longer invoked.The three
@type_strarms (counter, gauge, histogram) were previously used when the macro generated aMETRICSdescriptor array. After removal of that code path, these arms became dead code. Search of the entire codebase confirms no invocations of@type_strexist; alldefine_metrics!calls use only the top-level macro syntax.The OpenTelemetry 0.31.0 builder API is correctly implemented:
.u64_counter(),.u64_gauge(),.f64_histogram(),.with_description(),.with_boundaries(), and.build()all match the public API for that version.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/utils/utils/src/metric_macros.rs` around lines 1 - 20, Remove the dead macro arms named `@type_str` from the `define_metrics!` macro in metric_macros.rs: delete the three orphaned `@type_str` variants that handled `counter`, `gauge`, and `histogram` since they are no longer referenced by any `define_metrics!` invocation; keep the top-level macro arms that build `.u64_counter()`, `.u64_gauge()`, `.f64_histogram()`, `.with_description()`, `.with_boundaries()`, and `.build()` intact and run tests to ensure nothing else relied on the removed arms.crates/actors/src/validation_service.rs (1)
248-278: 🧹 Nitpick | 🔵 TrivialConsider propagating the original request's
parent_spanthrough toBlockValidationFinished.All three
BlockValidationFinishedconstruction sites usespan: tracing::Span::current(), which captures the validation service's event-loop span rather than the originalparent_spanpassed viaValidateBlock. As a result,BlockTreeServiceoperations triggered by validation results appear as children of the validation service's loop span in traces — not of the original API/gossip request that initiated the validation.The
request_idcorrectly provides end-to-end correlation (per the PR's design rationale), so this doesn't affect functional correctness. But the span hierarchy is broken at this boundary: the full chainoriginal_request_span → validation_stages → block_tree_handlingis not represented.If richer span hierarchies are desired, the fix is to propagate
parent_spanalongsiderequest_idthroughprocess_vdf()andConcurrentValidationResult:♻️ Optional: carry parent_span through to BlockValidationFinished
In
active_validations.rs, extendprocess_vdf()return type andConcurrentValidationResult:-) -> Option<(BlockHash, VdfValidationResult, Option<irys_types::RequestId>)> +) -> Option<(BlockHash, VdfValidationResult, Option<irys_types::RequestId>, tracing::Span)>pub struct ConcurrentValidationResult { pub block_hash: BlockHash, pub validation_result: ValidationResult, pub request_id: Option<irys_types::RequestId>, + pub parent_span: tracing::Span, }Then in
validation_service.rs, replacespan: tracing::Span::current()with the propagated span in allBlockValidationFinishedsites.Also applies to: 282-295
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/actors/src/validation_service.rs` around lines 248 - 278, The current VDF handling in process_vdf()/ConcurrentValidationResult drops the original request span so BlockValidationFinished uses tracing::Span::current() (the validation loop span) instead of the originating parent_span from ValidateBlock; update active_validations::process_vdf() and the ConcurrentValidationResult to carry an Option<tracing::Span> (e.g., parent_span) alongside request_id and validation result, modify the callers that build ConcurrentValidationResult to attach the original parent_span from ValidateBlock, and in validation_service when constructing BlockValidationFinished replace span: tracing::Span::current() with the propagated span (or fallback to current() if None) so the BlockTreeService sees the original request's span as the parent.crates/p2p/src/block_pool.rs (2)
633-643: 🧹 Nitpick | 🔵 Trivial
request_idmissing from#[instrument]span fields.The
skip_allattribute excludesrequest_idfrom the auto-generated span, so the request ID won't be visible in traces for the block processing span. Given the PR's explicit goal of per-request tracing correlation, adding it tofields(...)would be valuable.🔭 Add `request_id` to instrument fields
#[instrument( skip_all, target = "BlockPool", - fields(block.hash = ?block.header().block_hash, block.height = block.header().height), + fields(block.hash = ?block.header().block_hash, block.height = block.header().height, request_id = ?request_id), )]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/p2p/src/block_pool.rs` around lines 633 - 643, Add request_id to the tracing span for process_block by including it in the #[instrument(...)] fields: update the attribute on pub(crate) async fn process_block(...) to include request_id = ?request_id alongside block.hash and block.height (e.g., fields(block.hash = ?block.header().block_hash, block.height = block.header().height, request_id = ?request_id)); keep skip_all if you need to avoid auto-capturing other args, but ensure request_id is explicitly listed so traces include it.
1058-1109: 🧹 Nitpick | 🔵 TrivialBackground task drops span context — breaks the intra-node trace chain.
tokio::spawnat Line 1071 doesn't call.in_current_span(), so the spawned task detaches from the active OpenTelemetry span. Any traceparent propagated via the PR's W3C tracecontext machinery won't extend into this background payload fetch. This directly undermines the PR's intra-node span propagation goal.🔭 Inherit span context in the background task
Add the import at the top of the file (if not already present):
+use tracing::Instrument as _;Then instrument the spawned task:
- tokio::spawn(async move { + tokio::spawn(async move { match Self::pull_and_seal_execution_payload( &execution_payload_provider, &chain_sync_sender, evm_block_hash, use_trusted_peers_only, None, request_id, ) .await { // ... } - }); + }.in_current_span());🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/p2p/src/block_pool.rs` around lines 1058 - 1109, The background task spawned in pull_and_seal_execution_payload_in_background detaches from the active tracing span because tokio::spawn is called without inheriting the current span; fix by instrumenting the spawned future so it inherits the current span (e.g., import tracing::Instrument if missing) and call .in_current_span() (or .instrument(tracing::Span::current())) on the async block passed to tokio::spawn for the task that uses execution_payload_provider, gossip_broadcast_sender, and chain_sync_sender so the span context is preserved into the background payload fetch.crates/p2p/src/chain_sync.rs (1)
547-620: 🧹 Nitpick | 🔵 TrivialThree updated task spawns in
handle_messagedrop the span context.
RequestBlockFromTheNetwork(Line 557),PullPayloadFromTheNetwork(Line 579), andAttemptReprocessingBlock(Line 599) all spawn tasks that now propagaterequest_idbut don't call.in_current_span(). The active OpenTelemetry span (and any injectedtraceparent) is lost at each spawn boundary.Instrument as _is already imported (Line 23); the fix is identical for all three.🔭 Propagate span context into all three message-handler spawns
// RequestBlockFromTheNetwork handler - tokio::spawn(async move { + tokio::spawn(async move { let result = inner .request_parent_block(parent_block_hash, request_id) .await; if let Some(sender) = response { if let Err(e) = sender.send(result) { tracing::error!("Failed to send response: {:?}", e); } } - }); + }.in_current_span()); // PullPayloadFromTheNetwork handler - tokio::spawn(async move { + tokio::spawn(async move { let result = inner .gossip_data_handler .pull_and_add_execution_payload_to_cache( evm_block_hash, use_trusted_peers_only, request_id, ) .await; if let Err(e) = response.send(result) { tracing::error!("Failed to send response: {:?}", e); } - }); + }.in_current_span()); // AttemptReprocessingBlock handler - tokio::spawn(async move { + tokio::spawn(async move { if let Some(cached_block) = inner.block_pool.get_cached_block(&block_hash).await { if let Err(e) = inner .block_pool .process_block( Arc::clone(&cached_block.block), cached_block.is_fast_tracking, cached_block.request_id, ) .await { error!("Failed to reprocess block {:?}: {:?}", block_hash, e); } } else { error!( "Cannot reprocess block {:?} - not found in cache", block_hash ); } - }); + }.in_current_span());🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/p2p/src/chain_sync.rs` around lines 547 - 620, The spawned tasks for SyncChainServiceMessage::RequestBlockFromTheNetwork, ::PullPayloadFromTheNetwork, and ::AttemptReprocessingBlock drop the current tracing/OpenTelemetry span because they call tokio::spawn without instrumentation; wrap each tokio::spawn(...) future with .in_current_span() (Instrument as _ is already imported) so the request trace is propagated. Concretely, apply .in_current_span() to the spawned futures that call inner.request_parent_block(...), inner.gossip_data_handler.pull_and_add_execution_payload_to_cache(...), and the block reprocessing future that invokes inner.block_pool.process_block(...) so the active span/traceparent is preserved across the spawn boundary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/actors/src/validation_service/active_validations.rs`:
- Around line 354-359: The code creates two identical bindings request_id and
task_request_id from task.request_id inside the VdfValidationResult::Valid arm;
remove the duplicate by capturing request_id once (e.g., before the match or use
task.request_id directly in the Valid arm) and reference that single binding
wherever task_request_id was used (including the return near where request_id is
expected), ensuring all uses of task_request_id are updated to the single
request_id binding.
In `@crates/chain/src/main.rs`:
- Around line 90-112: The init_tracing code duplicates log-format selection
logic from telemetry.rs::setup_tracing_subscriber (reading IRYS_LOG_FORMAT and
building JSON vs plain fmt layers) and also omits
.with_span_events(FmtSpan::NONE) present in telemetry.rs; refactor by extracting
the shared formatter construction into a single helper (e.g., make_fmt_layer()
-> Box<dyn Layer<...> + Send + Sync>) placed in irys-utils, update
telemetry.rs::setup_tracing_subscriber and crates/chain::init_tracing to call
make_fmt_layer() and only handle selection/initialization there so both JSON and
plain-text options reuse the exact same configuration (including
.with_span_events(FmtSpan::NONE)) to remove duplication and keep behavior
consistent.
In `@crates/types/src/request_id.rs`:
- Around line 44-48: The Default impl for RequestId currently returns a fresh
random ID (impl Default for RequestId -> fn default() -> Self { Self::new() }),
which is surprising; add a brief doc comment directly above the Default impl (or
augment the RequestId struct docs) stating that Default::default() produces a
new unique/random RequestId each call and is not a sentinel/zero value so
callers should not rely on a stable deterministic default; reference RequestId
and Default::default() in the comment so future maintainers and users know the
intended behavior.
In `@crates/utils/utils/src/metric_macros.rs`:
- Around line 80-83: Delete the dead internal macro arms for `@type_str` in
metric_macros.rs: remove the three orphaned arms `(`@type_str` counter) => {
"counter" };`, `(`@type_str` gauge) => { "gauge" };`, and `(`@type_str` histogram)
=> { "histogram" };` because no macro in the codebase invokes `@type_str`; after
removal, run a build/check to ensure no remaining macro references to
`@type_str` exist and adjust any tests if they relied on the removed helper.
In `@docker/observation/configs/grafana/dashboards/block-production.json`:
- Line 419: The panels currently use static legendFormat values ("blocks/s",
"retries", "rebuilds", "reorgs") which collapses all per-job series into
identical legends when the $node template is set with includeAll:true and
allValue:".*"; update each panel's legendFormat to include the job variable so
each series is distinguishable (e.g. replace "blocks/s" with "{{job}} blocks/s",
"retries" with "{{job}} retries", "rebuilds" with "{{job}} rebuilds", and
"reorgs" with "{{job}} reorgs"); make the same replacements for the other
occurrences noted (the entries at the other lines) so all per-job series keep
their {{job}} label in the legend.
In
`@docker/observation/configs/grafana/provisioning/datasources/datasources.yaml`:
- Around line 24-25: The current Grafana datasource settings use a very narrow
spanStartTimeShift and spanEndTimeShift of -1m / 1m which can miss logs in a
multi-node P2P system; update the values for spanStartTimeShift and
spanEndTimeShift (in the datasources.yaml snippet) to a wider window such as -5m
and 5m (or larger if needed) so cross-node trace log correlation tolerates OTLP
batching and clock skew.
- Around line 22-28: Replace the legacy tracesToLogs block with the newer
tracesToLogsV2 format: change the key from tracesToLogs to tracesToLogsV2,
switch to a customQuery that explicitly targets the Elasticsearch field
containing the trace ID (e.g., traceId or trace.id) instead of relying on tags,
and escape Grafana template variables by doubling the $ (e.g.,
$${__span.traceId}, $${__span.tags.request.id}) so provisioning does not expand
them; update the tags mapping or remove it if you fully target the trace field
in customQuery and verify the exact ES field name used by your OTLP exporter
before deploying.
---
Outside diff comments:
In `@crates/actors/src/validation_service.rs`:
- Around line 248-278: The current VDF handling in
process_vdf()/ConcurrentValidationResult drops the original request span so
BlockValidationFinished uses tracing::Span::current() (the validation loop span)
instead of the originating parent_span from ValidateBlock; update
active_validations::process_vdf() and the ConcurrentValidationResult to carry an
Option<tracing::Span> (e.g., parent_span) alongside request_id and validation
result, modify the callers that build ConcurrentValidationResult to attach the
original parent_span from ValidateBlock, and in validation_service when
constructing BlockValidationFinished replace span: tracing::Span::current() with
the propagated span (or fallback to current() if None) so the BlockTreeService
sees the original request's span as the parent.
In `@crates/p2p/src/block_pool.rs`:
- Around line 633-643: Add request_id to the tracing span for process_block by
including it in the #[instrument(...)] fields: update the attribute on
pub(crate) async fn process_block(...) to include request_id = ?request_id
alongside block.hash and block.height (e.g., fields(block.hash =
?block.header().block_hash, block.height = block.header().height, request_id =
?request_id)); keep skip_all if you need to avoid auto-capturing other args, but
ensure request_id is explicitly listed so traces include it.
- Around line 1058-1109: The background task spawned in
pull_and_seal_execution_payload_in_background detaches from the active tracing
span because tokio::spawn is called without inheriting the current span; fix by
instrumenting the spawned future so it inherits the current span (e.g., import
tracing::Instrument if missing) and call .in_current_span() (or
.instrument(tracing::Span::current())) on the async block passed to tokio::spawn
for the task that uses execution_payload_provider, gossip_broadcast_sender, and
chain_sync_sender so the span context is preserved into the background payload
fetch.
In `@crates/p2p/src/chain_sync.rs`:
- Around line 547-620: The spawned tasks for
SyncChainServiceMessage::RequestBlockFromTheNetwork,
::PullPayloadFromTheNetwork, and ::AttemptReprocessingBlock drop the current
tracing/OpenTelemetry span because they call tokio::spawn without
instrumentation; wrap each tokio::spawn(...) future with .in_current_span()
(Instrument as _ is already imported) so the request trace is propagated.
Concretely, apply .in_current_span() to the spawned futures that call
inner.request_parent_block(...),
inner.gossip_data_handler.pull_and_add_execution_payload_to_cache(...), and the
block reprocessing future that invokes inner.block_pool.process_block(...) so
the active span/traceparent is preserved across the spawn boundary.
In `@crates/utils/utils/src/metric_macros.rs`:
- Around line 1-20: Remove the dead macro arms named `@type_str` from the
`define_metrics!` macro in metric_macros.rs: delete the three orphaned
`@type_str` variants that handled `counter`, `gauge`, and `histogram` since they
are no longer referenced by any `define_metrics!` invocation; keep the top-level
macro arms that build `.u64_counter()`, `.u64_gauge()`, `.f64_histogram()`,
`.with_description()`, `.with_boundaries()`, and `.build()` intact and run tests
to ensure nothing else relied on the removed arms.
In `@docker/observation/configs/otel-collector/config.yaml`:
- Around line 62-63: The config declares a debug exporter block named "debug"
but never uses it; either remove or comment out the "debug:" exporter stanza or
add "debug" to the appropriate pipeline(s) under the service.pipelines section
so it becomes active during troubleshooting; locate the "debug" exporter key in
the config and either delete/comment that block or add "debug" to the desired
pipeline's exporters list (e.g., service.pipelines.<your_pipeline>.exporters).
- Around line 19-26: Remove the orphaned connectors block that defines the
spanmetrics connector: delete the entire connectors: section (which contains
spanmetrics with histogram, dimensions, and metrics_expiration) since
spanmetrics is no longer referenced by traces.exporters or metrics.receivers;
this prevents a future runtime failure where a connector is used only as an
exporter or receiver. Ensure no other connectors remain before removing the
connectors: block entirely.
There was a problem hiding this comment.
> [!CAUTION]
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)crates/p2p/src/server.rs (1)
647-665: 🧹 Nitpick | 🔵 TrivialV2 block header/body handlers are missing the success-path info log present in their V1 counterparts.
handle_block_header_v1emitsinfo!("Node {:?}: Server handler handled block {} height {}", ...)in the spawn'selsebranch, andhandle_block_body_v1does the same. Neitherhandle_block_header_v2(lines 647–665) norhandle_block_body_v2(lines 712–729) has an equivalentelse { info!(...) }branch, leaving successful V2 block processing silent in the logs.♻️ Proposed fix — add success-path info logs to V2 block handlers
For
handle_block_header_v2, inside the spawned task:if let Err(error) = server .data_handler .handle_block_header(v2_request, source_socket_addr) .await { Self::handle_invalid_data(&source_miner_address, &error, &server.peer_list); if !error.is_advisory() { error!( "Node {:?}: Failed to process the block {} height {}: {:?}", this_node_id, block_hash, block_height, error ); } + } else { + info!( + "Node {:?}: Server handler handled block {} height {}", + this_node_id, block_hash, block_height + ); }Apply the same pattern to
handle_block_body_v2.Also applies to: 712-729
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/p2p/src/server.rs` around lines 647 - 665, The V2 handlers lack the success-path info log present in V1: update the spawned tasks in handle_block_header_v2 and handle_block_body_v2 so that after awaiting server.data_handler.handle_block_header(...) / handle_block_body(...) you mirror the V1 pattern — when the call returns Ok(...) emit an info! log (using the same message shape as V1: include this_node_id, block_hash and block_height) in the else branch; keep the existing Err handling intact so only add the info! call in the success branch of those spawned async blocks.
♻️ Duplicate comments (1)
crates/utils/utils/src/telemetry.rs (1)
325-325: Composite propagator consideration (open item).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/utils/utils/src/telemetry.rs` at line 325, The current call sets only TraceContextPropagator via opentelemetry::global::set_text_map_propagator(TraceContextPropagator::new()); — change this to use a CompositeTextMapPropagator that includes TraceContextPropagator and BaggagePropagator (or make the set configurable) so both trace context and baggage are propagated; update the call to construct CompositeTextMapPropagator::new(vec![Box::new(TraceContextPropagator::new()), Box::new(BaggagePropagator::new())]) (or equivalent) and add the necessary imports to replace the single-propagator setup in telemetry.rs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@crates/p2p/src/server.rs`:
- Around line 647-665: The V2 handlers lack the success-path info log present in
V1: update the spawned tasks in handle_block_header_v2 and handle_block_body_v2
so that after awaiting server.data_handler.handle_block_header(...) /
handle_block_body(...) you mirror the V1 pattern — when the call returns Ok(...)
emit an info! log (using the same message shape as V1: include this_node_id,
block_hash and block_height) in the else branch; keep the existing Err handling
intact so only add the info! call in the success branch of those spawned async
blocks.
---
Duplicate comments:
In `@crates/utils/utils/src/telemetry.rs`:
- Line 325: The current call sets only TraceContextPropagator via
opentelemetry::global::set_text_map_propagator(TraceContextPropagator::new()); —
change this to use a CompositeTextMapPropagator that includes
TraceContextPropagator and BaggagePropagator (or make the set configurable) so
both trace context and baggage are propagated; update the call to construct
CompositeTextMapPropagator::new(vec![Box::new(TraceContextPropagator::new()),
Box::new(BaggagePropagator::new())]) (or equivalent) and add the necessary
imports to replace the single-propagator setup in telemetry.rs.
| let method = req.method().to_string(); | ||
| let path = req | ||
| .match_pattern() | ||
| .unwrap_or_else(|| "unmatched".to_string()); |
There was a problem hiding this comment.
we should probably be ignoring unmatched HTTP routes - but I'm not sure - thoughts?
There was a problem hiding this comment.
I think it is better to err on the side of more information in the first instance. We can remove later
| Box::pin(async move { | ||
| let res = fut.await?; | ||
| let status = res.status().as_u16().to_string(); | ||
| let duration_ms = start.elapsed().as_secs_f64() * 1000.0; |
There was a problem hiding this comment.
this is a very common pattern - maybe make it an extension trait method?
| let mut rng = rand::thread_rng(); | ||
| loop { | ||
| rng.fill(&mut bytes); | ||
| if bytes != [0_u8; 8] { |
There was a problem hiding this comment.
interesting guard condition - what's the story?
There was a problem hiding this comment.
The guard rejects all-zero span IDs because the OpenTelemetry spec reserves 0x0000000000000000 as "invalid / no span". If we emitted a zero span ID, backends (Tempo, Jaeger, etc.) would treat the span as nonexistent and drop it.
The probability of rng.gen::() == 0 is 1/2^64 (~5.4 × 10^-20), so the loop virtually never repeats — but the spec requires it.
| // Cache data is non-authoritative and can be rebuilt from chain state, | ||
| // so trade durability for write throughput by skipping fsync operations. | ||
| // On crash/corruption: delete the cache DB directory; it will be rebuilt | ||
| // on next startup from authoritative chain data. |
There was a problem hiding this comment.
nit: I don't think you need to delete the directory - I believe this sync mode will safely lose data without compromising the data that already exists (see the comment for SafeNoSync)
There was a problem hiding this comment.
have changed the comment
| pub span: tracing::Span, | ||
| } | ||
|
|
||
| impl<T> Traced<T> { |
There was a problem hiding this comment.
nit: maybe add an Into for all T into Traced? so we can just do .into()
| } | ||
|
|
||
| /// Destructure into (inner, span). | ||
| pub fn into_parts(self) -> (T, tracing::Span) { |
There was a problem hiding this comment.
this can also be am Into impl
There was a problem hiding this comment.
we can also have a into_message fn that takes the span, activates it in the current thread, and then passes back T to the caller (this should work as spans are built on top of TLS)
| match cmd { | ||
| Some(cmd) => { | ||
| Some(traced) => { | ||
| let (cmd, _parent_span) = traced.into_parts(); |
There was a problem hiding this comment.
is not using the span intentional?
| let process_remaining = async { | ||
| while let Ok(msg) = self.msg_rx.try_recv() { | ||
| while let Ok(traced) = self.msg_rx.try_recv() { | ||
| let (msg, _parent_span) = traced.into_parts(); |
There was a problem hiding this comment.
another parent span not being used
| fn elapsed_ms(&self) -> f64; | ||
| } | ||
|
|
||
| impl ElapsedMs for std::time::Instant { |
JesseTheRobot
left a comment
There was a problem hiding this comment.
LGTM! just some nonblocking feedback
Describe the changes
Before:
Each node was traced independently with no correlation across node boundaries. Gossip HTTP requests carried no trace context, making it impossible to follow a request from ingestion through validation. JSON structured logging was commented out and unavailable.
After:
Implements trace context propagation on all gossip HTTP requests, explicit
RequestId(UUID v7) onGossipRequestV2messages,tracing::Spanpropagation through all inter-service message enums, and runtime-selectable JSON structured logging viaIRYS_LOG_FORMAT=json.Changes
Cross-Node Trace Propagation (
crates/p2p)traceparent/tracestateheaders on all outbound gossip HTTP requests viainject_trace_context()using OpenTelemetry'sTextMapPropagatorTraceContextPropagatorglobally intelemetry.rsHeaderInjectoradapter forreqwest::header::HeaderMapget_info,request_peer_list,handshake_v1/v2,request_block_index,get_protocol_version,health_check,send_preserialized,send_data_internalRequest ID (
crates/types)RequestIdtype —UUID v7Option<RequestId>field toGossipRequestV2with#[serde(default)]for backward compatibility with older peersRequestIdwhen not present, then create named spans (gossip.handle_chunk,gossip.handle_block_header, etc.) with the ID as a span fieldIntra-Node Span Propagation (
crates/actors)span: tracing::Spanandrequest_id: Option<RequestId>fields to:BlockDiscoveryMessage::BlockDiscoveredBlockTreeServiceMessage::BlockPreValidatedBlockTreeServiceMessage::BlockValidationFinishedValidationServiceMessage::ValidateBlockMempoolServiceMessage::IngestDataTxFromApi/GossipMempoolServiceMessage::IngestCommitmentTxFromApi/Gossipparent: &parent_spanlinking the full request lifecycleBlockValidationTaskcarriesparent_spanandrequest_idthrough VDF scheduling and concurrent validationJSON Structured Logging (
crates/chain,crates/utils)IRYS_LOG_FORMAT=jsonruntime toggle in bothinit_tracing()(non-telemetry path) andsetup_tracing_subscriber()(telemetry path)with_current_span(true)andwith_span_list(true)for full span context in each log lineDesign Decision: Why
RequestIdInstead of Tracing Span IDsThis PR uses both an explicit
RequestIdandtracing::Spanpropagation. They serve different purposes and are not interchangeable. Tracing span IDs have properties that make them unsuitable as cross-node request correlation keys.IdlacksSerialise)Why not reuse span IDs as request IDs?
sharded_slab::Poolindices (small sequential integers starting at 1). Slots are reused after spans are dropped. Two nodes produce overlapping ID spaces, and the same node reuses IDs over time.tracing_core::span::IdderivesClone, Debug, PartialEq, Eq, Hashonly — noSerialise/Deserialise/Display. The inneru64is accessible viainto_u64()but is a raw slab index with no semantic meaning outside the originating subscriber.tracing-subscriberJSON formatter'sSerializableSpanserialises span name and fields but not the span ID. There is no built-in way to grep for a span ID in log output.RequestIduses UUID v7 with a 48-bit millisecond timestamp, making IDs naturally time-ordered and sortable.tokio::spawndoes not propagate it automatically (requires.instrument()), and channel sends do not carry it.Why not derive
RequestIdfrom the span ID (e.g. embed it in the constructor)?RequestIdat the same millisecond with the same slab index would collide.RequestIdmay have been dropped and its slab slot reused by the time someone reads the ID on a remote node. Extracting the span ID from theRequestIdwould point at the wrong (or no) span.RequestIdis recorded as a span field (request.id = %request_id), so every span already carries itsRequestId. You can search from span → RequestId (via the field) and from RequestId → spans (via grep/trace search). There is no additional benefit to embedding the span ID in theRequestId.Related Issue(s)
Please link to the issue(s) that will be closed with this PR.
Checklist
Additional Context
Add any other context about the pull request here.
Summary by CodeRabbit
New Features
Improvements
Other