
TKAI-2: add session tracing propagation #1

Open
figitaki wants to merge 5 commits into main from carey/tkai-2-session-tracing



@figitaki figitaki commented May 6, 2026

Summary

  • add a dependency-free OTLP/HTTP JSON tracer shared by worker and runner
  • propagate W3C traceparent through sandbox env, DO→runner protocol messages, and runner→OpenCode HTTP calls
  • instrument worker lifecycle/dispatch spans plus runner bootstrap, repo setup, turn, workflow, tool, and LLM usage spans
  • document TKAI-2 implementation coverage in the session tracing spec
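
The `traceparent` value propagated in the second bullet follows the W3C Trace Context layout `version-traceid-parentid-flags`. A minimal sketch of parsing and re-serializing it is below. The helper names `parseTraceparent` and `formatTraceparent` are hypothetical and are not taken from this PR's tracer, and the sketch accepts only the four-field layout.

```typescript
// Hypothetical helpers for illustration; not the PR's actual implementation.
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters (the parent span)
  sampled: boolean; // bit 0 of the trace-flags byte
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version "ff" is reserved as invalid, and all-zero ids are invalid.
  if (version === 'ff') return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  // Always emit version 00, the only version defined at the time of writing.
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}
```

A receiver would typically keep `traceId`, treat `spanId` as the parent of any span it starts, and generate a fresh span id before forwarding the header to the next hop.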

Test plan

  • Unit tests
  • Smoke test
  • Deploy

These environment variables need to be set:

OTEL_EXPORTER_OTLP_ENDPOINT=<grafana otlp endpoint>
OTEL_EXPORTER_OTLP_HEADERS=<auth headers>
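
For reference, the OpenTelemetry exporter convention is that `OTEL_EXPORTER_OTLP_HEADERS` holds comma-separated `key=value` pairs, usually with URL-encoded values. A minimal sketch of turning that string into a header map follows. The function name `parseOtlpHeaders` is hypothetical, and how this PR's tracer actually consumes the variable is not shown in this thread.

```typescript
// Hypothetical helper; assumes the "k1=v1,k2=v2" convention for the env var.
function parseOtlpHeaders(raw: string | undefined): Record<string, string> {
  const headers: Record<string, string> = {};
  if (!raw) return headers;
  for (const pair of raw.split(',')) {
    // Split on the first "=" only, so values such as base64 padding survive.
    const eq = pair.indexOf('=');
    if (eq <= 0) continue;
    const key = pair.slice(0, eq).trim();
    const value = pair.slice(eq + 1).trim();
    if (!key) continue;
    try {
      headers[key] = decodeURIComponent(value);
    } catch {
      // Malformed percent-encoding: keep the raw value rather than failing.
      headers[key] = value;
    }
  }
  return headers;
}
```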

Verification

  • pnpm --filter @valet/shared typecheck
  • pnpm --filter @valet/runner typecheck
  • pnpm --filter @valet/worker typecheck
  • pnpm --filter @valet/runner test -- src/prompt.test.ts
  • pnpm --filter @valet/worker exec vitest run src/durable-objects/prompt-queue.test.ts src/durable-objects/runner-link.test.ts

Refs TKAI-2

@figitaki figitaki requested review from benturnkey and f3nry May 7, 2026 02:55
this.tracer = new SimpleTracer({
serviceName: 'valet-worker',
endpoint: this.env.OTEL_EXPORTER_OTLP_ENDPOINT,
headers: this.env.OTEL_EXPORTER_OTLP_HEADERS,

I'm not sure why we need these?


These are the OTLP endpoint and auth headers that receive the telemetry, since there's no intermediate Grafana agent until this is running in k8s.


@f3nry f3nry left a comment


Flush batching — see inline comment.


private startPromptDispatchSpan(attrs: SpanAttributes): SimpleSpan {
const queuedAt = this.promptQueue.promptReceivedAt;
const waitMs = queuedAt > 0 ? Date.now() - queuedAt : 0;

Every call site does span.end() then this.flushTracing(), which fires a separate ctx.waitUntil(tracer.flush()) — and each flush() POSTs all accumulated spans to the OTLP endpoint. During a session lifecycle (spawn → dispatch → dispatch → hibernate → restore → dispatch…) this generates a lot of small HTTP requests to Grafana Cloud.

Consider batching: either flush on a timer/threshold inside SimpleTracer (e.g. every 5s or 50 spans), or consolidate flush calls to session-level boundaries rather than per-span. The Runner side already naturally batches because tool/LLM spans accumulate and only flush at the turn finally block — the Worker should do the same.

figitaki added 3 commits May 8, 2026 11:50
…-flush

msToUnixNano multiplied a JS number by 1_000_000, exceeding MAX_SAFE_INTEGER
for present-day epoch milliseconds and silently rounding span timestamps.
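
To illustrate the overflow: present-day epoch milliseconds are on the order of 1.7e12, so multiplying by 1_000_000 gives roughly 1.7e18, well above `Number.MAX_SAFE_INTEGER` (about 9.0e15). One common way to avoid the rounding is to do the multiplication in `BigInt` and emit a decimal string, which is how 64-bit integers are represented in the proto3 JSON mapping that OTLP/JSON uses. The sketch below assumes that approach; the fix actually applied in this commit is not shown in the thread.

```typescript
// A sketch of a precision-safe conversion; the string return type is an assumption.
function msToUnixNano(ms: number): string {
  // BigInt() rejects non-integers, so split off fractional milliseconds first.
  const wholeMs = Math.trunc(ms);
  const fracNanos = Math.round((ms - wholeMs) * 1000000);
  const nanos = BigInt(wholeMs) * BigInt(1000000) + BigInt(fracNanos);
  return nanos.toString();
}
```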

Adds maxQueuedSpans + scheduleFlush options so a hot tracer can auto-flush
in batches once a buffer threshold is hit, with a host hook for ctx.waitUntil.
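
A minimal sketch of the threshold-based auto-flush described above. Only the option names `maxQueuedSpans` and `scheduleFlush` come from the commit message. The class name `BatchingTracer`, the `QueuedSpan` shape, and the `exportSpans` callback are hypothetical stand-ins, since `SimpleTracer` itself is not shown in this thread.

```typescript
// Hypothetical shapes for illustration only.
interface QueuedSpan {
  name: string;
  startMs: number;
  endMs: number;
}

interface BatchingOptions {
  maxQueuedSpans: number;
  // Host hook, for example (p) => ctx.waitUntil(p) in a Worker or Durable Object.
  scheduleFlush?: (flush: Promise<void>) => void;
  // Exporter stand-in; a real one would POST the batch to the OTLP endpoint.
  exportSpans: (spans: QueuedSpan[]) => Promise<void>;
}

class BatchingTracer {
  private queue: QueuedSpan[] = [];
  private readonly opts: BatchingOptions;

  constructor(opts: BatchingOptions) {
    this.opts = opts;
  }

  record(span: QueuedSpan): void {
    this.queue.push(span);
    if (this.queue.length >= this.opts.maxQueuedSpans) {
      const pending = this.flush();
      // Hand the promise to the host so the runtime keeps the request alive.
      if (this.opts.scheduleFlush) this.opts.scheduleFlush(pending);
    }
  }

  async flush(): Promise<void> {
    if (this.queue.length === 0) return;
    // Swap the buffer before awaiting so spans recorded during the export
    // land in the next batch instead of being sent twice.
    const batch = this.queue;
    this.queue = [];
    try {
      await this.opts.exportSpans(batch);
    } catch {
      // Telemetry is best-effort in this sketch: a failed batch is dropped.
    }
  }
}
```

Callers would still invoke `flush()` explicitly at boundaries where the host may go idle before the threshold is reached.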

Addresses review feedback on #1 and yourbuddyconner#45.

The prompt handler signature had grown to 14 positional args (messageId,
content, model, author, modelPreferences, attachments, channelType,
channelId, opencodeSessionId, continuationContext, threadId,
replyChannelType, replyChannelId, traceparent), making call sites and
the wire shape unreviewable.

Introduces PromptDispatch / PromptHandlerFn so onPrompt and handlePrompt
take a single typed object. Updates the agent-client dispatcher,
PromptHandler.handlePrompt, the bin.ts callback, and the prompt unit
tests to use the new shape.
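
A rough sketch of what the single-object shape might look like. The names `PromptDispatch` and `PromptHandlerFn` and the field names come from the commit message above; every type and optionality marker below is a guess, and the handler body is purely illustrative.

```typescript
// Field names are from the commit message; all types here are assumptions.
interface PromptDispatch {
  messageId: string;
  content: string;
  model?: string;
  author?: unknown;
  modelPreferences?: unknown;
  attachments?: unknown[];
  channelType?: string;
  channelId?: string;
  opencodeSessionId?: string;
  continuationContext?: unknown;
  threadId?: string;
  replyChannelType?: string;
  replyChannelId?: string;
  traceparent?: string;
}

type PromptHandlerFn = (dispatch: PromptDispatch) => Promise<void>;

// Illustrative handler: records which message arrived and whether it carried trace context.
const handled: string[] = [];
const onPrompt: PromptHandlerFn = async (dispatch) => {
  handled.push(`${dispatch.messageId}:${dispatch.traceparent ? 'traced' : 'untraced'}`);
};

// Call sites name each field, so omitted optional values no longer need
// positional placeholders.
void onPrompt({
  messageId: 'msg-1',
  content: 'hello',
  traceparent: '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
});
```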

Addresses figitaki review on yourbuddyconner#45.

Previously every span ended in a session-agent dispatch path called
flushTracing() inline, fanning out to a separate ctx.waitUntil(flush())
per span and POSTing the buffer to OTLP each time. Across spawn →
dispatch → dispatch → hibernate → restore lifecycles this generated
many small HTTP requests against Grafana Cloud.

Configures SimpleTracer with maxQueuedSpans=50 and a scheduleFlush hook
that registers auto-flush promises with ctx.waitUntil. Drops the
per-span flushTracing() calls at dispatch sites; explicit flushes remain
at session-level boundaries (hibernate, terminate, wake guard, child
session error finally blocks) where the DO may go idle before the
threshold is hit.

Addresses f3nry review on #1.
@figitaki figitaki requested a review from f3nry May 9, 2026 00:40