Conversation

@Cyb3rWard0g
Collaborator

This PR introduces a complete observability solution for Dapr Agents, enabling distributed tracing and monitoring through OpenTelemetry with Phoenix UI support. The module provides automatic instrumentation for agents, tools, LLM calls, and workflow executions while maintaining W3C Trace Context standards for distributed tracing across Dapr boundaries. All observability features are optional dependencies that gracefully degrade when not installed.

Key Changes

New Observability Module (dapr_agents/observability/)

  • Automatic Instrumentation: Zero-code tracing for agents, tools, LLM interactions, and workflows
  • Optional Dependencies: Clean fallback behavior when observability packages aren't installed
  • W3C Trace Context: Standards-compliant context propagation across Dapr Workflow boundaries
  • Phoenix UI Integration: Rich visualization and analysis through OpenInference semantic conventions

Core Components

Instrumentor (DaprAgentsInstrumentor)

  • Main entry point for enabling observability (see the usage sketch after this list)
  • Automatic discovery and wrapping of key components
  • Configurable span processors and exporters
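
As an illustration of the entry point described above, here is a minimal setup sketch. It assumes the instrumentor follows the standard OpenTelemetry instrument() pattern and that a Phoenix collector is listening on its default OTLP HTTP endpoint; the exact constructor arguments may differ from the module in this PR.

# Minimal sketch, assuming a BaseInstrumentor-style API and a local Phoenix endpoint.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

from dapr_agents.observability import DaprAgentsInstrumentor

# Standard OTel pipeline pointing at Phoenix (endpoint assumed; adjust as needed).
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
trace.set_tracer_provider(provider)

# Enable automatic instrumentation for agents, tools, LLM calls, and workflows.
DaprAgentsInstrumentor().instrument(tracer_provider=provider)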

Wrapper Classes

  • AgentWrapper: Traces agent conversations and reasoning flows
  • LLMWrapper: Captures LLM calls with token usage and message processing
  • ToolWrapper: Monitors tool executions with input/output tracking
  • WorkflowWrapper: Traces workflow orchestration and task execution
  • WorkflowTaskWrapper: Detailed task-level tracing within workflows
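
These wrappers share a common pattern: the target method is monkey-patched at instrumentation time so each call runs inside an OpenTelemetry span. A rough, generic sketch of that pattern is shown below; it is not this PR's actual wrapper code, and the wrapped target is hypothetical.

# Generic wrapping sketch using wrapt; not the PR's actual wrapper implementation.
from opentelemetry import trace
from wrapt import wrap_function_wrapper

tracer = trace.get_tracer("dapr_agents.observability")

def _traced_call(wrapped, instance, args, kwargs):
    # Run the wrapped method inside a span and record failures on it.
    with tracer.start_as_current_span(f"{type(instance).__name__}.run") as span:
        try:
            result = wrapped(*args, **kwargs)
            span.set_attribute("output.value", str(result))
            return result
        except Exception as exc:
            span.record_exception(exc)
            raise

# Hypothetical target module/method; the real instrumentor wraps the agent, LLM,
# tool, and workflow entry points in a similar way.
# wrap_function_wrapper("dapr_agents.agent", "Agent.run", _traced_call)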

Context Propagation (context_propagation.py)

  • W3C Trace Context format support for Dapr serialization
  • extract_otel_context() and restore_otel_context() utilities (mechanism sketched after this list)
  • Proper parent-child span relationships across workflow boundaries
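
The sketch below shows the underlying W3C mechanism these utilities rely on, using the standard OTel propagator to round-trip the active span context through a plain dict that Dapr can serialize. The actual helpers in context_propagation.py may differ in signature and detail.

# Sketch of W3C Trace Context round-tripping with the standard OTel propagator.
from typing import Dict
from opentelemetry import context, trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

_propagator = TraceContextTextMapPropagator()

def extract_otel_context() -> Dict[str, str]:
    # Serialize the active span context to W3C headers (traceparent/tracestate).
    carrier: Dict[str, str] = {}
    _propagator.inject(carrier)
    return carrier  # JSON-serializable, so Dapr can carry it across workflow boundaries

def restore_otel_context(carrier: Dict[str, str]) -> context.Context:
    # Rebuild an OTel context from the serialized headers on the receiving side.
    return _propagator.extract(carrier)

def start_child_span(name: str, carrier: Dict[str, str]):
    # Start a span whose parent is the remote span described by the carrier.
    tracer = trace.get_tracer(__name__)
    return tracer.start_as_current_span(name, context=restore_otel_context(carrier))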

Message Processing (message_processors.py)

  • Converts various message formats to the OpenInference standard (see the sketch after this list)
  • Tool schema extraction and serialization
  • Token usage tracking and LLM response processing
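
As a simplified illustration of this conversion, OpenInference represents chat history as flattened, indexed span attributes. The attribute keys below follow the public OpenInference conventions, while the PR's actual helpers handle more message formats.

# Simplified sketch of flattening chat messages into OpenInference-style attributes.
from typing import Any, Dict, List

def messages_to_input_attributes(messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    attributes: Dict[str, Any] = {}
    for i, message in enumerate(messages):
        prefix = f"llm.input_messages.{i}.message"
        attributes[f"{prefix}.role"] = message.get("role", "")
        attributes[f"{prefix}.content"] = message.get("content", "")
    return attributes

print(messages_to_input_attributes([
    {"role": "system", "content": "You are a weather assistant."},
    {"role": "user", "content": "What is the weather in Boston?"},
]))
# {'llm.input_messages.0.message.role': 'system', 'llm.input_messages.0.message.content': ..., ...}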

Constants and Utilities

  • OpenInference semantic conventions with fallback values
  • Availability detection for optional dependencies
  • Safe JSON serialization with error handling (sketched after this list)
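
Span attribute values must be primitives, so complex objects are converted defensively and truncated. A minimal sketch of what the safe serialization can look like (hypothetical helper name, not the PR's exact code):

# Hypothetical sketch of defensive JSON serialization for span attributes.
import json
from typing import Any

def safe_json_dumps(value: Any, max_length: int = 4096) -> str:
    # Never raise on odd inputs; fall back to str() for non-JSON-serializable types.
    try:
        serialized = json.dumps(value, default=str)
    except (TypeError, ValueError):
        serialized = str(value)
    # Keep attribute payloads bounded so exporters do not choke on huge blobs.
    return serialized[:max_length]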

Documentation Updates

  • Added observability section to quickstart README
  • Docker Compose setup for Phoenix with PostgreSQL
  • Step-by-step instrumentation guide
  • Troubleshooting and best practices

Features

Distributed Tracing

  • Complete trace hierarchy from agent conversations to individual tool calls
  • Context propagation across async boundaries and Dapr workflows
  • Correlation IDs for grouping related operations

Performance Monitoring

  • Response times for all operations
  • Token usage and cost tracking for LLM calls
  • Error rates and failure analysis
  • Tool execution performance metrics

Rich Visualization

  • Phoenix UI compatibility with proper span relationships
  • Message flow visualization with input/output content
  • Tool schema display and parameter tracking
  • Workflow execution timelines

Graceful Degradation

  • Zero impact when observability packages not installed
  • Automatic fallback to no-op implementations
  • Clear error messages with installation guidance

Technical Implementation

W3C Trace Context Support

  • Proper traceparent and tracestate header handling (format sketched after this list)
  • Compatible with Dapr's serialization mechanisms
  • Maintains trace continuity across workflow restarts
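
For reference, a traceparent header encodes a version, a 128-bit trace ID, the parent span ID, and trace flags. A small sketch of formatting one from the active span using the standard OTel API (not code from this PR):

# Sketch: format a W3C traceparent header from the active span (standard OTel API).
from opentelemetry import trace

def current_traceparent() -> str:
    ctx = trace.get_current_span().get_span_context()
    # version "00" - 32-hex-digit trace id - 16-hex-digit span id - 2-hex-digit flags
    return f"00-{ctx.trace_id:032x}-{ctx.span_id:016x}-{ctx.trace_flags:02x}"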

OpenInference Compliance

  • Standard semantic conventions for AI/ML observability
  • Proper message formatting for Phoenix UI
  • Tool call tracking with function schemas

Optional Dependency Pattern

  • Clean import handling with try/except blocks (sketched after this list)
  • Availability flags throughout the codebase
  • Helpful error messages guiding users to install extras
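
A minimal sketch of the pattern (the OPENINFERENCE_AVAILABLE flag appears in the module; the helper function and the extra name below are assumptions for illustration):

# Illustrative optional-dependency pattern; error wording and extra name are assumed.
try:
    from openinference.semconv.trace import SpanAttributes  # optional extra
    OPENINFERENCE_AVAILABLE = True
except ImportError:
    SpanAttributes = None  # type: ignore[assignment]
    OPENINFERENCE_AVAILABLE = False

def require_observability() -> None:
    # Give users a clear hint when the optional packages are missing.
    if not OPENINFERENCE_AVAILABLE:
        raise ImportError(
            "Observability extras are not installed; install the observability "
            "extra for dapr-agents (extra name assumed) to enable tracing."
        )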

Breaking Changes

None. All observability features are opt-in and don't affect existing functionality.

Signed-off-by: Roberto Rodriguez <9653181+Cyb3rWard0g@users.noreply.github.com>
@Cyb3rWard0g Cyb3rWard0g requested a review from yaron2 as a code owner July 30, 2025 03:20
@Cyb3rWard0g Cyb3rWard0g requested review from sicoyle and yaron2 and removed request for yaron2 July 30, 2025 03:20
Signed-off-by: Roberto Rodriguez <9653181+Cyb3rWard0g@users.noreply.github.com>
Collaborator
@sicoyle sicoyle left a comment

A few comments so far. Overall looking good and a super neat feature add! I'd say I don't think we have to be as verbose in the code comments since the code is pretty clear, and in the logs I think we can trim down on some, and I'm not sure we want emojis in the logs. The Dapr runtime had comments recently saying to avoid emojis in printouts, so maybe we should follow suit there?



# Global instance for workflow context storage across the application
_context_storage = WorkflowContextStorage()
Collaborator

Do we really want this global to the entire application? An application can have multiple agents, so one agent could potentially access another agent's storage here, right? I'm not sure we want that. Could the workflow IDs used as keys maybe be namespaced by agent name to ensure that one agent cannot access another agent's storage here?

Collaborator Author

Hmm, can you elaborate on this? When we run dapr run --app-id weatherapp . that is 1 agent, right? We use the instance_id / workflow_id here to aggregate logs.

Collaborator

The dapr run CLI command runs a single app (https://docs.dapr.io/reference/cli/dapr-run/), but a single app in the Dapr Agents world (the app's Python code) can technically contain multiple agents, since to some extent they are merely Python objects a user can instantiate.

Alternatively, you can have the agent with the @task(agent=custom_agent, ...) syntax. Would we get OTel metrics on the agent with the task syntax?
For example, something like this:

# Imports assumed from the dapr-agents quickstarts; adjust paths to your project.
from dapr.ext.workflow import DaprWorkflowContext
from dapr_agents import Agent
from dapr_agents.workflow import task, workflow

# Define simple agents
extractor = Agent(
    name="DestinationExtractor",
    role="Extract destination",
    instructions=["Extract the main city from the user query"]
)

planner = Agent(
    name="PlannerAgent",
    role="Outline planner",
    instructions=["Generate a 3-day outline for the destination"]
)

expander = Agent(
    name="ItineraryAgent",
    role="Itinerary expander",
    instructions=["Expand the outline into a detailed plan"]
)

# Workflow tasks
@task(agent=extractor)
def extract(user_msg: str) -> str:
    pass

@task(agent=planner)
def plan(destination: str) -> str:
    pass

@task(agent=expander)
def expand(outline: str) -> str:
    pass

# Orchestration
@workflow(name="chained_planner_workflow")
def chained_planner_workflow(ctx: DaprWorkflowContext, user_msg: str):
    dest = yield ctx.call_activity(extract, input=user_msg)
    outline = yield ctx.call_activity(plan, input=dest)
    itinerary = yield ctx.call_activity(expand, input=outline)
    return itinerary

So yeah, in a case such as this I'm not sure how the context sharing would be safe across all agents, since they'd be within the same app ID.

Collaborator Author

OMG! I ran your example (I had to do a minor fix to the agent-as-a-task part, which was not related to observability), and it worked! I loved that it kept everything under one trace starting from the "Workflow". Look:

(screenshots omitted)

Collaborator Author

The Observability module was able to trace and connect each action, with the WorkflowApp as the root node. Three tasks were identified and traced as workflow tasks. Each task, in turn, spawned an AI agent; each agent went through 1 iteration / 1 loop, and each loop requested a chat completion. 🔥 Adding @yaron2 ;)

Collaborator Author

The workflow executed successfully too, based on the app logs ;)

}

if model:
    attributes[LLM_MODEL_NAME] = model
Collaborator

What if a user is running an app with, say, 3 agents within it, each using a different model? How will this work if it is a single global const?

Collaborator Author

That is a good question. I have not tested it with multiple agents and different models.

Collaborator Author

I tested this scenario too :) and, same as before, everything is captured the right way:

(screenshots omitted)

Collaborator Author

😉 @yaron2

    Returns:
        Dict[str, Any]: Span attributes for input messages
    """
    if OPENINFERENCE_AVAILABLE:
Collaborator

I think this would be cleaner if, instead of setting a global var, we used a field on the agent indicating whether this is enabled, or just a bool on one of the classes. Would that be doable?

Collaborator Author

Hmm, I think a global makes more sense, tbh, since we are defining wrappers for all methods used by Agents, Durable Agents, and any agentic workflows. Regarding your previous comments about potential issues with multiple agents under the same app and multi-model tasks: everything works as expected.

try:
    # Use AgentTool's built-in function call format
    if hasattr(tool, "to_function_call"):
        function_call = tool.to_function_call(format_type="openai")
Collaborator

Is it the case that every LLM provider we have abides by the OpenAI format?

Collaborator Author

Good question. I need to do more research on that.

Comment on lines +82 to +102
def strip_method_args(arguments: Mapping[str, Any]) -> Dict[str, Any]:
    """
    Remove self/cls arguments from method parameters.

    Filters out 'self' and 'cls' parameters from bound arguments to avoid
    including instance/class references in span attributes, following the
    SmolAgents pattern for cleaner tracing data.

    Args:
        arguments: Dictionary of bound method arguments

    Returns:
        Dict[str, Any]: Filtered arguments without self/cls

    Example:
        >>> strip_method_args({'self': obj, 'param': 'value', 'cls': MyClass})
        {'param': 'value'}
    """
    return {
        key: value for key, value in arguments.items() if key not in ("self", "cls")
    }
Collaborator

Is the SmolAgents pattern the ideal pattern for us to follow? Why?

Collaborator Author

Compared to the other OpenInference packages for other frameworks, SmolAgents is easy to follow to learn how instrumentation/observability is enabled.

"""
# Check for instrumentation suppression
if context_api and context_api.get_value(
context_api._SUPPRESS_INSTRUMENTATION_KEY
Collaborator

What does this do?

Collaborator Author

It checks the suppress_instrumentation key in the OpenTelemetry context:

_SUPPRESS_INSTRUMENTATION_KEY = create_key("suppress_instrumentation")

which is then checked somewhere else, alongside another key:

def _instrumented_requests_call(
        method: str, url: str, call_wrapped, get_or_create_headers
    ):
        if context.get_value("suppress_instrumentation") or context.get_value(
            _SUPPRESS_REQUESTS_INSTRUMENTATION_KEY
        ):
            return call_wrapped()

And that key, _SUPPRESS_REQUESTS_INSTRUMENTATION_KEY, is:

# A key to a context variable to avoid creating duplicate spans when instrumenting
# both, Session.request and Session.send, since Session.request calls into Session.send
_SUPPRESS_REQUESTS_INSTRUMENTATION_KEY = "suppress_requests_instrumentation"

Some context from OpenTelemetry docs: https://opentelemetry-python-kinvolk.readthedocs.io/en/latest/_modules/opentelemetry/instrumentation/requests.html and where it is set: https://github.com/pexip/os-python-opentelemetry-api/blob/bad159831b8ba321068a4a6b06c282c8737b94a4/src/opentelemetry/context/__init__.py#L171

Cyb3rWard0g and others added 8 commits August 1, 2025 02:35
Signed-off-by: Roberto Rodriguez <9653181+Cyb3rWard0g@users.noreply.github.com>
…quest if not set

Signed-off-by: Roberto Rodriguez <9653181+Cyb3rWard0g@users.noreply.github.com>
Member
@yaron2 yaron2 left a comment

LGTM

@yaron2 yaron2 merged commit 963cef6 into main Aug 1, 2025
6 checks passed