[RFC] ML-Commons Agent Tracing & Observability #3971

### Summary
This RFC follows up on feature request #3970.

As AI agent systems become more prevalent in OpenSearch, there is an increasing need for robust tracing and observability capabilities to provide insights into agent behavior and performance. The current tracing implementation using memory in ML-Commons is limited in its ability to capture the complexity of agent workflows, especially for non-linear executions and multi-agent interactions. This RFC proposes a new approach leveraging OpenTelemetry to provide comprehensive tracing and observability for ML-Commons agents.

### Design Tenets

- Provide fine-grained tracing of agent operations, including planning, execution, and tool usage.
- Ensure that agent tasks start true root spans for accurate trace trees.
- Attach meaningful attributes (e.g., agent/task/tool/model details) to each span for better searchability and diagnostics.
- Integrate tracing with minimal performance impact and without disrupting existing agent logic.
- Allow future extension to other agent types, tools, and telemetry backends.
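
To make the attribute tenet concrete, here is a minimal sketch of how span attributes might be assembled before a span is started. The class name `SpanAttributesSketch`, the `build` signature, and the attribute keys (`ml.agent.name` and so on) are all hypothetical; this RFC does not fix a naming scheme.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SpanAttributesSketch {
    // Hypothetical attribute keys, chosen only for illustration.
    static final String AGENT_NAME = "ml.agent.name";
    static final String TASK = "ml.agent.task";
    static final String TOOL_NAME = "ml.tool.name";
    static final String MODEL_ID = "ml.model.id";

    // Builds an attribute map in a stable order, skipping null or blank values
    // so that a span only carries attributes known at creation time.
    static Map<String, String> build(String agentName, String task, String toolName, String modelId) {
        Map<String, String> attrs = new LinkedHashMap<>();
        putIfPresent(attrs, AGENT_NAME, agentName);
        putIfPresent(attrs, TASK, task);
        putIfPresent(attrs, TOOL_NAME, toolName);
        putIfPresent(attrs, MODEL_ID, modelId);
        return attrs;
    }

    private static void putIfPresent(Map<String, String> attrs, String key, String value) {
        if (value != null && !value.isBlank()) {
            attrs.put(key, value);
        }
    }
}
```

Skipping unknown values keeps the indexed spans free of empty fields, which should make attribute-based search less noisy.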

### Design
I propose introducing a new tracing layer for ML agents, centered around the MLAgentTracer and AbstractMLTracer classes. All agent runners (starting with MLPlanExecuteAndReflectAgentRunner) will use this tracer to create, manage, and end spans for key operations.

<img width="487" height="395" alt="Image" src="https://github.com/user-attachments/assets/a6033fc5-b34e-43a2-82da-bf120f38fffc" />

Class Structure:
1. AbstractMLTracer
    * Base class for ML-specific tracers
    * Wraps OpenSearch's Tracer and provides utility methods for span creation and ending
2. MLAgentTracer
    * Singleton, initialized at plugin startup, extending AbstractMLTracer
    * Provides methods to start agent spans with explicit parent/child relationships
3. MLPlanExecuteAndReflectAgentRunner
    * Uses MLAgentTracer for all major operations (agent task, planning, execution steps, tool invocations, state transitions)
    * Adds rich attributes to each span (agent name, task, tool, model, etc.)
    * Maintains correct parent/child relationships between spans for trace hierarchy
4. AgentUtils
    * Provides utility methods for attribute construction and span creation
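
As a rough illustration of how the classes listed above might relate, the following sketch uses plain Java with a hypothetical `SketchSpan` stand-in instead of the actual OpenSearch `Tracer` and `Span` types. The class names carry a `Sketch` suffix, and every method signature shown is an assumption, not the proposed API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a telemetry span; not the OpenSearch or OpenTelemetry type.
final class SketchSpan {
    final String name;
    final SketchSpan parent;                 // null marks a root span
    final Map<String, String> attributes;
    private boolean ended;

    SketchSpan(String name, SketchSpan parent, Map<String, String> attributes) {
        this.name = name;
        this.parent = parent;
        this.attributes = Collections.unmodifiableMap(new LinkedHashMap<>(attributes));
    }

    void end() { ended = true; }
    boolean isEnded() { return ended; }
}

// Base class: shared span creation and ending utilities for ML-specific tracers.
abstract class AbstractMLTracerSketch {
    private final List<SketchSpan> started = new ArrayList<>();

    protected SketchSpan startSpan(String name, Map<String, String> attributes, SketchSpan parent) {
        SketchSpan span = new SketchSpan(name, parent, attributes);
        started.add(span);
        return span;
    }

    public void endSpan(SketchSpan span) {
        if (span == null) {
            throw new IllegalArgumentException("span must not be null");
        }
        span.end();
    }

    public List<SketchSpan> startedSpans() {
        return Collections.unmodifiableList(started);
    }
}

// Singleton agent tracer, initialized once (for example at plugin startup).
final class MLAgentTracerSketch extends AbstractMLTracerSketch {
    private static MLAgentTracerSketch instance;

    private MLAgentTracerSketch() {}

    static synchronized void initialize() {
        if (instance == null) {
            instance = new MLAgentTracerSketch();
        }
    }

    static synchronized MLAgentTracerSketch getInstance() {
        if (instance == null) {
            throw new IllegalStateException("tracer not initialized");
        }
        return instance;
    }

    // A null parent starts a root span; a non-null parent makes the link explicit.
    SketchSpan startAgentSpan(String name, Map<String, String> attributes, SketchSpan parent) {
        return startSpan(name, attributes, parent);
    }
}

final class TracerDemo {
    public static void main(String[] args) {
        MLAgentTracerSketch.initialize();
        MLAgentTracerSketch tracer = MLAgentTracerSketch.getInstance();
        SketchSpan task = tracer.startAgentSpan("agent.task", Map.of("agent.name", "demo"), null);
        SketchSpan plan = tracer.startAgentSpan("agent.plan", Map.of(), task);
        tracer.endSpan(plan);
        tracer.endSpan(task);
        System.out.println(plan.name + " -> parent " + plan.parent.name);
    }
}
```

Passing the parent explicitly, instead of relying on thread-local context, is one way to keep parent/child links correct when agent steps run asynchronously on different threads.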

Integration with Agent Runners:
<img width="971" height="803" alt="Image" src="https://github.com/user-attachments/assets/f44a4e35-29c3-486f-a49e-0f397efa445c" />

Span Hierarchy:
<img width="267" height="342" alt="Image" src="https://github.com/user-attachments/assets/12a49b09-c1ca-4ead-81bd-b785af1b5bf1" />
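
For readers without access to the image, the sketch below builds one plausible span tree for a single agent run and renders it as indented text. The span names (`agent.task`, `agent.plan`, `agent.execute_step`, `agent.tool_call`) and the exact nesting are assumptions based on the operations listed in this RFC, not a transcription of the diagram.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal tree node standing in for a span; the type and its names are hypothetical.
final class SpanNode {
    final String name;
    final List<SpanNode> children = new ArrayList<>();

    SpanNode(String name) {
        this.name = name;
    }

    // Adds a child span under this one and returns it for further nesting.
    SpanNode child(String childName) {
        SpanNode c = new SpanNode(childName);
        children.add(c);
        return c;
    }
}

final class SpanHierarchySketch {
    // Depth-first rendering: each level is indented two spaces under its parent.
    static String render(SpanNode root) {
        StringBuilder out = new StringBuilder();
        render(root, 0, out);
        return out.toString();
    }

    private static void render(SpanNode node, int depth, StringBuilder out) {
        out.append("  ".repeat(depth)).append(node.name).append('\n');
        for (SpanNode c : node.children) {
            render(c, depth + 1, out);
        }
    }

    // Assumed shape: the agent task is the root, with planning and execution
    // steps beneath it and tool calls nested under the step that made them.
    static SpanNode example() {
        SpanNode task = new SpanNode("agent.task");
        task.child("agent.plan");
        SpanNode step = task.child("agent.execute_step");
        step.child("agent.tool_call");
        return task;
    }

    public static void main(String[] args) {
        System.out.print(render(example()));
    }
}
```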

    
### Persistence
I will leverage OpenTelemetry's data collection capabilities and OpenSearch's existing infrastructure for trace persistence. All trace data will be collected via the OpenTelemetry SDK and exported through the OpenTelemetry Collector, which serves as the primary ingestion point for trace data. The collector will process and route traces to their designated storage destinations.

Traces will be stored in the otel-v1-apm-span-agent index within OpenSearch, allowing seamless integration with existing trace analytics features in OpenSearch Dashboards. This index will use the standard OpenTelemetry trace data model to maintain compatibility with existing tooling and visualization capabilities. In addition, it will define static mappings for pre-defined span attributes. Data Prepper will handle the processing and transformation of raw OpenTelemetry data before indexing, enabling any necessary enrichment or formatting specific to ML agent traces.

<img width="244" height="452" alt="Image" src="https://github.com/user-attachments/assets/769ab8ce-5f4b-4154-b9b3-6157676e9386" />

The persistence layer is designed with future extensibility in mind, particularly to accommodate potential integration with specialized backends for advanced trace analysis. The OpenTelemetry Collector's multi-exporter capability provides the flexibility to add support for additional storage backends as requirements evolve, without requiring changes to the core tracing implementation.

![Image](https://github.com/user-attachments/assets/ba8da7c4-7b73-4a5f-aaf9-745f3e1ad0e8)

### Visualization
Trace visualization will be implemented using OpenSearch Dashboards, leveraging and extending its existing trace analytics capabilities to provide intuitive, interactive views of ML agent executions. I will develop custom visualizations tailored to the unique needs of ML agent workflows, offering both high-level overviews and detailed drill-down capabilities.

The primary visualizations will be a timeline-based waterfall and a hierarchical tree view. Together they will offer a chronological perspective of the agent's operations, clearly showing the sequence and duration of each step as well as the parent-child relationships between spans. Users will be able to pan through the timeline and expand nested spans for more detailed inspection.
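
The layout logic behind such a waterfall can be sketched in a few lines: indent each span by its depth in the tree and draw a bar whose offset and width follow its start time and duration. The `TimedSpan` record shape, the text rendering, and the sample timings below are hypothetical and only illustrate the idea; the actual view would be built in OpenSearch Dashboards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical flat span record, roughly as it might be read back from an index.
final class TimedSpan {
    final String id;
    final String parentId;   // null for the root span
    final String name;
    final long startMs;
    final long endMs;

    TimedSpan(String id, String parentId, String name, long startMs, long endMs) {
        this.id = id;
        this.parentId = parentId;
        this.name = name;
        this.startMs = startMs;
        this.endMs = endMs;
    }
}

final class WaterfallSketch {
    // Depth is the number of ancestors; assumes parent links contain no cycles.
    static int depth(TimedSpan span, Map<String, TimedSpan> byId) {
        int d = 0;
        for (TimedSpan p = byId.get(span.parentId); p != null; p = byId.get(p.parentId)) {
            d++;
        }
        return d;
    }

    // One text row per span in chronological order: an indented label, then a bar
    // offset by the start time and sized by the duration (minimum one character).
    static List<String> render(List<TimedSpan> spans, long msPerChar) {
        Map<String, TimedSpan> byId = new HashMap<>();
        long traceStart = Long.MAX_VALUE;
        for (TimedSpan s : spans) {
            byId.put(s.id, s);
            traceStart = Math.min(traceStart, s.startMs);
        }
        List<TimedSpan> ordered = new ArrayList<>(spans);
        ordered.sort(Comparator.comparingLong((TimedSpan s) -> s.startMs));

        List<String> rows = new ArrayList<>();
        for (TimedSpan s : ordered) {
            int offset = (int) ((s.startMs - traceStart) / msPerChar);
            int width = (int) Math.max(1, (s.endMs - s.startMs) / msPerChar);
            String label = "  ".repeat(depth(s, byId)) + s.name;
            rows.add(String.format("%-20s|%s%s", label, " ".repeat(offset), "#".repeat(width)));
        }
        return rows;
    }

    // Made-up timings for a single agent run.
    static List<TimedSpan> example() {
        return List.of(
            new TimedSpan("1", null, "agent.task", 0, 100),
            new TimedSpan("2", "1", "agent.plan", 0, 30),
            new TimedSpan("3", "1", "agent.execute_step", 30, 90),
            new TimedSpan("4", "3", "agent.tool_call", 40, 80));
    }

    public static void main(String[] args) {
        render(example(), 10).forEach(System.out::println);
    }
}
```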

Complementing the timeline and tree views, I will provide a hierarchical graph view. Nodes in this graph will correspond to different span types (e.g., agent tasks, planning steps, execution steps, tool calls), with edges representing the flow and dependencies between operations. This graph view will allow users to quickly grasp the overall structure of an agent's execution, including branching decisions, loops, and parallel operations.

All of these views will be interactive, allowing users to click on individual nodes or spans to inspect detailed attributes, such as input/output data, performance metrics, and error information.

### Implementation

1. Integrate OpenTelemetry into ML-Commons
2. Implement MLAgentTracer and AbstractMLTracer
3. Modify agent runners to use the new agent tracer 
4. Implement span creation for key agent operations
5. Add rich attribute collection for spans
6. Integrate with OpenSearch's existing OpenTelemetry exporter
7. Extend OpenSearch Dashboards for ML-specific trace visualization

**Do you have any additional context?**
- OpenTelemetry is already being utilized in OpenSearch core, query-insights, and performance analyzer, providing a solid foundation for this implementation.
- Similar solutions exist in tools like LangSmith, demonstrating proven value in this approach.
- The implementation will initially focus on the Plan-Execute-Reflect agent, with potential extension to other agent types in the future.
