Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft

By AI Engineer

Key Concepts

Agent Observability: The practice of monitoring, tracing, and evaluating AI agents to ensure reliability, safety, and performance.
Non-determinism: The inherent unpredictability of AI agents, necessitating robust evaluation and monitoring frameworks.
Microsoft Foundry: A cloud-based platform for building, hosting, observing, and managing AI agents.
Trace-Linked Evaluations: A methodology where evaluation results are directly linked to specific execution traces, allowing for rapid diagnosis of failures.
Red Teaming: Proactive adversarial testing where an AI agent is attacked to identify vulnerabilities in guardrails and safety mechanisms.
Workflow Agents: A design pattern where a complex task is broken down into subtasks, each handled by a specialized sub-agent and orchestrated by a central controller.
Coding Agents (Skills): AI-driven tools (e.g., GitHub Copilot skills) that automate the development, evaluation, and optimization loop.

1. The "Mind the Gap" Framework

The presenters use the "Mind the Gap" analogy to describe the discrepancy between initial agent requirements and real-world performance. They name three areas where that gap has to be closed:

Evaluation: Checking output quality and how it changes over time.
Safety: Implementing guardrails to protect against malicious user inputs.
Monitoring: Maintaining visibility across the agent lifecycle as environments and requirements evolve.

2. The Agent Development Lifecycle

The session outlines a three-phase approach to agent development:

Build: Creating the agent using models, instructions, and tools (e.g., web search).
Optimize: Using data from evaluations to refine prompts, switch models, or adjust tool usage.
Govern/Monitor: Managing agents at scale, ensuring they remain within safety parameters.

3. Technical Methodologies

Tracing: Built on the OpenTelemetry (OTel) standard, allowing developers to track tool calls, message flows, and latency across multi-agent systems.
Evaluation Metrics:
- Quality: Coherence, fluency, and task adherence.
- Safety: Detecting prompt injections and malicious intent.
- Agentic: Intent resolution and tool-calling accuracy.
Red Teaming Strategies:
- Prohibited Actions: Defining a taxonomy of forbidden behaviors and testing if the agent can be manipulated into performing them.
- Crescendo Attacks: A sophisticated, multi-step attack strategy that gradually pushes an agent to bypass safety guardrails.
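
To make the agentic metrics above more concrete, here is a minimal sketch of how a tool-calling accuracy score could be computed. It is not the evaluator used in Microsoft Foundry. The `ToolCall` shape, the `make_call` helper, and the scoring rule (fraction of expected calls that were made, ignoring order) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation (hypothetical shape): tool name plus its arguments."""
    name: str
    arguments: tuple  # sorted (key, value) pairs, so two calls compare by value


def make_call(name: str, **arguments) -> ToolCall:
    """Build a ToolCall with arguments stored in a canonical, comparable order."""
    return ToolCall(name, tuple(sorted(arguments.items())))


def tool_call_accuracy(expected: list, actual: list) -> float:
    """Fraction of expected calls that the agent actually made (order-insensitive)."""
    if not expected:
        # Nothing was required: full marks only if the agent also called nothing.
        return 1.0 if not actual else 0.0
    remaining = list(actual)
    matched = 0
    for call in expected:
        if call in remaining:
            remaining.remove(call)  # each actual call can satisfy one expectation
            matched += 1
    return matched / len(expected)


expected_calls = [
    make_call("web_search", query="weather in Seattle"),
    make_call("calculator", expression="2+2"),
]
actual_calls = [make_call("web_search", query="weather in Seattle")]
score = tool_call_accuracy(expected_calls, actual_calls)  # one of two expected calls matched
```

A stricter variant could also penalize extra, unexpected calls or require the calls to happen in order; which rule is appropriate depends on the agent being evaluated.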

4. Tools and Resources

Microsoft Foundry Portal: A centralized UI for managing projects, deploying agents, and viewing telemetry.
GitHub Copilot "Observe" Skill: An AI-driven tool that automates the evaluation loop, generates test datasets, and suggests prompt optimizations.
Azure Monitor: Integrated for infrastructure-level telemetry, allowing developers to correlate AI performance with underlying cloud resources.
Dev Containers: Used in GitHub Codespaces to provide a pre-configured environment with all necessary dependencies, reducing setup friction.

5. Key Arguments and Insights

Human-in-the-Loop: While AI agents can automate the optimization loop (e.g., prompt tuning), human oversight is essential to determine when an agent has reached its "best version" and to prevent over-optimization or regression.
Diagnosis over Detection: The primary goal of observability is to minimize the time between detecting a failure and diagnosing its root cause. Linking evaluations to traces is the most effective way to achieve this.
Scalability: As organizations move from single agents to multi-agent systems, centralized observability becomes critical to manage fleet-wide performance and security.
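
The "Diagnosis over Detection" point above can be sketched in a few lines: if every evaluation result carries the ID of the trace it was computed from, a failed score leads straight to the spans worth inspecting. This sketch uses only the standard library. The `Span` and `EvalResult` types and the `diagnose` function are hypothetical, not the Foundry or OpenTelemetry data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One step in an agent run, e.g. a model call or a tool call."""
    name: str
    duration_ms: float
    error: Optional[str] = None


@dataclass
class EvalResult:
    """One evaluator's verdict, tagged with the trace it was computed from."""
    trace_id: str
    metric: str
    score: float
    passed: bool


def diagnose(results, traces):
    """For each failed evaluation, list the spans of its trace that errored.

    If no span errored, fall back to the slowest span as a first place to look.
    """
    findings = []
    for result in results:
        if result.passed:
            continue
        spans = traces.get(result.trace_id, [])
        errored = [s for s in spans if s.error]
        suspects = errored or sorted(spans, key=lambda s: s.duration_ms, reverse=True)[:1]
        findings.append((result.trace_id, result.metric, [s.name for s in suspects]))
    return findings


traces = {
    "t1": [Span("model_call", 800.0), Span("web_search", 120.0, error="timeout")],
    "t2": [Span("model_call", 900.0)],
}
results = [
    EvalResult("t1", "task_adherence", 0.2, passed=False),
    EvalResult("t2", "task_adherence", 0.9, passed=True),
]
findings = diagnose(results, traces)
```

Here the only failed evaluation points at trace `t1`, and within it at the `web_search` span that timed out, which is the kind of shortcut from score to root cause that the presenters describe.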

6. Notable Quotes

"Agents are non-deterministic. That's not just a problem for demos; that's also a problem for real life when you actually get to production." — Amy Boyd
"The difference between evaluation and safeguarding is that evaluation assumes your users are acting normally, while safeguarding assumes a malicious user is trying to break your solution." — Nitya Narasimhan

7. Synthesis/Conclusion

The session emphasizes that agent observability is not an "add-on" but a core requirement for production-grade AI. By leveraging Microsoft Foundry, OpenTelemetry, and AI-driven coding skills, developers can bridge the gap between prototype and production. The most actionable takeaway is the "Eval-Optimize Loop": build an agent, run batch evaluations, analyze the traces to diagnose failures, optimize the prompt or model, and repeat—ideally using automated coding agents to accelerate the process while maintaining human control.
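
One way to picture the Eval-Optimize Loop with a human gate is the following sketch. It is a simplified illustration, not the GitHub Copilot skill shown in the session. The function `eval_optimize_loop` and its callbacks `evaluate`, `propose`, and `approve` are hypothetical stand-ins for batch evaluation, an automated optimizer, and a human reviewer.

```python
from typing import Callable, Tuple


def eval_optimize_loop(
    prompt: str,
    evaluate: Callable[[str], float],
    propose: Callable[[str, float], str],
    approve: Callable[[str, float], bool],
    max_rounds: int = 5,
) -> Tuple[str, float]:
    """Repeat evaluate -> propose -> approve, keeping only approved improvements."""
    best_prompt, best_score = prompt, evaluate(prompt)
    for _ in range(max_rounds):
        candidate = propose(best_prompt, best_score)
        score = evaluate(candidate)
        # Regressions and ties are rejected automatically; improvements still
        # need sign-off from the reviewer callback before they replace the best.
        if score > best_score and approve(candidate, score):
            best_prompt, best_score = candidate, score
        else:
            break  # stop once a round fails to improve or the reviewer declines
    return best_prompt, best_score
```

In practice `evaluate` would run a batch evaluation over a test dataset, `propose` would be the coding agent suggesting a revised prompt or model, and `approve` would surface the candidate and its score to a person. The early `break` reflects the point that someone has to decide when the agent has reached its best version.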
