Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize
By AI Engineer
Key Concepts
- Evaluation (Eval): A test for AI systems to ensure they function correctly, moving beyond "vibes" (subjective feeling) to objective measurement.
- Traces & Spans: The fundamental data structure for AI observability. A trace records the entire execution of an agent, while spans are the individual building blocks (LLM calls, tool calls, agent turns).
- Code Evals: Deterministic, fast, and cheap tests (e.g., checking for JSON format, length limits, or specific keywords).
- LLM-as-a-Judge: Using a more powerful LLM to evaluate the output of a production LLM based on a specific rubric (a set of rules/criteria).
- Meta-Evaluation: The process of testing the evaluator itself to ensure it is judging correctly, often by comparing it against a "golden data set" of human-verified answers.
- Capability vs. Regression Evals: Capability evals test new functionality (climbing a hill), while regression evals ensure existing functionality remains intact.
- Impact Hierarchy: A framework for prioritizing improvements: 1. Data Quality, 2. Prompt Engineering, 3. Model Selection, 4. Hyperparameter Tuning.
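The trace and span relationship described above can be pictured as a small tree of records. The following is a minimal sketch only: the class names, field names, and the sample span names (such as `fetch_price`) are hypothetical and are not taken from the talk or from any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work inside a trace (field names are illustrative)."""
    span_id: str
    kind: str                      # e.g. "llm", "tool", "agent"
    name: str
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)


@dataclass
class Trace:
    """The full record of one agent run: a collection of spans."""
    trace_id: str
    spans: list = field(default_factory=list)

    def children_of(self, span_id: Optional[str]) -> list:
        # Spans whose parent is the given span (None returns root spans).
        return [s for s in self.spans if s.parent_id == span_id]

    def count_by_kind(self) -> dict:
        # Tally spans per kind, e.g. how many LLM calls vs. tool calls.
        counts: dict = {}
        for s in self.spans:
            counts[s.kind] = counts.get(s.kind, 0) + 1
        return counts


# A hypothetical agent turn with two LLM calls and one tool call.
trace = Trace(trace_id="t1", spans=[
    Span("s1", "agent", "research_turn"),
    Span("s2", "llm", "plan", parent_id="s1"),
    Span("s3", "tool", "fetch_price", parent_id="s1",
         attributes={"ticker": "ACME"}),
    Span("s4", "llm", "write_report", parent_id="s1"),
])
```

Reading a trace then amounts to walking from the root span down through its children, which is what trace viewers display.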
1. The Necessity of Evaluation
Laurie Voss argues that AI agents are notoriously difficult to test because they are non-deterministic: the same prompt can yield different, yet equally correct, outputs. Traditional unit tests, which assert one exact expected output, fail here. Without formal evals, teams cannot update system prompts or switch models without risking "cascading failures," where a minor change causes the agent to hallucinate or fail on edge cases.
2. The Evaluation Framework
The presentation outlines a systematic process for building an eval suite:
- Instrumentation: Use tools like open-inference to capture traces.
- Observation: Read raw traces to identify patterns of failure.
- Categorization: Use an LLM to group failure types (e.g., hallucination, reasoning gaps).
- Eval Implementation:
  - Code Evals: For deterministic checks (e.g., "Did it mention the stock ticker?").
  - LLM-as-a-Judge: For semantic checks (e.g., "Is this report actionable?").
- Iteration: Use experiments to test prompt changes against a "golden data set" of failures.
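The code evals mentioned in this process are ordinary deterministic functions. Below is a minimal sketch of three such checks (ticker mention, JSON validity, length limit). The function names, the sample ticker, and the 2000-character default are assumptions made for illustration, not details from the talk.

```python
import json
import re


def mentions_ticker(report: str, ticker: str) -> bool:
    """Pass if the ticker appears as a whole word (case-sensitive)."""
    return re.search(rf"\b{re.escape(ticker)}\b", report) is not None


def is_valid_json(text: str) -> bool:
    """Pass if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def within_length(text: str, max_chars: int) -> bool:
    """Pass if the text does not exceed the character limit."""
    return len(text) <= max_chars


def run_code_evals(report: str, ticker: str, max_chars: int = 2000) -> dict:
    """Run the checks that apply to a plain-text report; one bool per eval."""
    return {
        "mentions_ticker": mentions_ticker(report, ticker),
        "within_length": within_length(report, max_chars),
    }


results = run_code_evals("ACME rose 3% today.", "ACME")
```

Because these checks are cheap and repeatable, they can run on every trace, leaving the slower LLM judge for questions that need semantic judgment.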
3. Designing Custom LLM-as-a-Judge Rubrics
A high-quality rubric should include:
- Role Definition: Assigning the judge a persona (e.g., "Expert Financial Analyst").
- Explicit Criteria: Defining "actionable" vs. "not actionable" with specific, observable examples.
- Clear Data Boundaries: Using XML tags (e.g., <data>) to separate input, output, and context.
- Few-Shot Examples: Providing the judge with concrete examples of success and failure.
- Constrained Output: Forcing a binary or ternary choice (e.g., "Actionable" vs. "Not Actionable") rather than a vague 1–10 rating.
- Chain of Thought: Instructing the judge to "think out loud" before providing a final label.
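The rubric elements above can be combined into a single prompt template plus a strict parser for the judge's answer. This is a sketch under assumptions: the wording of the criteria, the tag names (`<input>`, `<output>`, `<examples>`, `<reasoning>`, `<label>`), and the label strings are illustrative choices, and the call to an actual judge model is deliberately left out.

```python
import re

LABELS = ("actionable", "not_actionable")

# Role, explicit criteria, XML boundaries, few-shot slot,
# chain of thought, and a constrained final label.
RUBRIC_TEMPLATE = """You are an expert financial analyst reviewing a report.

A report is "actionable" if it gives a specific recommendation to act on.
A report is "not_actionable" if it only restates facts without one.

<examples>
{examples}
</examples>

<input>
{user_input}
</input>

<output>
{agent_output}
</output>

First explain your reasoning inside <reasoning> tags.
Then give exactly one label inside <label> tags: actionable or not_actionable."""


def build_judge_prompt(user_input: str, agent_output: str, examples: list) -> str:
    """Fill the template; examples is a list of (text, label) pairs."""
    rendered = "\n".join(
        f'<example label="{label}">{text}</example>' for text, label in examples
    )
    return RUBRIC_TEMPLATE.format(
        examples=rendered, user_input=user_input, agent_output=agent_output
    )


def parse_label(judge_response: str):
    """Return a known label, or None if the judge broke the output contract."""
    match = re.search(r"<label>\s*(.*?)\s*</label>", judge_response, re.DOTALL)
    if not match:
        return None
    label = match.group(1).lower()
    return label if label in LABELS else None


prompt = build_judge_prompt(
    "Should I buy ACME?",
    "Buy ACME below $40.",
    [("Buy below $40, sell above $55.", "actionable"),
     ("ACME is a company.", "not_actionable")],
)
```

Returning None for anything outside the allowed labels keeps a malformed judge response from being silently counted as a pass or a fail.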
4. Real-World Applications & Strategies
- The Swiss Cheese Model: No single eval is perfect. Layering code evals, LLM judges, and human review ensures that the "holes" in one layer are covered by another.
- Data Flywheel: As you collect more production data and human-annotated examples, your "golden data set" grows, creating a competitive moat and a more robust testing suite.
- Precision vs. Recall: Depending on the use case (e.g., medical screening vs. spam filtering), teams must decide whether to prioritize recall, which minimizes false negatives (missing a failure), or precision, which minimizes false positives (flagging a success as a failure).
- Cost-Aware Evaluation: Use smaller, cheaper models (e.g., Claude Haiku) for the agent and larger, more capable models (e.g., Claude Sonnet/Opus) for the judge.
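The precision and recall trade-off can be measured directly by comparing a judge's labels against human-verified labels, which is also the core of meta-evaluation. The sketch below assumes string labels "fail" and "pass" and uses made-up sample data; neither comes from the talk.

```python
def precision_recall(golden: list, predicted: list, positive: str = "fail") -> dict:
    """Score predicted labels against golden labels for one positive class."""
    if len(golden) != len(predicted):
        raise ValueError("golden and predicted must have the same length")
    pairs = list(zip(golden, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against division by zero when a class never appears.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Hypothetical human-verified labels and judge labels for six traces.
golden = ["fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["fail", "pass", "pass", "fail", "fail", "pass"]

scores = precision_recall(golden, judge)
```

Here the judge catches two of three real failures (recall 2/3), and two of its three "fail" flags are correct (precision 2/3). Which number matters more depends on whether a missed failure or a false alarm is costlier for the use case.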
5. Notable Quotes
- "Validating that you haven't validated is just a fancy way of being wrong at scale."
- "If you can't define what 'great' means, you're not going to be able to write an eval that checks for it."
- "The first time a regression shows up before it reaches your users instead of after, you will have justified the cost of building your evals."
6. Synthesis/Conclusion
Evaluation is not a one-time task but a continuous loop. By moving from "vibes" to structured observability (Phoenix), teams can treat prompts like code. The most effective strategy is to start with simple code evals, build capability evals to address specific failure modes, and eventually transition those into a regression suite. The ultimate goal is to create a data-driven feedback loop where the agent's failures directly inform the next iteration of the prompt, leading to systematic, measurable improvement.