How to evaluate agents in practice

Key Concepts

Three-Tier Testing Pyramid: A framework for evaluating AI agents, comprising Component Level Unit Tests (Tier 1), Trajectory Level Integration Tests (Tier 2), and End-to-End Human Review (Tier 3).
Component Level Unit Tests (Tier 1): Test the smallest building blocks of an agent in isolation, focusing on tool use, prompt interpretation, and response structure (e.g., JSON validity). These tests are fast, cheap, and automated.
Trajectory Level Integration Tests (Tier 2): Test a full multi-step task end-to-end, assessing the agent's planning, tool order, adaptability to failures, memory, and goal achievement.
End-to-End Human Review (Tier 3): Uses human reviewers to evaluate the overall user experience, including helpfulness, safety, and common sense. This tier is slow and expensive but serves as the final quality gate.
Trajectory: The entire sequence of actions an agent takes to solve a task, including tool calls with arguments, intermediate reasoning steps, and the final response.
Agent Development Kit (ADK): Google's open-source framework for building AI agents. Its evaluation tooling is used here to build and run automated checks that follow the testing pyramid framework.
Trajectory Average Score: An ADK metric that measures how closely the actual sequence of tool calls matches the expected sequence, on a scale from 0 to 1. The passing threshold controls strictness: a threshold of 1 requires a 100% match, while a threshold of 0 lets any trajectory pass. A threshold between 0 and 1 is typically used.
Response Match Score: An ADK metric that assesses the similarity between the agent's final answer and the expected answer, using ROUGE-1 for word overlap (precision and recall). A threshold between 0 and 1 is used to allow for natural language variations.
Nondeterministic Agents: AI agents whose outputs can vary even with the same input, often due to sampling parameters such as temperature. Strict matching criteria (e.g., a threshold of 1) should therefore be avoided unless temperature is set to zero.

Part 1: The Three-Tier Testing Pyramid

The video introduces a three-tier testing pyramid as a framework for evaluating AI agents, addressing the limitations of traditional software testing for AI.

Tier 1: Component Level Unit Tests
- Focus: Testing the smallest, isolated building blocks of an agent.
- Examples: Verifying if the agent selects the correct tool from a prompt, or if it generates a valid, correctly structured response (e.g., correct JSON fields).
- Characteristics: Fast, cheap, and automated, making them ideal for continuous integration to catch regressions early.
- Purpose: Ensures the correctness of individual components like tool usage and parameter generation.

Tier 2: Trajectory Level Integration Tests
- Focus: Testing a full, multi-step task from beginning to end.
- Key Questions: Did the agent plan logically? Did it use tools in the correct order? Did it adapt if something failed? Did it remember previous information and reach the goal?
- Purpose: Evaluates the agent's capability and reasoning abilities.

Tier 3: End-to-End Human Review
- Focus: Involving humans in the loop to assess the overall user experience.
- Evaluation Criteria: Helpfulness, safety, common sense.
- Characteristics: Slow and more expensive, but serves as the final quality gate.
- Purpose: Enhances user experience and builds trust.

Part 2: ADK in Action - Evaluating the Bookfinder Agent

This section demonstrates how to apply the three-tier testing pyramid using the Google Agent Development Kit (ADK) with a "Bookfinder" agent as a case study.

Bookfinder Agent Example:
- Model: Uses Gemini 2.5 Pro as its "brain."
- Tools:
  1. Search local library tool
  2. Find local bookstore tool
  3. Order online tool
- User Query Example: "Order this book Harry Potter for me."
- Agent Logic: The agent first checks the local library, then a local bookstore. If the book is not found in either local option, it proceeds to order online.
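The fallback order described above can be sketched in plain Python. This is a minimal, deterministic illustration, not the Bookfinder agent itself (in the real agent the model decides which tool to call). The function names `search_local_library`, `find_local_bookstore`, and `order_online`, along with their return values, are assumptions made for this example.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for the three tools. Each returns a result string,
# or None when the book is not available through that channel.
def search_local_library(title: str) -> Optional[str]:
    return None  # pretend the library has no copy

def find_local_bookstore(title: str) -> Optional[str]:
    return None  # pretend no nearby store stocks it

def order_online(title: str) -> Optional[str]:
    return f"online order placed for {title}"

def find_book(title: str) -> tuple[str, list[dict]]:
    """Try local options first, then fall back to ordering online.

    Returns the final answer and the trajectory of tool calls that were made.
    """
    trajectory: list[dict] = []
    tools: list[Callable[[str], Optional[str]]] = [
        search_local_library,
        find_local_bookstore,
        order_online,
    ]
    for tool in tools:
        # Record every call so the trajectory can be inspected later.
        trajectory.append({"tool": tool.__name__, "args": {"title": title}})
        result = tool(title)
        if result is not None:
            return result, trajectory
    return f"Could not find {title} anywhere.", trajectory

answer, trajectory = find_book("Harry Potter")
print(answer)
print([step["tool"] for step in trajectory])
```

Recording each call as it happens gives a list of tool names and arguments that can later be compared with an expected sequence.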
What to Measure:
- Trajectory: The entire journey the agent takes to solve the task. This includes:
  - The sequence of tool calls and their arguments.
  - Intermediate steps and reasoning.
  - The final response.
- Final Answer: The ultimate output provided to the user, such as recommending an online retailer or confirming local options failed.

ADK Built-in Metrics for Evaluation:
- Trajectory Average Score:
  - Function: Measures the similarity between the actual sequence of tool calls and the expected sequence.
  - Scoring: The score ranges from 0 to 1. A threshold of 1 requires a 100% match; a threshold of 0 requires no match at all. A threshold between 0 and 1 defines the passing criteria.
- Response Match Score:
  - Function: Assesses how similar the final answer is to the expected answer.
  - Method: Uses ROUGE-1 to check word overlap, combining precision and recall into a score from 0 to 1.
  - Threshold: A threshold (e.g., 0.5 for "close enough," 0.8 for "very strict") is chosen to accommodate natural language variations.
  - Note: Because agents are nondeterministic, avoid a threshold of 1 unless temperature is set to zero.
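To make the two metrics concrete, here is a rough sketch of how such scores could be computed by hand. It is one plausible reading of the descriptions above, not ADK's implementation: the position-by-position matching rule, the whitespace tokenization, and the function names `trajectory_avg_score` and `rouge1_f1` are all assumptions.

```python
from collections import Counter

def trajectory_avg_score(actual: list[dict], expected: list[dict]) -> float:
    """Fraction of positions where the actual tool call equals the expected one.

    Assumption: calls are compared position by position, and missing or
    extra steps count as mismatches.
    """
    steps = max(len(actual), len(expected))
    if steps == 0:
        return 1.0
    matches = sum(1 for a, e in zip(actual, expected) if a == e)
    return matches / steps

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 (ROUGE-1 style) on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def call(tool: str) -> dict:
    return {"tool": tool, "args": {"title": "Harry Potter"}}

expected = [call("search_local_library"), call("find_local_bookstore"), call("order_online")]
actual = [call("search_local_library"), call("order_online")]  # skipped the bookstore

print(round(trajectory_avg_score(actual, expected), 2))  # 0.33
print(round(rouge1_f1(
    "I ordered Harry Potter online for you",
    "I ordered Harry Potter from an online store",
), 2))  # 0.67
```

In this example the response would clear a 0.5 threshold but not a 0.8 one, which is the kind of natural-language variation the looser threshold is meant to tolerate.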
Designing Tests with ADK:
- Tier 1 Component Tests:
  - Goal: Test simple prompts and expected tool calls for correctness.
  - Checks: Tool selection accuracy (e.g., did it pick Search local library?) and parameter correctness (e.g., did it pass title=Harry Potter correctly?).
  - Action: Fix any failures at this foundational level before proceeding.
- Tier 2 Trajectory Tests:
  - Goal: Evaluate multi-step plans and final answer quality.
  - Method: Use ADK eval with multi-step tests, defining expected tool sequences.
  - Criteria: Set thresholds for Trajectory Average Score (e.g., 0.8 for a close match to the plan) and Response Match Score (e.g., 0.5 for natural language variation in the final response).
- Tier 3 Human Review:
  - Method: Use the traces ADK provides so that human reviewers can judge overall quality.

Demo: Using ADK for Agent Testing

The video includes a demonstration of ADK in practice.

Tier 1 Component Test Example:
- Goal: Verify that a single tool call is selected correctly and that its parameters are filled in correctly.
- Checks: Tool selection accuracy and parameter correctness.
- Outcome: Failures indicate issues with tool usage or parameter generation that need immediate fixing.
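A component-level check of this kind might look like the sketch below. It assumes, purely for illustration, that the agent's tool call is available as a JSON string with `tool` and `args` keys. That format and the helper name `check_tool_call` are hypothetical and are not ADK's own representation.

```python
import json

def check_tool_call(raw: str, expected_tool: str, expected_args: dict) -> list[str]:
    """Return the problems found in one emitted tool call (empty list = pass)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call is not a JSON object"]

    problems = []
    # Tool selection accuracy: did the agent pick the expected tool?
    if call.get("tool") != expected_tool:
        problems.append(f"wrong tool: {call.get('tool')!r}")
    # Parameter correctness: were the arguments filled in as expected?
    if call.get("args") != expected_args:
        problems.append(f"wrong args: {call.get('args')!r}")
    return problems

good = '{"tool": "search_local_library", "args": {"title": "Harry Potter"}}'
wrong_tool = '{"tool": "order_online", "args": {"title": "Harry Potter"}}'
malformed = '{"tool": '

for raw in (good, wrong_tool, malformed):
    print(check_tool_call(raw, "search_local_library", {"title": "Harry Potter"}))
```

Returning a list of problems rather than a single boolean makes a failing test show whether the issue was structure, tool choice, or parameters.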
Tier 2 Trajectory Test Example:
- Goal: Ensure the agent follows the expected journey and produces a good final response.
- Setup: Example code is shown for setting passing criteria using Trajectory Average Score (e.g., 0.8) and Response Match Score (e.g., 0.5).
- Execution: The test is run, and results are displayed.
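The passing criteria from the setup step can be sketched as a simple threshold check. The key names below follow the metric names as ADK's documentation appears to spell them (`tool_trajectory_avg_score`, `response_match_score`) and should be verified against your ADK version. The pass rule (score at or above threshold) and the helper names are assumptions, and the scores are made-up inputs rather than results from the video.

```python
# Thresholds from the example: 0.8 for the trajectory, 0.5 for the response.
CRITERIA = {
    "tool_trajectory_avg_score": 0.8,
    "response_match_score": 0.5,
}

def evaluate_case(scores: dict[str, float], criteria: dict[str, float]) -> dict[str, bool]:
    """Per-metric pass/fail; a missing score is treated as 0.0 (a failure)."""
    return {
        name: scores.get(name, 0.0) >= threshold
        for name, threshold in criteria.items()
    }

def case_passes(scores: dict[str, float], criteria: dict[str, float]) -> bool:
    """A test case passes only if every metric meets its threshold."""
    return all(evaluate_case(scores, criteria).values())

# Hypothetical scores for two runs of the same test case.
followed_plan = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.67}
skipped_step = {"tool_trajectory_avg_score": 0.33, "response_match_score": 0.9}

print(case_passes(followed_plan, CRITERIA))
print(evaluate_case(skipped_step, CRITERIA))
```

Requiring every metric to pass means a fluent final answer cannot mask a trajectory that skipped an expected step.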

Conclusion

The video concludes by summarizing the practical application of AI agent evaluation using the three-tier testing pyramid and ADK.

Process Recap:
1. Tier 1: Fast, cheap component checks for correctness.
2. Tier 2: ADK eval for full trajectory and final answer quality, assessing capability and reasoning.
3. Tier 3: Human review as the final check on user experience and trust.

Key Takeaway: This structured approach, combined with ADK's built-in tools, lets developers move from assuming that an agent "works" to knowing it is "very reliable."

Series Wrap-up: This episode follows a previous one on the theory of agent evaluation and adds practical steps and hands-on demonstrations. Viewers are now equipped to create their own evaluation sets and test files with ADK.
