How to evaluate agents in practice
By Google Cloud Tech
Key Concepts
- Three-Tier Testing Pyramid: A framework for evaluating AI agents, comprising Component Level Unit Tests (Tier 1), Trajectory Level Integration Tests (Tier 2), and End-to-End Human Review (Tier 3).
- Component Level Unit Tests (Tier 1): Tests the smallest building blocks of an agent in isolation, focusing on tool use, prompt interpretation, and response structure (e.g., JSON validity). These are fast, cheap, and automated.
- Trajectory Level Integration Tests (Tier 2): Tests a full multi-step task end-to-end, assessing the agent's planning, tool order, adaptability to failures, memory, and goal achievement.
- End-to-End Human Review (Tier 3): Involves human oversight to evaluate the overall user experience, including helpfulness, safety, and common sense. This tier is slow and expensive but serves as the final quality gate.
- Trajectory: The entire sequence of actions an agent takes to solve a task, including tool calls with arguments, intermediate reasoning steps, and the final response.
- Agent Development Kit (ADK): A toolset used to build and run automated checks for AI agents, following the testing pyramid framework.
- Trajectory Average Score: An ADK metric that measures how closely the actual sequence of tool calls matches the expected sequence. Scores range from 0 to 1. A passing threshold of 1 requires a 100% match, a threshold of 0 accepts complete divergence, and a value in between is typically used.
- Response Match Score: An ADK metric that assesses the similarity between the agent's final answer and the expected answer, using ROUGE-1 for word overlap (precision and recall). A threshold between 0 and 1 is used to allow for natural language variations.
- Nondeterministic Agents: AI agents whose outputs can vary even with the same input, often because of sampling parameters such as temperature. Strict matching criteria (e.g., a threshold of 1) should therefore be avoided unless temperature is set to zero.
Part 1: The Three-Tier Testing Pyramid
The video introduces a three-tier testing pyramid as a framework for evaluating AI agents, addressing the limitations of traditional software testing for AI.
Tier 1: Component Level Unit Tests
- Focus: Testing the smallest, isolated building blocks of an agent.
- Examples: Verifying if the agent selects the correct tool from a prompt, or if it generates a valid, correctly structured response (e.g., correct JSON fields).
- Characteristics: Fast, cheap, and automated, making them ideal for continuous integration to catch regressions early.
- Purpose: Ensures the correctness of individual components like tool usage and parameter generation.
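A Tier 1 check of this kind can be sketched in plain Python, without any agent framework. The snippet below is only an illustration: the tool name `search_local_library`, the `{"tool": ..., "args": ...}` response shape, and the helper `check_tool_call` are assumptions made for this example, not part of ADK.

```python
import json

# Hypothetical expectation for a prompt such as "Find Harry Potter at the library".
EXPECTED_TOOL = "search_local_library"
REQUIRED_ARGS = {"title"}


def check_tool_call(raw_response: str) -> list[str]:
    """Return a list of problems found in a model's tool-call response."""
    # Structure check: the response must be valid JSON and a JSON object.
    try:
        call = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["response is not a JSON object"]

    problems = []
    # Tool selection check: did the agent pick the expected tool?
    if call.get("tool") != EXPECTED_TOOL:
        problems.append(f"wrong tool: {call.get('tool')!r}")
    # Parameter check: are all required arguments present?
    args = call.get("args")
    if not isinstance(args, dict):
        problems.append("missing or malformed 'args'")
    else:
        missing = REQUIRED_ARGS - set(args)
        if missing:
            problems.append(f"missing args: {sorted(missing)}")
    return problems
```

An empty list means the component passed. Because the check is a pure function, it can run on every commit in continuous integration.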
Tier 2: Trajectory Level Integration Tests
- Focus: Testing a full, multi-step task from beginning to end.
- Key Questions: Did the agent plan logically? Did it use tools in the correct order? Did it adapt if something failed? Did it remember previous information and reach the goal?
- Purpose: Evaluates the agent's capability and reasoning abilities.
Tier 3: End-to-End Human Review
- Focus: Involving humans in the loop to assess the overall user experience.
- Evaluation Criteria: Helpfulness, safety, common sense.
- Characteristics: Slow and more expensive, but serves as the final quality gate.
- Purpose: Enhances user experience and builds trust.
Part 2: ADK in Action - Evaluating the Bookfinder Agent
This section demonstrates how to apply the three-tier testing pyramid using the Google Agent Development Kit (ADK) with a "Bookfinder" agent as a case study.
Bookfinder Agent Example:
- Model: Uses Gemini 2.5 Pro as its "brain."
- Tools: the Search local library tool, the Find local bookstore tool, and the Order online tool.
- User Query Example: "Order this book Harry Potter for me."
- Agent Logic: The agent first checks the local library, then a local bookstore. If the book is not found in either local option, it proceeds to order online.
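The fallback order described above can be written down as a deterministic reference, which is useful when deciding what the expected tool sequence in a test should be. This is a sketch only: in the real agent the model chooses the tools, and the function names and stub return values here are hypothetical.

```python
from typing import Callable, Optional

# A tool takes a book title and returns a result string, or None if nothing was found.
Tool = Callable[[str], Optional[str]]


def find_book(title: str, tools: list[tuple[str, Tool]]) -> tuple[Optional[str], list[str]]:
    """Try each tool in order; return the first hit and the names of the tools called."""
    called = []
    for name, tool in tools:
        called.append(name)
        result = tool(title)
        if result is not None:
            return result, called
    return None, called


# Stub tools standing in for real integrations (hypothetical behavior).
def search_local_library(title: str) -> Optional[str]:
    return None  # stub: the library has no copy


def find_local_bookstore(title: str) -> Optional[str]:
    return None  # stub: no local bookstore stocks it


def order_online(title: str) -> Optional[str]:
    return f"Ordered '{title}' online"


TOOLS = [
    ("search_local_library", search_local_library),
    ("find_local_bookstore", find_local_bookstore),
    ("order_online", order_online),
]

result, called = find_book("Harry Potter", TOOLS)
```

With these stubs, both local options fail, so all three tools are called in order and the online order is the final result.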
What to Measure:
- Trajectory: The entire journey the agent takes to solve the task. This includes:
- The sequence of tool calls and their arguments.
- Intermediate steps and reasoning.
- The final response.
- Final Answer: The ultimate output provided to the user, such as recommending an online retailer or confirming local options failed.
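One way to picture what gets recorded is a small data structure holding the three parts of a trajectory. The class and field names below are assumptions chosen for illustration; ADK has its own representation of traces and eval cases.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


@dataclass
class Trajectory:
    # The sequence of tool calls and their arguments.
    tool_calls: list[ToolCall] = field(default_factory=list)
    # Intermediate steps and reasoning, kept as free text here.
    intermediate_steps: list[str] = field(default_factory=list)
    # The final response shown to the user.
    final_response: str = ""

    def tool_names(self) -> list[str]:
        """Return just the ordered tool names, handy for sequence comparisons."""
        return [call.name for call in self.tool_calls]


# Example trajectory for the Bookfinder query (values are made up).
example = Trajectory(
    tool_calls=[
        ToolCall("search_local_library", {"title": "Harry Potter"}),
        ToolCall("find_local_bookstore", {"title": "Harry Potter"}),
        ToolCall("order_online", {"title": "Harry Potter"}),
    ],
    intermediate_steps=["Library has no copy", "Bookstore has no copy"],
    final_response="Neither local option had the book, so I ordered it online.",
)
```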
ADK Built-in Metrics for Evaluation:
- Trajectory Average Score:
- Function: Measures the similarity between the actual sequence of tool calls and the expected sequence.
- Scoring: Ranges from 0 to 1. A threshold of 1 requires a 100% match; a threshold of 0 requires no match. A threshold between 0 and 1 is set to define the passing criteria.
- Response Match Score:
- Function: Assesses how similar the final answer is to the expected answer.
- Method: Uses ROUGE-1 to check word overlap, combining precision and recall into a score from 0 to 1.
- Threshold: A threshold (e.g., 0.5 for "close enough," 0.8 for "very strict") is chosen to accommodate natural language variations.
- Note: Because agents are nondeterministic, avoid requiring a score of 1 unless temperature is set to zero.
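To make the word-overlap idea concrete, here is a minimal ROUGE-1 sketch. It assumes lowercased whitespace tokenization and combines precision and recall as an F1 score; ADK's own implementation may tokenize or combine differently, so treat the numbers as illustrative.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate answer and a reference answer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())  # share of candidate words found in reference
    recall = overlap / sum(ref.values())      # share of reference words found in candidate
    return 2 * precision * recall / (precision + recall)


reference = "the book was ordered online"
paraphrase = "I ordered the book online"
score = rouge1_f1(paraphrase, reference)
```

Here the paraphrase shares four of five words with the reference in each direction, so it scores about 0.8. It clears a 0.5 threshold but would fail a threshold of 1, even though the meaning is the same.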
Designing Tests with ADK:
- Tier 1 Component Tests:
- Goal: Test simple prompts and expected tool calls for correctness.
- Checks: Tool selection accuracy (e.g., did it pick Search local library?) and parameter correctness (e.g., did it pass title=Harry Potter correctly?).
- Action: Fix any failures at this foundational level before proceeding.
- Tier 2 Trajectory Tests:
- Goal: Evaluate multi-step plans and final answer quality.
- Method: Use ADK eval with multi-step tests, defining expected tool sequences.
- Criteria: Set thresholds for Trajectory Average Score (e.g., 0.8 for a close match to the plan) and Response Match Score (e.g., 0.5 for natural language variation in the final response).
- Tier 3 Human Review:
- Leverage: ADK provides traces, enabling human reviewers to judge overall quality.
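The Tier 2 pass/fail decision can be sketched as follows. This is a simplified model, assuming each turn scores 1 when its tool calls match the expectation exactly and 0 otherwise, with the average taken over turns. The dictionary keys mirror the metric names in this summary and are placeholders; the exact config format and scoring rules should be taken from the ADK documentation.

```python
from typing import Any

ToolCall = tuple[str, dict[str, Any]]  # (tool name, arguments)

# Example thresholds from the text: 0.8 for the trajectory, 0.5 for the response.
THRESHOLDS = {"trajectory_avg_score": 0.8, "response_match_score": 0.5}


def trajectory_avg_score(
    actual_turns: list[list[ToolCall]], expected_turns: list[list[ToolCall]]
) -> float:
    """Fraction of turns whose tool calls exactly match the expected calls."""
    if not expected_turns or len(actual_turns) != len(expected_turns):
        raise ValueError("need the same non-zero number of actual and expected turns")
    matches = sum(a == e for a, e in zip(actual_turns, expected_turns))
    return matches / len(expected_turns)


def passes(scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> bool:
    """A test case passes only if every metric meets its threshold."""
    return all(scores[name] >= minimum for name, minimum in thresholds.items())


expected = [
    [("search_local_library", {"title": "Harry Potter"})],
    [("order_online", {"title": "Harry Potter"})],
]
# In the second turn the agent passed a different title, so that turn does not match.
actual = [
    [("search_local_library", {"title": "Harry Potter"})],
    [("order_online", {"title": "Harry"})],
]
traj = trajectory_avg_score(actual, expected)
ok = passes({"trajectory_avg_score": traj, "response_match_score": 0.6})
```

In this example one of two turns matches, giving a trajectory score of 0.5, which falls below the 0.8 threshold, so the case fails even though the response score is acceptable.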
Demo: Using ADK for Agent Testing
The video includes a demonstration of ADK in practice.
Tier 1 Component Test Example:
- Goal: Verify if a single tool call is selected and filled correctly.
- Checks: Tool selection accuracy and parameter correctness.
- Outcome: Failures indicate issues with tool usage or parameter generation that need immediate fixing.
Tier 2 Trajectory Test Example:
- Goal: Ensure the agent follows the expected journey and produces a good final response.
- Setup: Example code is shown for setting passing criteria using Trajectory Average Score (e.g., 0.8) and Response Match Score (e.g., 0.5).
- Execution: The test is run, and results are displayed.
Conclusion
The video concludes by summarizing the practical application of AI agent evaluation using the three-tier testing pyramid and ADK.
- Process Recap:
- Tier 1: Fast, cheap component checks for correctness.
- Tier 2: ADK eval for full trajectory and final answer quality, assessing capability and reasoning.
- Tier 3: Human review to guarantee a great user experience and trust.
- Key Takeaway: This structured approach, combined with ADK's built-in tools, allows developers to move from an assumption that an agent "works" to a state of knowing it is "very reliable."
- Series Wrap-up: This episode, following a previous one on the theory of agent evaluation, provides practical steps and hands-on demonstrations. Users are now equipped to create their own evaluation sets and test files with ADK.