Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar
By Lenny's Podcast
Key Concepts
- Evals: Systematic measurement and improvement of AI application quality.
- Error Analysis: Identifying and categorizing errors in AI application logs.
- Open Coding: Initial, unstructured note-taking on observed errors.
- Axial Coding: Categorizing open codes into actionable failure modes.
- Benevolent Dictator: A single, domain-expert decision-maker for eval processes.
- Theoretical Saturation: Point where further data analysis yields no new insights.
- Code-Based Evals: Automated tests using code to check for specific failure modes.
- LLM as Judge: Using an LLM to evaluate complex failure modes with a binary (pass/fail) output.
- Criteria Drift: Changes in an evaluator's sense of "good" and "bad" as they review more outputs.
What are Evals?
Evals are a systematic way to measure and improve the quality of AI applications. They involve analyzing data from the application, creating metrics to measure performance, and iterating on the application based on those feedback signals.
- Example: A real estate assistant application that isn't writing emails correctly or calling the right tools. Evals help create metrics to measure these issues and improve the application.
- Analogy: While unit tests are a part of evals, evals encompass a broader spectrum of ways to measure application quality, including vague or ambiguous aspects like responding to new user requests.
Error Analysis: The Foundation of Evals
Error analysis is the first step in building effective evals. It involves looking at data from the AI application and identifying what's going wrong.
- Process:
- Data Collection: Gather logs from the AI application, including system prompts, user inputs, tool calls, and AI responses.
- Open Coding: Manually review the logs (traces) and write notes on any observed errors or undesirable behaviors. Focus on the most upstream error in each trace.
- Axial Coding: Categorize the open codes into actionable failure modes. This can be done with the help of LLMs.
- Quantify: Count the occurrences of each failure mode to prioritize areas for improvement (a sketch of this loop appears at the end of this section).
- Example: Nurture Boss (AI Assistant for Property Managers)
- Scenario: A user asks about a one-bedroom apartment with a study, and the AI responds that none are available.
- Error: The AI should have handed off to a human agent instead of ending the conversation.
- Note: "Should have handed off to a human."
- Another Scenario: The AI offers a virtual tour when none exists.
- Error: Hallucination.
- Note: "Offered virtual tour when not available."
- Tools: Observability tools like Braintrust, Arize Phoenix, and LangSmith can be used to view and annotate traces.
- Benevolent Dictator: Appoint a domain expert (e.g., product manager) to lead the error analysis process and make decisions about error categorization.
- Theoretical Saturation: Continue analyzing traces until no new types of errors are being discovered. Aim for at least 100 traces initially.
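To make the loop concrete, here is a minimal sketch of the open-code → axial-code → quantify flow in Python. The trace fields, notes, and category names are illustrative assumptions, not a schema from the episode:

```python
from collections import Counter

# Illustrative traces; in practice these come from your observability tool.
# The field names here are assumptions, not a fixed schema.
traces = {
    "t1": {"user": "Do you have a 1-bedroom with a study?",
           "assistant": "No, we don't. Goodbye!"},
    "t2": {"user": "Can I get a virtual tour?",
           "assistant": "Sure! Here's your virtual tour link."},
}

# Open coding: a free-form note on the most upstream error in each trace.
open_codes = {
    "t1": "Should have handed off to a human.",
    "t2": "Offered virtual tour when not available.",
}

# Axial coding: map each open code to an agreed failure-mode category
# (category names are examples, not a fixed taxonomy).
axial_codes = {
    "t1": "missed_human_handoff",
    "t2": "hallucinated_feature",
}

# Quantify: count failure modes to prioritize what to fix first.
for failure_mode, n in Counter(axial_codes.values()).most_common():
    print(f"{failure_mode}: {n}")
```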
Using AI for Error Analysis
LLMs can assist in categorizing open codes into axial codes.
- Process:
- Export open codes into a CSV file.
- Use a tool like Claude, ChatGPT, or Julius AI to analyze the CSV and suggest axial codes (categories).
- Iterate on the suggested axial codes to make them more specific and actionable.
- Use an LLM to categorize each trace into one of the refined axial codes.
- Create a pivot table to count the occurrences of each axial code.
- Prompt Engineering: Use specific language like "open codes" and "axial codes" in prompts to guide the LLM.
- Example Prompt: "Please analyze the attached CSV file. The metadata field contains a note for each trace; these notes are open codes. Suggest a set of axial codes that categorize them." Because "open codes" and "axial codes" are long-established terms of art in qualitative research, LLMs already know what they mean, so using them shortcuts the instructions. A minimal sketch of this workflow follows.
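This rough sketch assumes pandas and the OpenAI Python SDK; the model name, CSV column names, and category list are assumptions for illustration, not the speakers' exact setup:

```python
import pandas as pd
from openai import OpenAI  # assumes the openai v1 SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# CSV of exported open codes; column names are an assumption.
df = pd.read_csv("open_codes.csv")  # columns: trace_id, note

# Refined axial codes from a first LLM pass, made specific and actionable.
CATEGORIES = ["missed_human_handoff", "hallucinated_feature",
              "formatting_error", "other"]

def assign_axial_code(note: str) -> str:
    """Ask the LLM to map one open code to a single axial code."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "The following note is an open code from qualitative "
                f"error analysis. Assign it exactly one axial code from "
                f"{CATEGORIES}. Reply with the code only.\n\nNote: {note}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip()

df["axial_code"] = df["note"].map(assign_axial_code)

# Pivot: occurrences of each axial code, most frequent first.
print(df["axial_code"].value_counts())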
Code-Based Evals vs. LLM as Judge
After identifying failure modes, determine whether to use code-based evals or LLM as Judge.
- Code-Based Evals: Use code to automatically check for specific, well-defined failure modes (e.g., output format, string length).
- LLM as Judge: Use an LLM to evaluate more complex, subjective failure modes (e.g., whether a human handoff was appropriate).
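For well-defined failure modes, a code-based eval can be a few plain functions over the output. A minimal sketch, where the expected format, length limit, and placeholder pattern are illustrative assumptions:

```python
import json
import re

MAX_SMS_LENGTH = 160  # illustrative limit; set to your product's constraint

def check_output_format(output: str) -> bool:
    """Pass if the output is valid JSON with a 'reply' field."""
    try:
        return "reply" in json.loads(output)
    except json.JSONDecodeError:
        return False

def check_length(output: str) -> bool:
    """Pass if the output fits the channel's length limit."""
    return len(output) <= MAX_SMS_LENGTH

def check_no_placeholder(output: str) -> bool:
    """Pass if no unfilled template variables (e.g. {{name}}) leaked through."""
    return re.search(r"\{\{.*?\}\}", output) is None
```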
Building an LLM as Judge
LLM as Judge involves using an LLM to evaluate whether the AI application is behaving as expected in specific scenarios.
- Process:
- Define the Failure Mode: Clearly define the specific failure mode you want to evaluate (e.g., inappropriate human handoff).
- Create a Binary Judge Prompt: Write a prompt that instructs the LLM to output "true" or "false" based on whether the failure mode occurred.
- Example Prompt: "Output true or false: is there a handoff error? Handoff errors include an explicit human request being ignored or looped, a policy-mandated transfer, sensitive resident issues, tool or data unavailability, and same-day walk-in or tour requests ('you need to talk to a human for that')."
- Align with Human Judgment: Evaluate the LLM's judgments against a set of manually labeled traces (axial codes) to ensure alignment.
- Measure Agreement: Calculate the agreement rate between the LLM and human judgments. Be cautious of high raw agreement rates: when errors are rare, a judge that almost always outputs "pass" can still agree with humans most of the time.
- Iterate on the Prompt: Refine the prompt based on any misalignments between the LLM and human judgments.
- Matrix Analysis: Create a matrix comparing human and LLM judgments to identify specific types of errors (e.g., human says "false," LLM says "true").
- Online Monitoring: Use LLM as Judge to monitor application performance in production by sampling traces and evaluating them in real time (a minimal sketch of a judge and its alignment check follows).
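A minimal sketch of a binary judge plus an alignment check against human labels, assuming the OpenAI Python SDK; the model name, prompt wording, and failure-mode definition are assumptions for illustration:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a property-management assistant.
Output exactly "true" or "false": did the assistant fail to hand off
to a human when it should have (explicit human request ignored or
looped, policy-mandated transfer, sensitive resident issue, tool or
data unavailable, or a same-day walk-in/tour request)?

Conversation:
{trace}
"""

def judge(trace: str) -> bool:
    """Binary LLM-as-judge verdict for one trace."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(trace=trace)}],
    )
    return resp.choices[0].message.content.strip().lower() == "true"

def alignment(labeled: list[tuple[str, bool]]) -> None:
    """Compare judge verdicts against human labels on annotated traces."""
    matrix = {(h, j): 0 for h in (True, False) for j in (True, False)}
    for trace, human_label in labeled:
        matrix[(human_label, judge(trace))] += 1
    total = sum(matrix.values())
    agree = matrix[(True, True)] + matrix[(False, False)]
    print(f"agreement: {agree / total:.0%}")
    print(f"human=True, judge=False (missed errors): {matrix[(True, False)]}")
    print(f"human=False, judge=True (false alarms):  {matrix[(False, True)]}")
```

Printing the off-diagonal cells separately matters precisely because errors are rare and raw agreement can mislead; for online monitoring, the same judge() can be run over a random sample of production traces.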
The Debate Around Evals
There is some controversy and debate around the importance and value of evals.
- Misconceptions:
- Evals are just unit tests.
- AI can automatically handle evals without human involvement.
- Evals are unnecessary if you have A/B tests.
- Counterarguments:
- Evals encompass a broad range of techniques, including data analysis, code-based tests, and LLM as Judge.
- Human judgment is essential for defining failure modes and aligning LLM judges.
- A/B tests are a form of eval, but they should be informed by error analysis.
- Key Point: The debate often stems from narrow definitions of evals and negative experiences with poorly implemented eval processes.
Tips and Tricks for Successful Evals
- Look at Your Data: Regularly review application logs and identify patterns of errors.
- Don't Be Afraid to Fix Obvious Errors: If you identify a simple fix, implement it immediately instead of over-analyzing.
- Use LLMs to Assist, Not Replace: Leverage LLMs for tasks like categorizing data and generating prompts, but always validate their outputs with human judgment.
- Start Simple: Begin with basic data analysis and gradually introduce more complex eval techniques.
- Iterate and Refine: Continuously improve your eval processes based on your experiences and the insights you gain.
- Create Custom Tools: Build tools to streamline the process of looking at data and annotating traces (a minimal sketch follows this list).
- Focus on Actionable Improvements: The goal of evals is to improve your product, not to achieve perfect scores.
- Keep Learning: Stay up-to-date on the latest eval techniques and best practices.
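As one example of a custom tool, a bare-bones annotation loop can be a few dozen lines. This sketch assumes one JSON file per trace in a traces/ directory and writes open codes to a CSV; the file layout and column names are assumptions to adapt to your own logs:

```python
import csv
import json
from pathlib import Path

def annotate_traces(trace_dir: str = "traces",
                    out_file: str = "open_codes.csv") -> None:
    """Show each trace and record a free-form open code for it."""
    with open(out_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["trace_id", "note"])
        for path in sorted(Path(trace_dir).glob("*.json")):
            trace = json.loads(path.read_text())
            print(json.dumps(trace, indent=2))
            note = input("Open code (Enter to skip): ").strip()
            if note:
                writer.writerow([path.stem, note])

if __name__ == "__main__":
    annotate_traces()
```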
Conclusion
Evals are a critical component of building high-quality AI applications. By systematically measuring and improving application performance, developers can create more effective and user-friendly AI products. The key is to start with data-driven error analysis, leverage AI to assist in the process, and continuously iterate and refine your eval techniques.