AI Engineer World's Fair 2025 - Evals
By AI Engineer
Key Concepts
- Evals: Structured tests to measure the performance of AI systems.
- Bitter Lesson: The idea that general methods that scale well outperform complex, domain-specific methods in AI.
- Premature Optimization: Hard-coding solutions at a lower level of abstraction than necessary, hindering scalability and adaptability.
- Separation of Concerns: Decoupling task definition, inference time strategy, and formatting/parsing in AI systems for better engineering.
- Data Flywheel: A continuous cycle of collecting user feedback, building evals, improving the product, and attracting more users.
- Trajectory Evals: Evaluating an agent's entire run, including all tool calls and artifacts, rather than just the end state.
- Rubrics-Based Scoring: Using an LLM to judge runs based on handcrafted rubrics that describe specific aspects to pay attention to.
- Fast Evals: Using a set of query and document pairs to quickly and inexpensively measure the performance of a retrieval system.
- Golden Data Set: A set of query and document pairs where the document should come out if the query is put in.
- Kura: A library for summarizing conversations, clustering them, building hierarchies of those clusters, and comparing evals across different KPIs.
Main Topics and Key Points
1. The Bitter Lesson and AI Engineering
- The Bitter Lesson: Rich Sutton argues that AI research should focus on general methods that scale, like search and learning, rather than domain-specific knowledge.
- Relevance to AI Engineering: Questions the role of AI engineering if leveraging domain knowledge is considered detrimental.
- Intelligence vs. Reliable Systems: Software is built for reliability, robustness, and controllability, not just AGI.
- Engineering as Subtraction of Agency: Reliable systems involve carefully subtracting agency and intelligence in specific areas.
- Takeaway: Focus on engineering what the system is learning for, not the specifics of search and learning.
2. Premature Optimization and Abstraction
- Premature Optimization: The talk borrows the classic warning against premature optimization from structured programming and applies it to AI engineering.
- Definition: Hard-coding solutions at a lower level of abstraction than justifiable.
- Example: Computing a square root via bit manipulation instead of calling the built-in function (sketched in the code after this list).
- Hypothesis: Premature optimization occurs when hard-coding at a lower level of abstraction than necessary.
- Tight Coupling: A significant issue in applied machine learning and prompt engineering.
- Takeaway: Invest in defining things specific to your AI system and decouple from lower-level, swappable pieces.
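To make the square-root analogy above concrete, here is a minimal Python sketch (not from the talk): the bit-manipulation version hard-codes assumptions about the IEEE-754 float layout, while the built-in call stays at the appropriate level of abstraction.

```python
import math
import struct

def fast_inv_sqrt(x: float) -> float:
    """Quake-style approximation: reinterpret the float's bits, apply a magic
    constant, then refine with one Newton-Raphson step. Tightly coupled to the
    IEEE-754 representation, i.e. a lower level of abstraction than needed."""
    i = struct.unpack(">l", struct.pack(">f", x))[0]   # float bits as a 32-bit int
    i = 0x5F3759DF - (i >> 1)                          # the "magic constant" hack
    y = struct.unpack(">f", struct.pack(">l", i))[0]   # back to a float
    return y * (1.5 - 0.5 * x * y * y)                 # one refinement step

def sqrt_at_the_right_level(x: float) -> float:
    """The same capability expressed at the appropriate abstraction level."""
    return math.sqrt(x)

print(1 / fast_inv_sqrt(2.0), sqrt_at_the_right_level(2.0))  # both ~1.4142
```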
3. Prompts as a Horrible Abstraction for Programming
- Prompt's Shortcomings: Described as a "stringly typed canvas" that couples task definition with random, overfitted decisions.
- Entanglement: Prompts entangle fundamental task definitions with inference time strategies and formatting/parsing details.
- Separation of Concerns: Advocates for separating the spec, code, and natural language descriptions.
- Evals: Emphasizes the importance of evals when iterating and adapting prompts to specific models while keeping the core behavior intact.
- Code's Role: Highlights the need for code to define tools, structure, and control information flow.
- Good Canvas: A canvas should allow expressing specs, code, and natural language in a streamlined and decoupled manner.
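As a hypothetical illustration of this separation of concerns (the names TaskSpec, format_prompt, and with_chain_of_thought are invented for this sketch, not any particular framework's API), the task definition, the formatting/parsing, and the inference-time strategy can live in separate, swappable pieces rather than being entangled in one prompt string:

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    """The fundamental task definition, independent of any model or prompt style."""
    instructions: str        # natural-language spec of what the task is
    input_fields: list[str]
    output_fields: list[str]

def format_prompt(spec: TaskSpec, inputs: dict) -> str:
    """Formatting/parsing concern: swappable without touching the spec."""
    lines = [spec.instructions, ""]
    lines += [f"{name}: {inputs[name]}" for name in spec.input_fields]
    lines += [f"{name}:" for name in spec.output_fields]
    return "\n".join(lines)

def with_chain_of_thought(spec: TaskSpec) -> TaskSpec:
    """Inference-time strategy applied as a transformation, not hand-edited prose."""
    return TaskSpec(spec.instructions + "\nThink step by step before answering.",
                    spec.input_fields, ["reasoning"] + spec.output_fields)

summarize = TaskSpec("Summarize the document in one sentence.", ["document"], ["summary"])
prompt = format_prompt(with_chain_of_thought(summarize), {"document": "…"})
```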
4. Building Good AI Agents: The Data Flywheel
- Challenge: Building good AI agents is hard, especially for non-technical users.
- Probabilistic Software: Building probabilistic software is different from building traditional software.
- Data Flywheel: A continuous cycle of collecting feedback, building evals, improving the product, and attracting more users.
- Actionable Feedback: Collecting actionable feedback is the first step in the data flywheel.
- Instrumentation: Instrumenting code to record tool calls, errors, and pre/post-processing steps.
- Repeatable Runs: Striving to make runs repeatable for eval purposes.
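A hypothetical instrumentation sketch (class and field names are invented for illustration): every tool call, its arguments, and any errors are written to a run log so the run can be replayed later and turned into an eval case.

```python
import json
import os
import time
import uuid

class RunRecorder:
    """Minimal run log: captures tool calls, errors, and pre/post-processing steps."""
    def __init__(self, run_id: str | None = None):
        self.run_id = run_id or str(uuid.uuid4())
        self.events: list[dict] = []

    def record_tool_call(self, tool: str, args: dict, result=None, error: str | None = None):
        self.events.append({
            "ts": time.time(), "type": "tool_call", "tool": tool,
            "args": args, "result": result, "error": error,
        })

    def save(self, directory: str = "runs") -> str:
        # Persisted runs become candidate eval cases for the data flywheel.
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, f"{self.run_id}.json")
        with open(path, "w") as f:
            json.dump({"run_id": self.run_id, "events": self.events}, f, indent=2)
        return path

recorder = RunRecorder()
recorder.record_tool_call("search_docs", {"query": "refund policy"}, result=["doc_42"])
recorder.save()
```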
5. Collecting Actionable Feedback
- Explicit User Feedback: High-signal feedback, but often rare.
- Contextual Feedback: Asking for feedback in the right context to increase submissions.
- Implicit Feedback: Mining user interaction for implicit signals, such as turning on an agent after testing or copying a model's response.
- LLM for Frustration Detection: Using an LLM to detect and group user frustrations from interactions (sketched below).
- Traditional User Metrics: Mining traditional user metrics for implicit signals.
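A hedged sketch of the frustration-detection idea, assuming an OpenAI-compatible client and an `OPENAI_API_KEY` in the environment; the labels and prompt wording are illustrative, not the speaker's exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def detect_frustration(transcript: str) -> str:
    """Classify a conversation into a coarse frustration bucket for later aggregation."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Label the user's frustration in this conversation as one of: "
                        "none, mild, severe. Then name the likely cause in a few words."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

print(detect_frustration("User: That's not what I asked. Again. Please just fix the bug."))
```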
6. Analyzing Data and Building Evals
- LLMOps Software: Using LLMOps software to understand agent runs and identify failure sources.
- Internal Tooling: Building internal tooling to understand data in a specific domain context and turn failures into evals.
- Feedback Aggregation: Aggregating feedback, clustering, and bucketing failure modes to identify areas for improvement.
- Reasoning Models: Using reasoning models to explain failures and direct attention to interesting aspects of the run.
- Eval Hierarchy: Building different types of evals, including unit test evals, trajectory evals, and AB testing.
7. Types of Evals
- Unit Test Evals: Given the current state, predict the next (n+1) state and check it with simple assertions.
- Trajectory Evals: Evaluating the entire agent run, including all tool calls and artifacts.
- LLM as a Judge: Using an LLM to grade or compare results from evals.
- Rubrics-Based Scoring: Using an LLM to judge runs based on handcrafted rubrics that describe specific aspects to pay attention to.
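A sketch of rubric-based trajectory judging, again assuming an OpenAI-compatible client; the rubric items and JSON shape are illustrative stand-ins for handcrafted rubrics.

```python
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = [
    "The agent cited at least one retrieved document.",
    "The final answer directly addresses the user's question.",
    "No tool call was repeated with identical arguments.",
]

def judge_trajectory(trajectory: str) -> dict:
    """Score a full agent run (all tool calls and artifacts) against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "You are grading an agent run. For each rubric item, answer "
                        'true or false. Respond as JSON: {"verdicts": [true, false, ...]}.'},
            {"role": "user",
             "content": "Rubric:\n" + "\n".join(f"- {item}" for item in RUBRIC)
                        + "\n\nTrajectory:\n" + trajectory},
        ],
    )
    return json.loads(response.choices[0].message.content)
```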
8. Closing Thoughts on Evals
- Don't Obsess Over Metrics: A metric that is already hit 100% of the time stops being a useful target.
- Data Set Division: Dividing the eval data set into a regression pool (cases that must keep passing) and an aspirational pool (cases not yet passing); see the sketch after this list.
- User Satisfaction: The ultimate goal is user satisfaction, not maximizing scores in a lab-like setting.
- AB Testing: Using AB testing to verify improvements with real users.
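A small sketch of the regression/aspirational split described above; the pool names and gating rule are assumptions about how such a split is typically wired up.

```python
def split_pools(results: dict[str, bool]) -> tuple[list[str], list[str]]:
    """results maps eval-case id -> currently passing. Split into the two pools."""
    regression = [case for case, passed in results.items() if passed]
    aspirational = [case for case, passed in results.items() if not passed]
    return regression, aspirational

def release_gate(regression_results: dict[str, bool]) -> bool:
    # The regression pool must stay at 100%; the aspirational pool tracks progress
    # toward user satisfaction but never blocks a release on its own.
    return all(regression_results.values())
```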
9. Evals at the Application Layer
- Focus: Evals for users, apps, and data, not just model releases.
- Unreliability of LLMs: LLMs can be unreliable, which is a significant challenge for AI apps.
- Demo Savvy: AI apps can be demo-savvy yet still fail in production.
- Data Collection: Collecting thumbs up/down data, reading through logs, and using community forums to understand user queries.
- Understanding the Court: Knowing the boundaries of the data (the "court" in the talk's basketball analogy) and testing across the entire court.
10. Building Evals: Data, Tasks, and Scores
- Data: Points on the court representing user queries.
- Task: The way to shoot the ball towards the basket.
- Score: Checking if the shot went in the basket.
- Constants and Variables: Putting constants in data and variables in the task.
- Deterministic Scoring: Leaning towards deterministic scoring and pass/fail for easier debugging.
- Adding Prompts for Scoring: Appending extra instructions to the original prompt so outputs are easier to score.
- Evals in CI: Adding evals to CI to get eval reports for PRs.
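A minimal pytest-style sketch of the data/task/score framing with deterministic pass/fail scoring, suitable for running in CI on every PR; `run_agent` and the cases are placeholders.

```python
import pytest

# Data: points on the court, i.e. user queries plus what a correct answer must contain.
CASES = [
    {"query": "What is our refund window?", "must_contain": "30 days"},
    {"query": "Do you ship internationally?", "must_contain": "yes"},
]

def run_agent(query: str) -> str:
    """Task: placeholder for the real prompt or agent under test."""
    raise NotImplementedError

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["query"])
def test_eval(case):
    # Score: deterministic pass/fail, so a CI failure points at exactly one case.
    output = run_agent(case["query"])
    assert case["must_contain"].lower() in output.lower()
```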
11. Braintrust: End-to-End Developer Platform for AI Products
- Core Concepts: Evals, prompt engineering, evaluability.
- Task: The code or prompt to evaluate.
- Data Set: Real-world examples to run the task against.
- Scores: Logic behind the evals, including LLM as a judge and code-based scores.
- Offline vs. Online Evals: Pre-production iteration vs. real-time tracing in production.
- Improving Evals: Matrix for improving evals based on output and score quality.
12. Braintrust Components: Task, Data Sets, and Scores
- Task: Prompts, agentic workflows, access to tools, mustache templating.
- Data Sets: Real-world examples with inputs, expected outputs, and metadata.
- Scores: Code-based scores (TypeScript or Python) and LLM as a judge.
- Auto Evals: Out-of-the-box scores for quick starts.
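A sketch following Braintrust's documented `Eval` pattern in Python, combining a data set, a task, and an out-of-the-box autoeval score; the project name, data, and task stub are placeholders, and details may differ from the current SDK.

```python
from braintrust import Eval
from autoevals import Levenshtein

def my_task(question: str) -> str:
    """Placeholder for the real task: a prompt, an agentic workflow, or tool-using code."""
    return "Refunds are accepted within 30 days."

Eval(
    "Support-Bot",  # Braintrust project name (placeholder)
    data=lambda: [
        {"input": "What is our refund window?", "expected": "Refunds are accepted within 30 days."},
        {"input": "Do you ship internationally?", "expected": "Yes, to over 40 countries."},
    ],
    task=my_task,
    scores=[Levenshtein],  # auto eval; code-based and LLM-as-a-judge scores also plug in here
)
```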
13. Moving to Production with Braintrust
- Logging: Instrumenting the application with Braintrust's SDK to measure quality on live traffic (sketched below).
- Flywheel Effect: Taking logs from production and adding them back to data sets.
- Online Scoring: Configuring scores to run on logs in production with a sampling rate.
- Early Regression Alerts: Creating automations to alert when scores drop below a threshold.
- Custom Views: Creating custom views to filter logs for human review.
- Human Review: Using a dedicated interface for humans to parse through logs and add scores.
- User Feedback: Logging user feedback from the application.
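A minimal production-logging sketch using Braintrust's Python SDK (`init_logger` and the `traced` decorator); the project name and handler are placeholders, and exact SDK details may have changed since the talk.

```python
from braintrust import init_logger, traced

init_logger(project="Support-Bot")  # placeholder project; spans are sent to Braintrust

@traced  # records inputs, outputs, and timing for each call on live traffic
def answer(question: str) -> str:
    # Placeholder for the real agent. The resulting logs feed the flywheel:
    # online scores, regression alerts, human review, and new data set rows.
    return "Refunds are accepted within 30 days."

answer("What is our refund window?")
```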
14. Bolt Foundry: A Framework for Evals in JavaScript
- Focus: Helping JavaScript developers run evals and create synthetic data.
- Approach: Building prompts in a structured way, analogous to how newspaper articles are structured.
- JSON Validator: Example of a simple JSON validator using an LLM as a grader.
- Samples: Each sample consists of a user message, an assistant response, and an optional score.
- Description: Providing descriptions for team understanding.
- Divergence: Showing where the grader diverged from the ground truth.
15. Korea: Evaluations Taking into Account Human Perception
- Challenge: AI models struggle with simple questions that humans can react to effortlessly.
- Compression and Evaluation: Relating compression techniques like JPEG to how we think about evaluation.
- Perceptual Awareness: Using perceptually aware metrics that account for how humans perceive the world (a small sketch follows this list).
- Relativity of Metrics: Recognizing that metrics are relative and that art can have meaning beyond what is conveyed in the image.
- Evolving Evals: Evolving evals to take into account individual opinions and visual learning styles.
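A small scikit-image sketch in the spirit of the JPEG analogy, contrasting a pixel-wise metric (MSE) with a perceptually aware one (SSIM); the images here are synthetic stand-ins.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(0)
image = rng.random((64, 64))

# Two distortions with similar pixel-wise error but different perceptual impact.
noisy = np.clip(image + rng.normal(0, 0.05, image.shape), 0, 1)  # fine-grained noise
shifted = np.clip(image + 0.05, 0, 1)                            # uniform brightness shift

for name, distorted in [("noise", noisy), ("brightness shift", shifted)]:
    mse = float(np.mean((image - distorted) ** 2))
    score = ssim(image, distorted, data_range=1.0)
    print(f"{name}: MSE={mse:.4f}  SSIM={score:.3f}")
```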
16. Chroma and Kura: Looking at Inputs and Outputs
- Fast Evals: Using a set of query and document pairs to quickly and inexpensively measure the performance of a retrieval system (sketched below).
- Golden Data Set: A set of query and document pairs where the document should come out if the query is put in.
- LLM for Query Generation: Using an LLM to write questions and align them with real-world queries.
- Kura: A library for summarizing conversations, clustering them, building hierarchies of those clusters, and comparing evals across different KPIs.
- Data-Driven Product Roadmap: Using data analysis to define the product roadmap.
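A sketch of a fast retrieval eval over a golden query-to-document set using Chroma's Python client; the documents, queries, and cutoff are illustrative.

```python
import chromadb

client = chromadb.Client()  # in-memory client; use a persistent client in practice
collection = client.create_collection("docs")
collection.add(
    ids=["refund", "shipping"],
    documents=["Refunds are accepted within 30 days of purchase.",
               "We ship to over 40 countries worldwide."],
)

# Golden data set: for each query, the document id that should come back.
GOLDEN = [("How long do I have to return an item?", "refund"),
          ("Can you deliver to France?", "shipping")]

def recall_at_k(k: int = 2) -> float:
    hits = 0
    for query, expected_id in GOLDEN:
        result = collection.query(query_texts=[query], n_results=k)
        hits += expected_id in result["ids"][0]
    return hits / len(GOLDEN)

print(f"recall@2 = {recall_at_k():.2f}")
```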
17. 2025: The Year of the Evals
- Thesis: ML monitoring and evaluation are two sides of the same coin.
- Three Concurrent Events: AI became understandable to non-technical executives, budget freezes forced focus on specific projects, and AI systems are now acting on behalf of humans.
- Agents: Agents are starting to make decisions and take actions, increasing complexity and risk.
- Connecting to Business KPIs: The key is to connect evaluations to downstream business KPIs.
- C-Suite Alignment: The CEO, CFO, CTO, CIO, and CISO are all aligned on the need to understand how AI systems are evaluated.
Synthesis/Conclusion of the Main Takeaways
The YouTube video transcript emphasizes the critical role of evaluations (evals) in building reliable and effective AI systems. It highlights the importance of moving beyond generic, off-the-shelf solutions and investing in custom evals that are tailored to specific use cases and data. The speakers stress the need for a data-driven approach, where user feedback and production logs are continuously incorporated into the eval process. They also advocate for a holistic view of AI systems, considering not just the prompt but also the tools, data, and scoring functions. The transcript underscores that 2025 is poised to be the year when evals become a top priority for enterprises, driven by the increasing adoption of agentic systems and the need to quantify risk and ROI.