[Evals Workshop] Mastering AI Evaluation: From Playground to Production

By AI Engineer


Key Concepts

  • Evals: Structured tests to measure the quality, reliability, and correctness of AI systems.
  • Tasks: The code or prompt being evaluated, requiring an input and an output.
  • Data Sets: Real-world examples or test cases used to evaluate the task.
  • Scores: Logic for grading the task's output, producing a value from 0 to 1 (effectively a percentage). Can be LLM-as-a-judge or code-based.
  • Offline Evals: Structured testing in development for proactive issue identification.
  • Online Evals: Measuring the quality of live traffic in production, enabling real-time diagnostics and user feedback.
  • LLM as a Judge: Using a language model to score outputs based on subjective or contextual criteria.
  • Code-Based Score: Deterministic scoring using code to evaluate outputs based on objective criteria.
  • Playgrounds: UI for quick iteration of prompts, scores, and data sets.
  • Experiments: UI for comparing eval results over time and tracking changes.
  • SDK (Software Development Kit): A set of tools and libraries that allows developers to interact with the Braintrust platform programmatically.
  • Logging: Capturing production data to observe user interactions and identify gaps.
  • Online Scoring Rules: Defining which scores to use on live traffic and setting a sampling rate.
  • Custom Views: Customized lenses on logs with filters, sorts, and custom columns.
  • Human-in-the-Loop: Incorporating human review and user feedback to improve AI systems.

1. Introduction to Evals

  • Evals are crucial for answering key questions about AI systems, such as model selection, cost optimization, performance in edge cases, brand consistency, and bug detection.
  • The best LLMs don't guarantee consistent performance, necessitating a testing framework. Hallucinations and performance degradation are common issues.
  • Evals cut development time, reduce costs by automating manual review, enable faster iteration and releases, optimize model selection, improve quality, and scale teams.
  • Braintrust focuses on prompt engineering, measuring improvements and regressions with evals, and AI observability.

2. Core Components of Braintrust Evals

  • Task: The code or prompt to evaluate. Can be a single LLM call or a complex agentic workflow. Requires an input and an output. Dynamic templating with Mustache is supported. Multi-turn chats and tool calls can be evaluated.
  • Data Set: The set of real-world examples or test cases. Requires an input column. Optional columns include expected output and metadata. Start small, iterate, use synthetic data initially, and implement human review.
  • Score: The logic for grading the output. Can be LLM-as-a-judge (subjective, contextual) or code-based (deterministic, binary). Outputs a score from 0 to 1. Use a higher-quality model for LLM-as-a-judge. Focus the LLM-as-a-judge on specific criteria.
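A code-based score from the list above can be as small as a single pure function. Here is a minimal sketch in TypeScript; the function names are illustrative, not the Braintrust SDK's scorer interface:

```typescript
// Deterministic, code-based scorers: each maps (output, expected)
// to a score in [0, 1], as described above.

// Binary: 1 for a case-insensitive exact match, else 0.
function exactMatch(output: string, expected: string): number {
  return output.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;
}

// Graded: fraction of expected tokens that appear in the output.
function tokenOverlap(output: string, expected: string): number {
  const out = new Set(output.toLowerCase().split(/\s+/));
  const exp = expected.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  if (exp.length === 0) return 0;
  return exp.filter((t) => out.has(t)).length / exp.length;
}
```

Because these are deterministic, they are cheap to run on every test case, which is why objective criteria belong in code rather than in an LLM judge.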

3. Offline vs. Online Evals

  • Offline Evals: Structured testing in development for proactive issue identification. Used in the Braintrust playground and via the SDK.
  • Online Evals: Measuring the quality of live traffic in production. Allows diagnosing problems, monitoring performance, and capturing user feedback in real-time.
  • A simple matrix, comparing how good the output looks against how high the score is, tells you whether to improve the evals (output and score disagree) or the AI app (both are low).
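That matrix logic fits in a few lines. The names below are illustrative; the point is that disagreement between output quality and score points at the eval, while agreement on "bad" points at the app:

```typescript
// Output-quality vs. score matrix: when the output and the score
// disagree, the scorer is misjudging, so improve the eval; when
// both are bad, the eval is working and the app needs the fix.
type Verdict = "ship it" | "improve the eval" | "improve the app";

function nextStep(outputLooksGood: boolean, scoreIsHigh: boolean): Verdict {
  if (outputLooksGood && scoreIsHigh) return "ship it";
  if (!outputLooksGood && !scoreIsHigh) return "improve the app";
  return "improve the eval"; // output and score disagree
}
```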

4. Braintrust UI: Playgrounds and Experiments

  • Playgrounds: For quick iteration on prompts, scores, and data sets. Effective for A/B testing. Snapshots can be saved to experiments.
  • Experiments: For comparison over time. Tracks how scores change over weeks/months. Aggregates data from the UI and SDK.
  • Playgrounds are ephemeral, while experiments are long-lived and used for historical analysis.

5. Braintrust SDK

  • The SDK allows defining assets (prompts, scores, data sets) in code and pushing them to Braintrust. This enables version control and consistent usage across environments.
  • The platform supports two modes of working, the UI and the SDK, and places no limits on which one you use; they can be mixed freely.
  • Evals can be defined in code and run with the braintrust eval CLI command.
  • Use the SDK for source-controlled prompt versioning, consistent usage across environments, and online scoring.
  • When running evals, the SDK looks for files matching *.eval.ts.
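The real entry point in such a file is the Eval() function from the braintrust package, which takes a project name plus data, a task, and scores. Because that call needs an API key, the sketch below is a self-contained stand-in with the same three-part shape; every name in it is illustrative, not the SDK's:

```typescript
// Stand-in for the data / task / scores trio an eval file declares.
// The real Eval() reports results to the platform; this mock just
// runs the loop locally and returns the mean score.
type Case = { input: string; expected: string };

function runEvalLocally(
  data: Case[],
  task: (input: string) => string,
  scores: Array<(output: string, expected: string) => number>
): number {
  let total = 0;
  let n = 0;
  for (const c of data) {
    const output = task(c.input);
    for (const score of scores) {
      total += score(output, c.expected);
      n += 1;
    }
  }
  return n === 0 ? 0 : total / n;
}

// A canned "model" standing in for an LLM call, deliberately wrong
// on one case so the eval has something to catch.
const fakeModel: Record<string, string> = { "2+2": "4", "3+3": "7" };

const meanScore = runEvalLocally(
  [
    { input: "2+2", expected: "4" },
    { input: "3+3", expected: "6" },
  ],
  (input) => fakeModel[input] ?? "",
  [(out, exp) => (out === exp ? 1 : 0)]
);
// meanScore is 0.5: one of two cases passed.
```

Keeping the three parts separate is what lets you swap the task (a new prompt or model) while holding the data set and scores fixed, which is the basis for comparing experiments over time.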

6. Logging and Observability

  • Logging is crucial for observability in production. Helps debug, measure quality, and understand user behavior.
  • Send logs to Braintrust via the SDK: initialize a logger and connect it to a project.
  • Use wrapOpenAI to log all communication with OpenAI. Integrations are also available for the Vercel AI SDK and OpenTelemetry (OTel).
  • Use the trace decorator to log arbitrary functions, and span.log to attach metadata.
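To make spans concrete, here is a self-contained sketch of a trace wrapper. In the real SDK this role is played by wrapOpenAI, the trace decorator, and span.log; every name below is a stand-in:

```typescript
// Minimal span logging: wrap a function so each call records its
// input, output, and any metadata attached during the call.
type Span = {
  name: string;
  input: unknown;
  output?: unknown;
  metadata: Record<string, unknown>;
};

const spans: Span[] = []; // the real logger ships these to the platform

function traced<I, O>(name: string, fn: (input: I, span: Span) => O) {
  return (input: I): O => {
    const span: Span = { name, input, metadata: {} };
    const result = fn(input, span);
    span.output = result;
    spans.push(span);
    return result;
  };
}

// Usage: attach metadata from inside the call, like span.log does.
const classify = traced("classify", (text: string, span) => {
  span.metadata["model"] = "stub-model"; // hypothetical metadata
  return text.length > 10 ? "long" : "short";
});
classify("hello");
```

Captured spans are what make the later steps possible: online scoring runs over them, and individual spans can be promoted into data sets.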

7. Online Scoring

  • Online scoring measures the quality of live traffic.
  • Configure online scoring rules in the UI. Define scores and a sampling rate.
  • Set early regression alerts. A/B-test different prompts.
  • Create custom views to filter and sort logs based on specific criteria.
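Sampling is the cost lever in these rules: scoring, say, 10% of live traffic keeps judge costs bounded while still catching regressions. The configuration itself lives in the Braintrust UI; the sketch below only illustrates the per-request sampling decision:

```typescript
// Decide whether a given live log entry gets scored, given a
// sampling rate in [0, 1]. Each request gets an independent draw,
// so on average rate * 100% of traffic is scored.
function shouldScore(samplingRate: number, draw: number = Math.random()): boolean {
  return draw < samplingRate;
}
```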

8. Human-in-the-Loop

  • Human review and user feedback are critical for quality and reliability.
  • Humans can catch hallucinations and ensure the product meets user expectations.
  • Two types: human review (manual labeling and scoring) and real-time user feedback (thumbs up/down).
  • Use the logFeedback function to capture user feedback.
  • Enter human review mode in the platform to focus on relevant fields.
  • Human review is also helpful for evaluating the LLM-as-a-judge scorers themselves.

9. Activity Walkthrough and Examples

  • The presenter walks through setting up a Braintrust project, cloning a repo, configuring API keys, and pushing resources to Braintrust.
  • Demonstrates how to create prompts, data sets, and scores in the UI.
  • Shows how to run evals in the playground and create experiments.
  • Explains how to log data from a running application and configure online scoring rules.
  • Demonstrates how to add spans to a data set and create custom views.

10. Questions and Answers

  • The presenter answers questions about various topics, including:
    • Running Braintrust locally with local models.
    • Bootstrapping a data set.
    • The subjectivity of LLM-as-a-judge scores.
    • The role of traditional machine learning models in evals.
    • Implementing Brain Trust on existing projects.
    • Automating changes to the function being tested.
    • Managing human evals.
    • Using data sets and eval scores for few-shot prompting.

11. Synthesis/Conclusion

The workshop provides a comprehensive overview of Braintrust, a platform for evaluating and improving AI systems. It covers the core concepts of evals, tasks, data sets, and scores, as well as the different modes of operation (offline vs. online, UI vs. SDK). The workshop also emphasizes the importance of logging, online scoring, and human-in-the-loop processes. By following the steps outlined in the workshop, developers can build more reliable, higher-quality AI applications. The key takeaway is to start small, iterate frequently, and leverage both automated evals and human feedback to continuously improve the performance of AI systems.
