Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI
By AI Engineer
Key Concepts
- LLM-as-a-Judge: Using a Large Language Model to evaluate the outputs or behaviors of another AI agent.
- Calibration: Aligning the LLM-as-a-Judge’s scoring with human expert annotations to ensure reliability.
- GEPA (Genetic-Pareto): An optimization algorithm that uses evolutionary strategies (mutation and merging) together with Pareto-based selection to refine prompts.
- Prompt Optimization: The iterative process of improving a prompt to achieve better performance on specific evaluation tasks.
- Pareto Frontier: A selection strategy in optimization that prioritizes diversity by identifying the best-performing candidates for specific subsets of tasks.
- LLMOps: The operational lifecycle of building, monitoring, and evaluating LLM applications (e.g., Agenta).
- Data Flywheel: A self-improving loop where observability traces lead to new evaluations, which in turn optimize the system.
1. The Problem: Unreliable Evaluation
The speaker highlights a common failure in production: relying on generic "hallucination checks" that lack context. If an agent is failing, a generic judge often cannot identify why because it lacks the specific business logic or policy constraints required for the task. The bottleneck in AI development is the evaluation loop; human annotation is accurate but slow, while uncalibrated LLM judges provide "useless signals" that lead to fast but incorrect iterations.
2. Methodology: Building a Calibrated Judge
The speaker proposes a four-step framework for building a reliable judge:
- Design Metrics: Define specific, business-relevant axes (e.g., policy adherence, tool usage) rather than generic scores.
- Data Curation & Annotation: Use subject matter experts to annotate traces. Crucially, include reasoning in the annotations, as this provides the "why" that the LLM needs to learn.
- Optimization (GEPA): Use the GEPA algorithm to iteratively improve the judge's prompt.
- Validation: Test the optimized judge against a held-out validation set.
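The validation step above can be sketched in a few lines: run the judge over a held-out set of expert-annotated traces and measure agreement. This is an illustrative sketch, not Agenta's implementation; `run_judge` is a stub standing in for whatever LLM call produces a verdict, and the traces and labels are invented.

```python
# Hypothetical sketch: validating an optimized judge against held-out
# expert annotations. `run_judge` stands in for an LLM call; it is stubbed
# here so the example is self-contained.

def run_judge(prompt: str, trace: str) -> str:
    """Placeholder for an LLM call returning 'compliant' or 'non_compliant'."""
    return "compliant" if "verified cancellation policy" in trace else "non_compliant"

def validate(prompt: str, holdout: list[dict]) -> float:
    """Agreement rate between judge verdicts and expert labels."""
    hits = sum(run_judge(prompt, ex["trace"]) == ex["label"] for ex in holdout)
    return hits / len(holdout)

holdout = [
    {"trace": "agent verified cancellation policy before refund", "label": "compliant"},
    {"trace": "agent issued refund without checking policy", "label": "non_compliant"},
]
print(validate("judge prompt v3", holdout))  # 1.0 on this toy set
```

In practice the held-out set must contain traces the optimizer never saw, otherwise the agreement rate overstates how well the judge generalizes.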
3. The GEPA Optimization Process
GEPA functions like a genetic algorithm:
- Seed Candidate: Start with a baseline prompt (e.g., "Assume the agent is compliant unless there is a specific reason not to be").
- Mutation: Use an LLM to reflect on failed evaluations and rewrite the prompt to address those specific errors.
- Merging: Combine successful prompts to synthesize a more robust set of instructions.
- Pareto Selection: Instead of averaging scores, select a set of prompts that collectively cover all edge cases in the test set.
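The Pareto selection step can be sketched as follows. This is a simplified illustration, not GEPA's actual implementation: candidate names and per-task pass/fail scores are invented, and real GEPA tracks richer feedback than binary scores.

```python
def pareto_select(candidates: dict[str, list[int]]) -> set[str]:
    """Keep prompts that are best on at least one task, then drop dominated ones.

    `candidates` maps a prompt id to its per-task scores (1 = pass, 0 = fail).
    Averaging would crown a single winner; Pareto selection instead keeps a
    diverse pool whose members collectively cover every task in the test set.
    """
    n_tasks = len(next(iter(candidates.values())))
    frontier: set[str] = set()
    for t in range(n_tasks):
        best = max(scores[t] for scores in candidates.values())
        frontier.update(p for p, scores in candidates.items() if scores[t] == best)

    def dominated(p: str) -> bool:
        # p is dominated if another frontier member matches or beats it
        # on every task and differs somewhere.
        return any(
            all(candidates[q][t] >= candidates[p][t] for t in range(n_tasks))
            and candidates[q] != candidates[p]
            for q in frontier if q != p
        )

    return {p for p in frontier if not dominated(p)}

# Illustrative scores for three candidate judge prompts on three test tasks:
scores = {
    "seed":  [1, 0, 0],
    "mut_a": [1, 1, 0],  # a mutation that fixed task 2
    "mut_b": [0, 0, 1],  # the only prompt that passes task 3
}
print(sorted(pareto_select(scores)))  # ['mut_a', 'mut_b']
```

Note that `mut_b` survives despite the lowest average score: it is the only candidate covering task 3, which is exactly the diversity that averaging would destroy.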
4. Key Findings and Real-World Applications
- Case Study: An airline customer support agent built on the τ-bench (Tau-bench) dataset. The goal was to evaluate policy adherence (e.g., verifying cancellation rules before approving a request).
- Data Quality: The speaker emphasizes that the quality of the annotation is the most critical factor. Without clear reasoning in the training data, the model cannot learn to distinguish between compliant and non-compliant behavior.
- Model Selection: Larger models (e.g., GPT-4) are necessary for the reflection (prompt-rewriting) step, while smaller, cheaper models can serve as the judge once the prompt is optimized.
- Performance Gains: In the provided example, accuracy improved from 69% to 74%, and the judge’s bias toward "compliant" was significantly reduced, leading to better recall of non-compliant cases.
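The bias reduction mentioned above is worth measuring explicitly: a judge that defaults to "compliant" can post decent accuracy while missing most violations. A minimal sketch of the recall metric on the non-compliant class, with invented toy labels:

```python
def non_compliant_recall(preds: list[str], labels: list[str]) -> float:
    """Fraction of truly non-compliant traces the judge actually flags.

    A judge biased toward 'compliant' can look fine on accuracy while
    missing violations; tracking this recall separately exposes the bias.
    """
    flagged = [p for p, l in zip(preds, labels) if l == "non_compliant"]
    return flagged.count("non_compliant") / len(flagged)

# Toy data: a biased judge misses one of the two real violations.
labels = ["compliant", "non_compliant", "non_compliant", "compliant"]
biased = ["compliant", "compliant", "non_compliant", "compliant"]
print(non_compliant_recall(biased, labels))  # 0.5
```

Reporting this number alongside overall accuracy makes the 69% → 74% style improvement interpretable: it shows whether the gain came from catching violations or from agreeing on the easy compliant cases.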
5. Notable Quotes
- "The bottleneck in this loop is actually the evaluation. How fast can you evaluate?"
- "It does not make sense to have one LLM as a judge which is 'success'... it makes a lot of sense to make things very specific."
- "If you have access to the agent policy from the beginning, then it's very hard to fine-tune it. You're already stuck in a local minima."
6. Actionable Insights
- Start with a "Compliant" Bias: When designing a judge, instruct it to assume compliance unless evidence suggests otherwise; this reduces false positives.
- Avoid Over-Generalization: Create separate judges for separate error types (e.g., one for policy, one for tool usage) rather than one "master" judge.
- Iterate Small: Do not run massive experiments immediately. Perform small iterations, visualize the traces, and manually inspect the reflection prompts to ensure the algorithm is learning the right logic.
- Cost Management: Optimization is expensive (token-heavy). Invest in the optimization phase to create a highly efficient, smaller prompt that can be run cheaply in production.
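Two of the insights above — seeding the judge with a compliant-by-default instruction, and building one judge per error type — can be combined in a single prompt template. The wording, axis names, and policy text below are illustrative placeholders, not the speaker's actual prompts:

```python
# Hypothetical seed prompt implementing the "assume compliant" default,
# parameterized by evaluation axis so each error type gets its own judge.

JUDGE_TEMPLATE = """You are evaluating an airline support agent for {axis}.
Assume the agent is compliant unless the trace shows a specific violation.
Return exactly one word: compliant or non_compliant.

Policy:
{policy}

Trace:
{trace}"""

def build_judge_prompt(axis: str, policy: str, trace: str) -> str:
    """One judge per error type (policy adherence, tool usage, ...)."""
    return JUDGE_TEMPLATE.format(axis=axis, policy=policy, trace=trace)

prompt = build_judge_prompt(
    axis="policy adherence",
    policy="Cancellations require verifying the 24-hour rule first.",
    trace="Agent cancelled the booking without checking purchase time.",
)
print("Assume the agent is compliant" in prompt)  # True
```

This template is only the seed candidate; GEPA's mutation step is what rewrites it against observed failures, so the hand-written version should stay simple.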
Conclusion
Building a calibrated LLM-as-a-Judge is not a "plug-and-play" task. It requires high-quality, human-reasoned annotations and an iterative optimization process. By using algorithms like GEPA and focusing on specific, granular metrics, developers can create a data flywheel that allows their AI agents to improve automatically over time.