Playground in Prod: Optimising Agents in Production Environments — Samuel Colvin, Pydantic

By AI Engineer

Key Concepts

  • Pydantic AI: An agent framework built on the Pydantic library for structured data validation and LLM interaction.
  • Logfire: An observability platform for AI applications, supporting OpenTelemetry, metrics, traces, and prompt management.
  • Jepper (the name as rendered here; likely GEPA, the Genetic-Pareto prompt optimizer): An evolutionary optimization library used to evolve and improve prompts or agent configurations.
  • Managed Variables: A feature in Logfire allowing dynamic updates to prompts, models, and configurations in production without redeploying code.
  • Evals (Evaluations): The process of testing agent performance against a "golden dataset" to measure accuracy and reliability.
  • Pareto Frontier: The set of candidates (prompts) that are not outperformed across the board by any other candidate. Jepper selects and breeds candidates from this set to move toward a better solution.
  • AI Gateway: A service providing a unified API key to access multiple LLM providers (OpenAI, Anthropic, etc.) with built-in observability and fallback mechanisms.
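The Pareto frontier idea can be illustrated with a minimal sketch. It assumes each candidate prompt has one score per golden-dataset example; the `Candidate` class, the prompt names, and the scores are hypothetical and are not taken from Jepper's API or from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical prompt candidate with its per-example eval scores."""
    prompt: str
    scores: tuple[float, ...]  # one score per golden-dataset example


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a.scores, b.scores))
    strictly_better = any(x > y for x, y in zip(a.scores, b.scores))
    return at_least_as_good and strictly_better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]


candidates = [
    Candidate("prompt A", (1.0, 0.0, 1.0)),
    Candidate("prompt B", (0.0, 1.0, 1.0)),
    Candidate("prompt C", (0.0, 0.0, 1.0)),  # dominated by both A and B
]
frontier = pareto_frontier(candidates)
print([c.prompt for c in frontier])  # ['prompt A', 'prompt B']
```

Prompts A and B both survive because each solves an example the other misses, which is why a frontier can hold more useful breeding material than a single top scorer.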

1. Main Topics and Key Points

Samuel Colvin, creator of Pydantic, presented a workflow for optimizing AI agents using Jepper and Logfire. The core argument is that while LLMs are powerful, they require rigorous evaluation and optimization, especially when dealing with private, domain-specific data where general-purpose models may struggle.

  • Observability: Samuel argues that "AI observability" is a feature, not a standalone category, and will eventually be absorbed into general observability.
  • Optimization Strategy: Instead of manual prompt engineering, the talk demonstrates using an agent to optimize another agent. This involves running evaluations, comparing results, and using genetic algorithms to iterate toward a "Pareto frontier" of high-performing prompts.
  • Performance Metrics: The demo showed an improvement in accuracy from 87% to 96.7% by using Jepper to evolve the system prompt for a political relation extraction task.
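To put the reported accuracy figures in perspective, here is a small worked calculation. It uses only the two numbers quoted above (87% and 96.7%); the variable names are illustrative.

```python
before_acc = 0.87
after_acc = 0.967

# Error rate is the complement of accuracy.
before_err = 1 - before_acc
after_err = 1 - after_acc

abs_gain = after_acc - before_acc                          # gain in accuracy
rel_err_reduction = (before_err - after_err) / before_err  # share of errors removed

print(f"+{abs_gain * 100:.1f} points; errors reduced by {rel_err_reduction:.1%}")
# +9.7 points; errors reduced by 74.6%
```

Read this way, the 9.7-point gain corresponds to removing roughly three quarters of the remaining errors.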

2. Real-World Applications

  • Political Data Analysis: The speaker used an agent to scrape Wikipedia and identify political ancestors of UK MPs, demonstrating how to handle structured output and filter out irrelevant relations (e.g., spouses).
  • Finance/Legal: The speaker mentioned analyzing millions of invoices or legal documents. In these high-volume, high-stakes environments, optimizing prompts for smaller, cheaper models (like Qwen 3.5 Mini) can save millions of dollars compared to using top-tier models like GPT-4.
  • Shopify Case Study: Shopify transitioned from feeding entire websites to GPT-4 to using an agentic approach with Qwen and Jepper, reducing costs from $5M/year to ~$73k/year while maintaining performance.
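The scale of the Shopify saving can be worked out from the two figures quoted above. Because the "after" figure is stated as approximate (~$73k), the derived numbers are approximate too.

```python
annual_before = 5_000_000  # USD per year, as quoted
annual_after = 73_000      # USD per year, approximate figure as quoted

savings = annual_before - annual_after
reduction = savings / annual_before    # fraction of cost removed
factor = annual_before / annual_after  # how many times cheaper

print(f"saves ${savings:,}/year ({reduction:.1%} lower, about {factor:.0f}x cheaper)")
# saves $4,927,000/year (98.5% lower, about 68x cheaper)
```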

3. Methodologies and Frameworks

  • The Evaluation Loop:
    1. Define Schema: Use Pydantic models to define the expected output structure.
    2. Golden Dataset: Create a reference dataset of "correct" answers.
    3. Run Evals: Execute the agent against the dataset and compare outputs to the golden set.
    4. Optimize: Use Jepper to propose new prompts based on the evaluation results.
    5. Deploy: Use Logfire Managed Variables to push the optimized prompt to production without redeploying.
  • Genetic Optimization (Jepper): Jepper treats the prompt as a string (or JSON) and "breeds" successful candidates by combining their best components, effectively hill-climbing toward a higher accuracy score.
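The evaluate-then-breed loop described above can be sketched in a few lines. This is a toy under stated assumptions, not Jepper's API: `stub_agent` is a deterministic stand-in for an LLM agent, the instruction fragments and golden cases are invented for illustration, and selection is a plain top-k ranking rather than a Pareto frontier.

```python
from itertools import combinations

# Hypothetical instruction fragments a prompt can be assembled from.
FRAGMENTS = ("Exclude spouses.", "Exclude in-laws.", "Be concise.")

# Hypothetical golden dataset: (relations found on a page, expected names).
GOLDEN = [
    ([("Ann", "mother"), ("Bob", "spouse")], {"Ann"}),
    ([("Cy", "father-in-law"), ("Di", "grandfather")], {"Di"}),
    ([("Ed", "spouse"), ("Flo", "sister-in-law")], set()),
]


def stub_agent(prompt: str, relations: list[tuple[str, str]]) -> set[str]:
    """Stand-in for an LLM agent: obeys only the rules present in the prompt."""
    kept = set()
    for name, relation in relations:
        if "Exclude spouses." in prompt and relation == "spouse":
            continue
        if "Exclude in-laws." in prompt and relation.endswith("-in-law"):
            continue
        kept.add(name)
    return kept


def render(fragments: frozenset) -> str:
    """Turn a set of fragments into a system prompt string."""
    return " ".join(["List the political ancestors of this MP.", *sorted(fragments)])


def accuracy(prompt: str) -> float:
    """Fraction of golden cases the agent reproduces exactly."""
    hits = sum(stub_agent(prompt, rels) == expected for rels, expected in GOLDEN)
    return hits / len(GOLDEN)


def optimize(generations: int = 3, keep: int = 2) -> frozenset:
    """Rank candidates by accuracy, keep the best, and breed them by union."""
    population = [frozenset(), *(frozenset([f]) for f in FRAGMENTS)]
    for _ in range(generations):
        ranked = sorted(
            population, key=lambda p: (-accuracy(render(p)), len(p), sorted(p))
        )
        parents = ranked[:keep]
        children = [a | b for a, b in combinations(parents, 2)]  # crossover
        population = list(dict.fromkeys(parents + children))     # dedupe, keep order
    return max(population, key=lambda p: accuracy(render(p)))


best = optimize()
print(render(best), accuracy(render(best)))
```

In this toy, each single-fragment prompt solves only one golden case, and the child that combines both exclusion rules solves all three. A real run would replace `stub_agent` with calls to the actual agent and would let the optimizer propose new prompt text instead of recombining fixed fragments.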

4. Key Arguments

  • Deterministic vs. LLM-as-a-Judge: Samuel advocates for deterministic evals (comparing against a golden set) over "LLM-as-a-judge," which he describes as "the lunatics running the asylum."
  • The Value of Private Data: Optimization is most valuable when models have not been trained on the specific domain data. In these cases, context engineering and prompt optimization are critical.
  • Avoid Over-Engineering: He notes that for many projects, "vibes-based" development is sufficient, but for high-scale enterprise tasks, rigorous evals and optimization are mandatory.
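A deterministic eval of the kind advocated above needs no judge model: it compares structured output with the golden answer directly. The sketch below is illustrative only. A frozen dataclass stands in for the Pydantic output model to keep the example dependency-free, and the `Relation` fields, names, and scoring conventions are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical structured output: one relative of an MP."""
    name: str
    relationship: str


def score_case(predicted: set[Relation], expected: set[Relation]) -> dict[str, float]:
    """Compare predicted relations with the golden set, with no LLM involved."""
    true_pos = len(predicted & expected)
    # Convention assumed here: an empty set counts as perfect precision/recall.
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(expected) if expected else 1.0
    return {
        "precision": precision,
        "recall": recall,
        "exact": float(predicted == expected),
    }


expected = {Relation("A. Example", "father"), Relation("B. Example", "grandmother")}
predicted = {Relation("A. Example", "father"), Relation("C. Example", "spouse")}

result = score_case(predicted, expected)
print(result)  # {'precision': 0.5, 'recall': 0.5, 'exact': 0.0}
```

The same inputs always produce the same score, so a change in the numbers can be attributed to the prompt or model rather than to variance in a judge.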

5. Notable Quotes

  • "I don't really believe in AI observability; I think it's a feature, not a category."
  • "The ultimate eval is wait 40 years and see when they died... but obviously you can't have an eval where you wait 40 years." (On the difficulty of defining "correct" in open-ended tasks).
  • "The big model labs say don't bother fine-tuning... but that misses the cases in finance where they have enormous numbers of runs where they really do care about that optimization."

6. Synthesis and Conclusion

The presentation highlights a shift from manual prompt crafting to autonomous agent optimization. By combining Pydantic AI for structure, Logfire for observability and dynamic configuration, and Jepper for genetic optimization, developers can systematically improve agent performance. The primary takeaway is that while "vibes" work for simple tasks, enterprise-grade AI requires a robust pipeline of golden datasets, automated evals, and the ability to update logic in production without redeployment.
