Shipping complex AI applications — Braintrust & Trainline
By AI Engineer
Key Concepts
- AI Observability: The practice of monitoring and tracing AI systems to understand their behavior, performance, and failure modes in production.
- Agentic Workflows: Multi-stage AI systems that use tool calling and sequential logic to perform complex tasks (e.g., triage, policy review, and response generation).
- Golden Set: A curated dataset of test cases used to evaluate AI performance and identify regressions.
- LLM-as-a-Judge: Using an AI model to evaluate the output of another AI system based on specific rubrics (e.g., tone, helpfulness, policy compliance).
- Managed Mode: A deployment pattern where prompts, tools, and parameters are stored and versioned in a centralized platform (Braintrust) rather than hardcoded locally.
- Flywheel Effect: The continuous loop of evaluating, identifying failure modes, remediating, and monitoring to improve system quality over time.
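To make the Golden Set and LLM-as-a-Judge concepts concrete, here is a minimal sketch in plain Python. It is not the Braintrust SDK: the `Case` type, the two-item golden set, the scorer names, and the stub agent and judge are all hypothetical stand-ins so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One golden-set entry: an input plus the behavior we expect."""
    user_message: str
    must_mention: str  # keyword a correct reply has to contain


# Tiny hypothetical golden set; a real one would be curated from production traces.
GOLDEN_SET = [
    Case("My train was cancelled, can I get my money back?", "refund"),
    Case("I need to travel a day later than booked.", "change"),
]


def keyword_score(reply: str, case: Case) -> float:
    """Deterministic scorer: 1.0 if the required keyword appears, else 0.0."""
    return 1.0 if case.must_mention.lower() in reply.lower() else 0.0


def judge_score(reply: str, case: Case, judge: Callable[[str], str]) -> float:
    """LLM-as-a-judge scorer: ask a judge model for a 1-5 rating, normalize to 0..1."""
    rubric = (
        "Rate the reply for tone and helpfulness from 1 to 5. "
        "Answer with a single digit.\n"
        f"User: {case.user_message}\nReply: {reply}"
    )
    raw = judge(rubric).strip()
    # Treat anything that is not a clean 1-5 digit as the lowest rating.
    rating = int(raw) if raw in {"1", "2", "3", "4", "5"} else 1
    return (rating - 1) / 4


def evaluate(agent: Callable[[str], str], judge: Callable[[str], str]) -> dict:
    """Run the agent over the golden set and average each scorer."""
    rows = []
    for case in GOLDEN_SET:
        reply = agent(case.user_message)
        rows.append({
            "keyword": keyword_score(reply, case),
            "judge": judge_score(reply, case, judge),
        })
    return {name: sum(r[name] for r in rows) / len(rows) for name in ("keyword", "judge")}


# Stand-ins (assumptions) so the sketch runs without any API key.
def stub_agent(message: str) -> str:
    if "cancelled" in message:
        return "You are eligible for a refund."
    return "Let me look into that."


def stub_judge(prompt: str) -> str:
    return "4"


scores = evaluate(stub_agent, stub_judge)
print(scores)
```

In practice the stub agent and judge would be replaced by real model calls, and the averaged scores would be compared against a previous run to spot regressions.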
1. Main Topics and Key Points
The workshop focused on transitioning generative AI applications from local prototypes to production-ready, mission-critical systems. The speakers emphasized that the primary hurdle in AI adoption is not the intelligence of the models, but the lack of operational rigor.
- Deterministic vs. Non-deterministic: Traditional software is deterministic (1+1=2), whereas LLM systems are non-deterministic: the same input can produce different outputs from run to run. Successful AI engineering requires a hybrid approach that combines traditional software quality checks with AI-specific evaluation metrics.
- Scaling Challenges: As systems grow, developers face issues with latency, token costs, and model switching. The speakers highlighted the need for structured observability to track "time to first token" and execution paths.
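As a rough illustration of the "time to first token" metric mentioned above, the sketch below times a streamed response. The `fake_stream` generator is a hypothetical stand-in for a streaming model client, and the delays are arbitrary; only the measurement pattern is the point.

```python
import time
from typing import Iterable, Iterator, List


def fake_stream(tokens: List[str], first_delay: float = 0.05, gap: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(first_delay)  # simulated wait before the first token arrives
    for i, token in enumerate(tokens):
        if i:
            time.sleep(gap)  # simulated delay between later tokens
        yield token


def measure_stream(stream: Iterable[str]) -> dict:
    """Consume a token stream, recording time to first token and total latency."""
    start = time.perf_counter()
    first_token_at = None
    chunks = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks.append(token)
    end = time.perf_counter()
    return {
        "text": "".join(chunks),
        "token_count": len(chunks),
        "time_to_first_token_s": None if first_token_at is None else first_token_at - start,
        "total_latency_s": end - start,
    }


stats = measure_stream(fake_stream(["Your", " train", " is", " on", " time."]))
print(f"TTFT: {stats['time_to_first_token_s']:.3f}s, total: {stats['total_latency_s']:.3f}s")
```

Logging these two numbers alongside token counts per request is one simple way to see whether a model switch changes perceived responsiveness.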
2. Real-World Application: Trainline
The team from Trainline shared their experience managing a complex travel assistant.
- Scale: They handle 6.3 billion ticket searches and serve 27 million active users.
- Use Case: Their travel assistant is a multi-agent system capable of handling refunds, changing tickets, and providing disruption alerts.
- Problem: They struggled with high costs from API providers (OpenAI/Anthropic) and the difficulty of ensuring that switching to cheaper models did not degrade performance.
- Solution: Using Braintrust, they performed offline evaluations to simulate performance before deployment and used online observability to monitor production behavior.
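The model-switching problem described above can be framed as a simple offline gate: switch only if the candidate is cheaper and its evaluation score stays within a tolerance of the baseline. The sketch below assumes per-case scores and token counts have already been collected; the model names, prices, scores, and the 0.02 tolerance are invented for illustration and are not Trainline's figures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelRun:
    """Offline evaluation results for one model on the same golden set."""
    name: str
    price_per_1k_tokens: float  # hypothetical price, in any currency
    scores: List[float]         # one score per test case, 0..1
    tokens_used: List[int]      # tokens consumed per test case


def summarize(run: ModelRun) -> dict:
    """Aggregate a run into mean quality and total cost."""
    return {
        "mean_score": sum(run.scores) / len(run.scores),
        "cost": sum(run.tokens_used) / 1000 * run.price_per_1k_tokens,
    }


def regressed_cases(baseline: ModelRun, candidate: ModelRun) -> List[int]:
    """Indices of test cases where the candidate scores below the baseline."""
    return [i for i, (b, c) in enumerate(zip(baseline.scores, candidate.scores)) if c < b]


def safe_to_switch(baseline: ModelRun, candidate: ModelRun, max_score_drop: float = 0.02) -> bool:
    """Allow the switch only if it saves money without a meaningful quality drop."""
    b, c = summarize(baseline), summarize(candidate)
    cheaper = c["cost"] < b["cost"]
    quality_held = c["mean_score"] >= b["mean_score"] - max_score_drop
    return cheaper and quality_held


# Illustrative numbers only.
baseline = ModelRun("large-model", 0.01, [1.0, 1.0, 0.5, 1.0], [400, 600, 500, 500])
candidate = ModelRun("small-model", 0.002, [1.0, 0.5, 0.5, 1.0], [400, 600, 500, 500])

print(summarize(baseline), summarize(candidate))
print("regressed cases:", regressed_cases(baseline, candidate))
print("safe to switch:", safe_to_switch(baseline, candidate))
```

Here the cheaper model is rejected because one case regresses enough to pull the mean below the tolerance; the list of regressed indices points to the traces worth inspecting.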
3. Step-by-Step Methodology
The workshop outlined a framework for building and operationalizing AI agents:
- Scaffolding: Building a basic agent with a single-shot prompt.
- Decomposition: Breaking monolithic LLM calls into multi-stage agentic workflows (e.g., Triage → Policy Review → Reply Writer → Escalation).
- Tracing: Instrumenting the application to track nested tool calls and metadata.
- Evaluation: Creating a "Golden Set" of test cases and using both deterministic and LLM-as-a-judge scoring functions.
- Managed Deployment: Offloading prompts and parameters to a secure, versioned environment to allow cross-functional collaboration (e.g., allowing non-technical product managers to update prompts).
- Remediation: Identifying production failures, modifying prompts, and running evaluations to ensure the fix improves performance without regressions.
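The decomposition and tracing steps above can be sketched together. The example below splits a request into the four stages named in the workshop (Triage, Policy Review, Reply Writer, Escalation) and records nested spans with a small hand-rolled tracer. The `Tracer` class, the rule-based stage logic, and the policy text are assumptions for illustration; a real system would call models and a tracing SDK instead.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Minimal in-memory tracer that records nested spans."""

    def __init__(self):
        self.spans = []   # finished spans, in completion order
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name, **metadata):
        parent = self._stack[-1]["name"] if self._stack else None
        record = {"name": name, "parent": parent, "metadata": dict(metadata)}
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_s"] = time.perf_counter() - start
            self._stack.pop()
            self.spans.append(record)


tracer = Tracer()
NO_POLICY = "No specific policy."


def triage(message):
    with tracer.span("triage") as span:
        category = "refund" if "refund" in message.lower() else "general"
        span["metadata"]["category"] = category
        return category


def policy_review(category):
    with tracer.span("policy_review", category=category):
        # Nested span standing in for a tool call made during this stage.
        with tracer.span("lookup_policy_tool"):
            policies = {"refund": "Refunds are allowed for cancelled services."}
            return policies.get(category, NO_POLICY)


def write_reply(policy):
    with tracer.span("reply_writer"):
        return f"Thanks for reaching out. {policy}"


def needs_escalation(policy):
    with tracer.span("escalation") as span:
        escalate = policy == NO_POLICY
        span["metadata"]["escalate"] = escalate
        return escalate


def handle(message):
    with tracer.span("handle_request"):
        category = triage(message)
        policy = policy_review(category)
        return {"reply": write_reply(policy), "escalate": needs_escalation(policy)}


result = handle("I would like a refund for my cancelled train.")
for span in tracer.spans:
    print(f"{span['name']} (parent={span['parent']}) {span['duration_s'] * 1000:.2f} ms")
```

Because each stage is its own span, a failure can be attributed to a specific step (for example, a wrong triage category) rather than to the request as a whole.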
4. Key Arguments
- Observability is Table Stakes: The speakers argued that if you are running an AI application in production without tracing, you lack the visibility required to debug failures.
- Collaboration: AI development should be cross-functional. By moving prompts into a managed platform, teams can avoid the "tap on the shoulder" bottleneck where product managers must rely on engineers to change simple text prompts.
- Continuous Improvement: Perfection is the enemy of good. Start with an evaluation set, even if it is small, and iterate continuously.
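The collaboration argument rests on prompts living outside the codebase in a versioned store. A minimal sketch of that idea follows, assuming an in-memory registry; `PromptRegistry`, the `reply-writer` slug, and the prompt texts are hypothetical and do not represent Braintrust's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PromptRegistry:
    """In-memory stand-in for a managed, versioned prompt store."""
    _versions: Dict[str, List[str]] = field(default_factory=dict)

    def publish(self, slug: str, text: str) -> int:
        """Store a new version of a prompt and return its 1-based version number."""
        versions = self._versions.setdefault(slug, [])
        versions.append(text)
        return len(versions)

    def load(self, slug: str, version: Optional[int] = None) -> str:
        """Return the latest version by default, or a pinned version if given."""
        versions = self._versions[slug]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise ValueError(f"{slug} has no version {version}")
        return versions[version - 1]


registry = PromptRegistry()

# A product manager could publish these without touching application code.
v1 = registry.publish("reply-writer", "Answer the customer politely.")
v2 = registry.publish("reply-writer", "Answer the customer politely and cite the relevant policy.")


def build_request(user_message: str, version: Optional[int] = None) -> dict:
    """Application code references the prompt by slug, never by hardcoded text."""
    return {"system": registry.load("reply-writer", version), "user": user_message}


print(build_request("Can I get a refund?"))             # uses the latest version
print(build_request("Can I get a refund?", version=1))  # pinned, e.g. for rollback
```

Pinning a version gives a simple rollback path if a prompt change turns out to regress on the evaluation set.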
5. Notable Quotes
- "Traditional software engineering is very deterministic... LLM systems are having to adjust and make sure that we're delivering that to scale." (Jirean)
- "Braintrust enables you to look inside complex agentic workflows up to like tool call level, token level, which is very insightful and helps you debug a lot of things in production." (May I)
- "There is no substitute for real-world data." (Jirean)
6. Synthesis and Conclusion
The workshop provided a comprehensive roadmap for moving AI from "demo state" to "production grade." By treating AI systems with the same operational rigor as traditional software—using tracing, versioned prompts, and automated evaluation—organizations can ship faster and with higher confidence. The core takeaway is that observability and evaluation are not optional; they are the essential infrastructure required to manage the non-deterministic nature of modern AI agents.