Fighting AI with AI — Lawrence Jones, Incident

By AI Engineer

Key Concepts

AI SRE (Site Reliability Engineering): Using AI to automate production investigations, log analysis, and incident response.
Evals (Evaluations): Automated "unit tests" for AI prompts, using YAML files to define input, expected output, and grading criteria.
Coding Agents: AI-driven tools (e.g., Claude Code) that interact with codebases and file systems to debug and modify software.
Backtesting: Running a batch of historical investigations against current AI models to measure performance improvements or regressions.
File System Packaging: Converting complex UI-based AI traces and logs into structured, downloadable file systems for agent consumption.
AI Runbooks: Repeatable, agent-driven pipelines for analyzing system performance and identifying root causes.
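
The backtesting idea defined above can be illustrated with a minimal sketch. It assumes each historical investigation has already been graded as pass or fail under a baseline and a candidate version of the system; the function name `compare_backtests` and the investigation IDs are hypothetical and do not come from the talk.

```python
def compare_backtests(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Classify each historical investigation by how its outcome changed."""
    # Only compare investigations that were run under both versions.
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "regressions": [i for i in shared if baseline[i] and not candidate[i]],
        "improvements": [i for i in shared if not baseline[i] and candidate[i]],
    }


# Hypothetical graded results: True means the investigation was judged correct.
baseline = {"inv-1": True, "inv-2": False, "inv-3": True}
candidate = {"inv-1": True, "inv-2": True, "inv-3": False}

diff = compare_backtests(baseline, candidate)
print(diff)
```

Reporting regressions and improvements separately, rather than a single pass rate, makes it visible when a change fixes one investigation while breaking another.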

1. Managing AI Complexity with Internal Tools

Lawrence, a founding engineer at incident.io, explains that their platform automates production investigations by cross-referencing logs, metrics, traces, and historical incident data. Because these systems involve hundreds of prompts and tool calls, they have become too complex for humans to debug by hand. The core strategy is to use AI to manage the AI, ensuring that the internal tooling is as sophisticated as the product itself.

2. The Eval Red-Green Cycle

The team treats evals as AI unit tests.

Methodology: Evals are stored in YAML files alongside Go code. Each eval defines grading criteria (e.g., "meaning preservation" or "style adherence").
The Problem: As incident reports grew in size, YAML files became too large for coding agents to process due to context window limits.
The Solution: They built an eval-tool CLI that allows agents to programmatically read, edit, and add test cases. This enables a "Red-Green" cycle:
1. An agent identifies a failure.
2. The agent creates a new eval case to reproduce the failure.
3. The agent modifies the prompt to pass the new test.
4. The agent runs the full suite to ensure no regressions were introduced.
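
The four steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the team's eval-tool: `EvalCase`, `run_suite`, and the two prompt functions are hypothetical names, the "model" is a plain string function, and a substring check stands in for grading criteria such as "meaning preservation".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    input: str
    must_contain: tuple[str, ...]  # stand-in for real grading criteria


def run_suite(prompt_fn: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the prompt and grade its output."""
    results = {}
    for case in cases:
        output = prompt_fn(case.input)
        results[case.name] = all(s in output for s in case.must_contain)
    return results


def prompt_v1(text: str) -> str:
    # Stand-in for the current prompt: keeps only the first sentence.
    return text.split(".")[0] + "."


def prompt_v2(text: str) -> str:
    # Stand-in for the modified prompt: also keeps sentences about a rollback.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    kept = [s for i, s in enumerate(sentences) if i == 0 or "rollback" in s.lower()]
    return ". ".join(kept) + "."


suite = [
    EvalCase("keeps_service_name", "Checkout API returned 500s. Users retried.", ("Checkout API",)),
]
# Steps 1-2: capture the observed failure as a new eval case.
new_case = EvalCase("keeps_remediation", "Checkout API returned 500s. A rollback fixed it.", ("rollback",))

red = run_suite(prompt_v1, [new_case])            # the new case fails first (red)
green = run_suite(prompt_v2, suite + [new_case])  # steps 3-4: fix, then run the full suite
print(red, green)
```

Running the full suite in the last step is what catches a prompt change that fixes the new case at the expense of an older one.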

3. Debugging via File System Packaging

A major breakthrough for the team was moving from UI-based debugging to file-system-based debugging.

Process: Instead of forcing agents to navigate complex UIs, the team exports AI interactions (traces, tool calls, logs) into a structured, downloadable file system.
Application: This data is dropped into a sandbox environment for a coding agent (e.g., Claude Code). Because the data is stored as ordinary files, the agent can "grep" through it, understand the hierarchy of prompts, and pinpoint exactly which part of the system caused an incorrect RCA (Root Cause Analysis).
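
The export-then-search idea can be sketched as follows. The trace shape, the directory naming scheme, and the helper names `export_trace` and `grep` are all assumptions made for illustration; the talk summary does not describe the actual export format.

```python
import tempfile
from pathlib import Path


def export_trace(trace: dict, root: Path) -> None:
    """Write one step of a trace to its own directory, nesting child steps inside it."""
    step_dir = root / f"{trace['index']:02d}-{trace['name']}"
    step_dir.mkdir(parents=True)
    (step_dir / "prompt.txt").write_text(trace["prompt"])
    (step_dir / "output.txt").write_text(trace["output"])
    for child in trace.get("children", []):
        export_trace(child, step_dir)


def grep(root: Path, needle: str) -> list[str]:
    """Return the relative paths of exported files that contain the needle."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*.txt")
        if needle in p.read_text()
    )


# Hypothetical trace: one investigation prompt that made two nested calls.
trace = {
    "index": 0,
    "name": "investigation",
    "prompt": "Find the root cause.",
    "output": "RCA: database failover",
    "children": [
        {"index": 1, "name": "search-logs", "prompt": "Search logs for errors.",
         "output": "Found 'connection refused' from cache"},
        {"index": 2, "name": "summarise", "prompt": "Summarise findings.",
         "output": "database failover"},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    export_trace(trace, root)
    hits = grep(root, "cache")

print(hits)
```

Because the directory nesting mirrors the call hierarchy, the path of a match already tells the agent which step produced the text it was looking for.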

4. Scalable Analysis with AI Runbooks

To manage thousands of investigations across customer accounts, the team uses Backtests and AI Runbooks (stored in a repository called scrapbook).

Parallelization: The pipeline triggers ~25 agents in parallel to analyze individual investigations.
Cohort Clustering: Agents group similar failure types to identify systemic issues rather than isolated bugs.
Actionable Output: The pipeline produces a markdown report that links the failure directly to the relevant code, allowing the developer to fix the issue immediately within the same session.
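
A rough sketch of the pipeline shape described above: fan out, group by failure type, and write a markdown report. The `analyze` function is a keyword-matching stand-in for a real agent run, the failure labels and investigation records are invented for illustration, and the sketch omits the links to code that the real report is said to include.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_AGENTS = 25  # roughly the figure mentioned in the talk


def analyze(investigation: dict) -> tuple[str, str]:
    """Stand-in for one agent run; returns (investigation id, failure label)."""
    text = investigation["notes"].lower()
    if "timeout" in text:
        label = "tool-timeout"
    elif "wrong service" in text:
        label = "misattributed-service"
    else:
        label = "other"
    return investigation["id"], label


def build_report(investigations: list[dict]) -> str:
    # Fan out: each investigation is analyzed independently.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_AGENTS) as pool:
        results = list(pool.map(analyze, investigations))

    # Cohort clustering: group investigations that share a failure label.
    cohorts = defaultdict(list)
    for inv_id, label in results:
        cohorts[label].append(inv_id)

    # Largest cohorts first, so systemic issues appear at the top.
    lines = ["# Backtest report", ""]
    for label, ids in sorted(cohorts.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        lines.append(f"## {label} ({len(ids)})")
        lines.extend(f"- {i}" for i in ids)
        lines.append("")
    return "\n".join(lines)


sample = [
    {"id": "inv-1", "notes": "Log search hit a timeout"},
    {"id": "inv-2", "notes": "RCA blamed the wrong service"},
    {"id": "inv-3", "notes": "Metrics query timeout after 30s"},
]
report = build_report(sample)
print(report)
```

Sorting cohorts by size puts recurring failure types ahead of one-off bugs, which matches the stated aim of finding systemic issues.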

5. Key Arguments and Perspectives

Agents prefer file systems: Lawrence argues that file systems are superior to MCP (Model Context Protocol) or human-in-the-loop UI interactions for debugging because they allow agents to perform bulk analysis and pattern matching efficiently.
Self-Documenting Systems: By structuring AI interactions as files, the system becomes self-documenting, making it easier for agents to navigate the "hierarchy of tools and prompts."
Continuous Improvement: The goal is not just to fix bugs but to use the analysis pipeline to simplify prompts, preventing "prompt bloat" over time.

6. Synthesis and Conclusion

The primary takeaway is that AI engineering requires a robust internal infrastructure. By treating AI interactions as data that can be versioned, tested, and analyzed via agents, teams can move from manual, reactive debugging to a scalable, automated workflow.

Actionable Insights:

Prioritize Debugging Tools: Build tools that allow agents to interact with your system's internal state as easily as they interact with code.
Standardize Data: Convert complex AI traces into text-based file formats so that agents can search and read them directly.
Automate Analysis: Use parallelized agent pipelines to perform cohort analysis on production failures to identify systemic trends.
