Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic

By AI Engineer

Key Concepts

  • Agentic Harness: The scaffolding (framework) surrounding an LLM that enables it to perform complex, long-running tasks.
  • Context Rot/Anxiety: Context rot is the degradation of coherence as the context window fills; context anxiety is the model's "nervousness" as it approaches the window's limit.
  • Adversarial Evaluation (GAN-style): A pattern where a "Generator" agent builds code and a separate "Evaluator" agent (using tools like Playwright) critiques it, creating a feedback loop.
  • Compaction: The process of summarizing or condensing session history to maintain coherence over long durations.
  • Agent SDK: A framework providing primitives (tools, sub-agents, permission systems) for building autonomous agents.
  • Contract-based Development: A methodology where agents negotiate specific, testable criteria before beginning a task.
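
As a rough illustration of the compaction idea above, the sketch below keeps the most recent messages verbatim and replaces everything older with a single summary entry. The function name `compact`, the `summarize` callback, and the `keep_recent` and `max_items` parameters are all hypothetical; the talk does not specify how compaction is implemented, and a real harness would most likely ask a model to write the summary.

```python
from __future__ import annotations

from typing import Callable


def compact(
    history: list[str],
    summarize: Callable[[list[str]], str],
    keep_recent: int = 4,
    max_items: int = 8,
) -> list[str]:
    """Condense old history into one summary entry once it grows too long.

    All names and thresholds here are illustrative assumptions.
    """
    if len(history) <= max_items:
        return list(history)  # still short enough, nothing to condense
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    return [f"[summary] {summarize(older)}"] + recent


# Stand-in summarizer; a real one would call a model.
def count_summary(messages: list[str]) -> str:
    return f"{len(messages)} earlier messages condensed"


history = [f"step {i}" for i in range(10)]
compacted = compact(history, count_summary)
```

Here `compacted` holds one summary entry followed by the last four steps, so the session keeps its recent detail while the older history shrinks to a single line.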

1. Evolution of Long-Running Agents

The speakers, Ash and Andrew from Anthropic, trace the evolution of Claude’s ability to run for extended periods (from 20-minute sessions to 12+ hour autonomous runs).

  • Historical Progression:
    • Early Stage: Models struggled with basic bash commands and string escaping.
    • Mid-Stage: Introduction of "Computer Use" (taking screenshots/clicking) and MCP (Model Context Protocol) for tool usage.
    • Current State: Models like Opus 4.6 and Sonnet 4.6 exhibit high "agentic" intelligence, capable of managing their own context and planning complex workflows.
  • Key Shift: The industry has moved from "one-shotting" tasks to building persistent, multi-hour agentic loops.

2. The "Generator-Evaluator" Framework

The core methodology presented for building robust, long-running agents is an adversarial harness inspired by Generative Adversarial Networks (GANs).

  • The Roles:
    • Planner: Breaks a vague prompt into high-level sprints.
    • Generator: Writes the code for a specific feature.
    • Evaluator: Uses tools (e.g., Playwright) to interact with the live application, verify functionality, and score the output against a rubric.
  • The "Contract" Mechanism: Before building, the Generator and Evaluator negotiate what "done" looks like. This creates a testable contract, preventing the model from "rubber-stamping" its own work.
  • Adversarial Pressure: If the Generator fails to meet the rubric, the harness can discard the work and restart, rather than attempting to patch broken code indefinitely.
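
The roles, contract, and discard-and-restart behavior described above can be sketched as a small control loop. This is a minimal sketch under stated assumptions, not the speakers' harness: `run_sprint`, `Verdict`, the three callbacks, and the `max_patches` and `max_restarts` limits are hypothetical names, and the stub roles stand in for real model calls and Playwright checks.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    feedback: str


def run_sprint(
    feature: str,
    negotiate: Callable[[str], list[str]],
    generate: Callable[[str, list[str], str], str],
    evaluate: Callable[[str, list[str]], Verdict],
    max_patches: int = 2,
    max_restarts: int = 1,
) -> Optional[str]:
    """Return accepted work, or None if every attempt fails the contract."""
    # The contract is agreed before any code is written.
    contract = negotiate(feature)
    for _restart in range(max_restarts + 1):
        feedback = ""  # a restart discards earlier work and its feedback
        for _patch in range(max_patches + 1):
            work = generate(feature, contract, feedback)
            verdict = evaluate(work, contract)
            if verdict.passed:
                return work
            feedback = verdict.feedback  # next attempt patches against this
    return None


# Stub roles for illustration only; a real harness would call a model here.
def demo_negotiate(feature: str) -> list[str]:
    return [f"{feature}: form renders", f"{feature}: bad password shows an error"]


def demo_generate(feature: str, contract: list[str], feedback: str) -> str:
    return "draft with error message" if feedback else "draft"


def demo_evaluate(work: str, contract: list[str]) -> Verdict:
    ok = "error message" in work
    return Verdict(ok, "" if ok else "error message is missing")


result = run_sprint("login", demo_negotiate, demo_generate, demo_evaluate)
```

Because the Evaluator is a separate callable that only sees the work and the contract, the Generator cannot approve its own output. Capping the number of patches before a restart is one way to stop the loop from patching broken code indefinitely.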

3. Methodologies for Success

  • Persistent State: Instead of relying solely on context windows, use the file system (JSON files) to store progress, feature lists, and logs. This allows for "fresh" context windows while maintaining state.
  • Rubric-based Grading: Quality stops being purely subjective once the criteria are defined clearly. The team uses a four-criteria rubric: Design, Originality, Craft, and Functionality.
  • Debugging via Traces: The primary way to improve agent performance is reading raw execution traces to identify where the model’s judgment diverges from human expectations, then adjusting prompts accordingly.
  • Progressive Disclosure: Load only the "front matter" of a skill into the context window to save tokens, and pull in the full body only when the skill is actually used.
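
To make the persistent-state and rubric ideas above concrete, here is a minimal sketch that stores progress in a JSON file and grades each feature against the four criteria from the talk. The file name `progress.json`, the state layout, the 1 to 5 scale, and the pass threshold of 3 are assumptions made for illustration; the talk names the criteria but not how they are scored or stored.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path

# The four criteria come from the talk; the scale and threshold are assumed.
CRITERIA = ("design", "originality", "craft", "functionality")


def load_state(path: Path) -> dict:
    """Read saved progress, or start empty if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"features": {}, "log": []}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file first so a crash cannot leave a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)


def grade(scores: dict, threshold: int = 3) -> bool:
    """Pass only if every criterion is scored and meets the threshold."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing rubric scores: {missing}")
    return all(scores[c] >= threshold for c in CRITERIA)


def record_result(path: Path, feature: str, scores: dict) -> dict:
    """Grade a feature and persist the outcome for later sessions."""
    state = load_state(path)
    passed = grade(scores)
    state["features"][feature] = {"scores": scores, "passed": passed}
    state["log"].append(f"{feature}: {'passed' if passed else 'failed'}")
    save_state(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    progress = Path(workdir) / "progress.json"
    record_result(progress, "login", {"design": 4, "originality": 3, "craft": 4, "functionality": 5})
    record_result(progress, "search", {"design": 4, "originality": 2, "craft": 4, "functionality": 5})
    # A later session with a fresh context window only needs to re-read the file.
    resumed = load_state(progress)
```

Since everything the next session needs lives in the file, the harness can start each session with an empty context window and still know which features passed and which need another attempt.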

4. Key Arguments and Perspectives

  • Harnesses are not dead: Even as models become more intelligent, the harness evolves rather than disappears. It fills the "gaps" in the model's current capabilities.
  • Empathy for the Model: Developers must "empathize" with the model by understanding its limitations (e.g., navigating a browser without seeing the full page).
  • Predictability over Randomness: It is better to fail predictably in a controlled loop than to succeed unpredictably.

5. Notable Quotes

  • "The frontier doesn't really shrink, it just moves." — Ash (referencing the evolving capabilities of models vs. the need for scaffolding).
  • "It's very easy for me to critique a lovely piece of artwork... much harder for me to actually go ahead and paint that." — Ash, explaining why a separate Evaluator is necessary.
  • "Better to fail predictably than it is to succeed unpredictably." — A core philosophy behind the Ralph Wiggum/looping technique.

6. Synthesis and Conclusion

The transition from simple, short-lived agent sessions to complex, multi-hour autonomous systems relies on decoupling roles (Planner, Generator, Evaluator) and enforcing adversarial feedback loops. While models are becoming more capable of managing their own context, the most effective current approach combines file-system-based state, clear rubric-based contracts, and rigorous trace analysis. The ultimate goal is to build systems that operate autonomously, with humans intervening only to refine the harness or the initial prompt rather than micromanaging the execution.
