Two Roads to Durable Agents: Replay vs. Snapshot — Eric Allam, Trigger.dev

By AI Engineer

Share:

Key Concepts

  • Durable Agents: AI agents capable of maintaining state across long-running tasks, system restarts, and code updates.
  • Shared Nothing Architecture: A traditional backend paradigm where compute is stateless, and all state is stored in a database.
  • Replay Model: A workflow execution pattern where side effects are cached; if a process fails, it resumes by replaying previous steps.
  • Context Durability: Storing the append-only log of LLM interactions (messages, tool calls, responses).
  • Execution Durability: Preserving the state of the compute environment (memory, file system, processes) via snapshotting.
  • Snapshot and Restore: A technique to save the entire state of a virtual machine to disk and resume it later, avoiding the need for continuous execution.
  • Firecracker MicroVMs: Lightweight virtualization technology used to isolate and snapshot entire machine states efficiently.
  • CRIU (Checkpoint/Restore in Userspace): A Linux tool for checkpointing processes, which the speaker notes has limitations compared to VM-level snapshots.

1. The Evolution of Backend Infrastructure

The speaker, Eric from Trigger.dev, outlines the history of backend architecture to explain why AI agents require a paradigm shift:

  • CGI (1993): Stateless; a new process is forked for every HTTP request.
  • LAMP Stack/Shared Nothing: The dominant model for 30 years. Compute is stateless, and state is offloaded to a database.
  • Workflow Engines (10–15 years ago): Introduced the "replay model" to handle multi-step side effects (e.g., charging a credit card). While effective for transactions, it creates rigid code structures and struggles with the long-running, session-based nature of AI agents.

2. The Agent Loop Challenge

Modern AI agents differ from traditional workflows because the LLM orchestrates the code, rather than the code orchestrating the LLM.

  • The Problem with Replay: If you apply the replay model to an agent loop, the execution log grows indefinitely as the agent interacts, eventually hitting performance limits.
  • The Shift: Agents are not "transactions"; they are "sessions" that need to persist as long as the user requires.

3. Framework for Durable Agents

To achieve true durability, the speaker proposes splitting the agent into two distinct layers:

  1. Context Layer (Append-only log): Stores all LLM inputs/outputs. This is durable across code versions and system crashes.
  2. Execution Layer (Snapshot/Restore): Stores the "machine" state (memory, open files, subprocesses). This allows the agent to "pause" during idle time (e.g., waiting for a user) and resume instantly without keeping the machine running, which is cost-prohibitive.

4. Technical Implementation: From CRIU to Firecracker

The speaker details the transition from process-level checkpointing to VM-level snapshots:

  • CRIU Limitations: It only captures processes, struggles with complex dependencies (like Chrome or FFmpeg), and requires the process to be "aware" of the checkpointing.
  • Firecracker MicroVMs: By snapshotting the entire VM, the system becomes agnostic to what is running inside.
  • Optimization: To solve the issue of large snapshot sizes (e.g., 512MB), the team implemented seekable compression. This allows the system to restore only the necessary memory pages on demand, reducing snapshot sizes to as low as 14MB.

5. Notable Statements

  • "For 30 years we sort of had this stateless compute as the core of back-end infrastructure. And I think agents are sort of forcing this move to become stateful compute."
  • "An agent isn't like a transaction, it's like a session."

6. Tools and Performance

  • FC Run (F Crun): An upcoming open-source, Docker-like CLI tool that enables snapshotting and restoring Firecracker VMs.
  • Performance Metrics: The speaker reports snapshot times under one second and restore times in the hundreds of milliseconds, with the ability to handle 15,000 VM starts per minute.

Synthesis/Conclusion

The transition from stateless to stateful compute is inevitable for the next generation of AI agents. By combining context durability (via append-only logs) and execution durability (via optimized Firecracker VM snapshots), developers can create agents that are resilient to failures, version updates, and long periods of inactivity. This approach moves beyond the limitations of traditional workflow replay models, enabling agents to perform complex, long-running tasks with the efficiency of a persistent session.

Chat with this Video

AI-Powered

Load the transcript when you're ready to chat so the initial page stays lighter.

Related Videos

Ready to summarize another video?

Summarize YouTube Video