From Stateless Nightmares to Durable Agents — Samuel Colvin, Pydantic

Key Concepts

Durable Execution Frameworks: Systems designed to manage long-running, potentially interrupted, computational tasks, ensuring that progress is not lost and can be resumed.
Temporal: An open-source, durable execution framework that provides reliable workflow orchestration.
Pydantic AI: A framework for building AI agents and applications, integrating with various LLMs and tools.
Pydantic Logfire: A logging and observability tool for Pydantic AI applications.
Pydantic Evals: A tool for evaluating and comparing the performance of different AI models.
Workflows and Activities (Temporal): Temporal's core concepts where workflows are deterministic sequences of operations, and activities are non-deterministic operations (like I/O) that are executed by workers.
Agent: In the context of AI, an entity that can perceive its environment, make decisions, and take actions, often by calling tools or LLMs.
Tool Calling: The ability of an LLM to invoke external functions or services to perform specific tasks.
Deep Research: A complex AI task involving multiple steps like web searching, data analysis, and summarization to answer intricate questions.
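Task Group (Python): A mechanism, such as asyncio.TaskGroup, for running multiple asynchronous tasks concurrently and waiting for all of them to finish.
Caching (Temporal): Temporal records the result of each completed activity, so when a workflow is replayed those results are returned from its history instead of being recomputed, which speeds up resumption.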

Pydantic AI Temporal and Logfire Demo

This presentation demonstrates the integration of Pydantic AI with durable execution frameworks, specifically Temporal, and showcases the observability capabilities of Pydantic Logfire. The core problem addressed is the unreliability of long-running AI tasks, where losing progress due to system failures or scaling events is costly and frustrating for users.

1. The Problem: Unreliable Long-Running AI Tasks

LLM Applications: While simple LLM queries often work without issue, longer-running workflows, especially those involving significant computation or user interaction, are prone to failure.
Cost of Lost Compute: When a long-running task fails, the invested compute time and resources are lost, necessitating a restart from scratch.
Real-World Examples: Companies like OpenAI reportedly use Temporal for their deep research in LLMs, highlighting the need for robust execution in demanding AI applications.

2. Toy Example: 20 Questions Game

Scenario: Two AI agents play a game of 20 questions. One agent (Answer Agent) has a secret object (e.g., a potato) and answers questions with nuanced responses like "yes," "kind of," "not really," or "no." The other agent (Questioner Agent) asks questions to guess the object.
Agents:
- Answer Agent: Uses a smaller model (e.g., Haiku 3.5) to provide answers based on the secret object and the question.
- Questioner Agent: Uses its context and calls an ask_question tool to interact with the Answer Agent.
Initial Implementation (Without Durable Execution):
- The game runs, and the Questioner Agent eventually guesses "potato."
- Problem: If the process dies (due to unreliable endpoints, Kubernetes scaling, etc.), the entire game must restart from the beginning, which is problematic for longer games.
Equivalence to Deep Research: The speaker argues that this simple game is conceptually similar to deep research, where an agent "quests" for an answer by interacting with intermediate steps (like web searches or RAG) analogous to asking riddles.
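The shape of the non-durable game loop can be sketched without any LLM. Everything below is a hypothetical stand-in: the lookup table replaces the Answer Agent, the fixed question list replaces the Questioner Agent, and the names answer_agent, GameState, and play are illustrative rather than taken from the talk. The point it shows is that all progress lives in process memory, so a crash partway through loses the whole transcript.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the Answer Agent: a lookup table instead of an LLM.
FACTS = {
    "is it alive?": "not really",
    "is it a vegetable?": "yes",
    "does it grow underground?": "yes",
    "is it a potato?": "yes",
}


def answer_agent(question: str) -> str:
    """Answer one question about the secret object; unknown questions get 'no'."""
    return FACTS.get(question.lower(), "no")


@dataclass
class GameState:
    # The transcript exists only in memory: if the process dies, it is gone.
    transcript: list[tuple[str, str]] = field(default_factory=list)


def play(secret: str, questions: list[str], max_questions: int = 20) -> GameState:
    state = GameState()
    for question in questions[:max_questions]:
        # In the talk this step is a tool call made by the Questioner Agent.
        answer = answer_agent(question)
        state.transcript.append((question, answer))
        if question.lower() == f"is it a {secret}?" and answer == "yes":
            break  # correct guess, stop asking
    return state


state = play(
    "potato",
    [
        "Is it alive?",
        "Is it a vegetable?",
        "Does it grow underground?",
        "Is it a potato?",
        "Is it a carrot?",
    ],
)
for question, answer in state.transcript:
    print(f"{question} -> {answer}")
```

The loop stops after the fourth question, so the final "carrot" question is never asked.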

3. Introducing Temporal for Durable Execution

Concept: Temporal provides durable execution by distinguishing between deterministic workflows and non-deterministic activities.
- Workflows: Must be entirely deterministic (no I/O, no random.choice). Temporal records every activity call (inputs and outputs) within a workflow.
- Activities: Handle non-deterministic operations like I/O (e.g., calling an LLM, making API requests).
Temporal's Mechanism: Temporal runs workflows, records their execution history, and can replay past activities with cached results if a workflow is resumed. This effectively provides built-in caching for I/O operations.
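The record-and-replay idea can be illustrated with a deliberately simplified sketch in plain Python. This is not Temporal's implementation or API: the History class, the JSON file, and the position-based matching of calls to recorded results are all assumptions made for illustration. It only shows why the workflow must be deterministic: on a rerun, the same sequence of activity calls lines up with the recorded results, so completed work is not repeated.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


class History:
    """Append-only record of activity results, persisted to a JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.events: list = json.loads(path.read_text()) if path.exists() else []
        self.cursor = 0  # index of the next activity call in this run

    def run_activity(self, fn: Callable[..., object], *args: object) -> object:
        if self.cursor < len(self.events):
            # Replay: a result was recorded on an earlier run, so skip the call.
            result = self.events[self.cursor]
        else:
            # First execution: do the real, non-deterministic work and record it.
            result = fn(*args)
            self.events.append(result)
            self.path.write_text(json.dumps(self.events))
        self.cursor += 1
        return result


calls = 0  # counts how often the "LLM" is really invoked


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (an activity)."""
    global calls
    calls += 1
    return f"answer to {prompt!r}"


def workflow(history: History) -> list:
    # Deterministic: issues the same activity calls in the same order every run.
    return [history.run_activity(call_llm, q) for q in ("q1", "q2", "q3")]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "history.json"
    first = workflow(History(path))
    calls_after_first = calls
    # Simulate a restart: a fresh History object loads the persisted events.
    second = workflow(History(path))
    calls_after_second = calls
```

The second run returns the same results without invoking call_llm again, which is the "built-in caching" effect described above.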
Pydantic AI Integration:
- Agents are wrapped with temporal_agent to make them compatible with Temporal.
- The temporal_agent handles the conversion of LLM calls and tool usage into Temporal activities.
- Challenge: OpenAI's Temporal support reportedly doesn't handle tool calls as activities, which the speaker considers a significant limitation.
Temporal Setup:
- Temporal Server: Runs locally (open-source version) or in the cloud (for production).
- Worker: Executes the Temporal activities.
- Execution: Workflows are initiated using execute_workflow.

4. Demo: 20 Questions with Temporal

Simulating Unreliability: A 20% chance of failure was introduced within the ask_question tool to simulate an unreliable system.
Temporal's Resilience:
- When the simulated failure occurred, Temporal automatically handled retries and continued execution.
- Even if the entire process was killed (e.g., by Kubernetes), Temporal could resume the workflow from where it left off.
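The retry behaviour can be approximated by hand to see what Temporal takes off the developer's plate. The sketch below is an assumption-laden stand-in, not Temporal's retry policy: retry, ToolError, and make_flaky_tool are hypothetical names, and the failures are scripted rather than drawn at the 20% rate used in the demo, so that the outcome is reproducible.

```python
import time
from typing import Callable


class ToolError(Exception):
    """Raised by the simulated tool when it fails."""


def retry(fn: Callable[[], str], max_attempts: int = 5, base_delay: float = 0.0) -> tuple[str, int]:
    """Call fn until it succeeds; return its result and the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), attempt
        except ToolError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            # Exponential backoff between attempts (zero by default here).
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def make_flaky_tool(failures: list[bool]) -> Callable[[], str]:
    """Build a hypothetical ask_question tool that fails according to a script."""
    outcomes = iter(failures)

    def ask_question() -> str:
        if next(outcomes, False):
            raise ToolError("simulated failure")
        return "kind of"

    return ask_question


# Two scripted failures followed by a success.
answer, attempts = retry(make_flaky_tool([True, True, False]))
print(answer, attempts)
```

A hand-rolled loop like this only survives failures inside a live process. It does nothing if the process itself is killed, which is the gap durable execution is meant to close.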
Resumption Mechanism:
- Temporal stores the workflow state.
- By providing a resume_id (or simply rerunning the script with the same workflow ID), the workflow can be restarted and will automatically pick up from the last completed activity.
- Caching in Action: When resuming, Temporal replays past activities. If the results are cached, they are returned instantly (e.g., in 5 milliseconds), bypassing the actual LLM calls and saving time. This is akin to having granular caching on every I/O operation.
Logfire Integration: Pydantic Logfire provides visibility into the workflow execution, showing calls to LLMs, activities, and the cached replay of past operations.

5. Pydantic Evals: Model Performance Comparison

Purpose: Pydantic Evals was used to compare the performance of different LLMs (GPT-4.1, Gemini, Claude Sonnet 4.5) on the 20 Questions task.
Metrics:
- Cost: Average cost per run.
- Speed: Number of steps (questions) taken to succeed.
Findings (Initial Naive Analysis):
- Gemini appeared significantly faster and cheaper.
- Caveat: It was later discovered that Gemini was "inventing" answers, leading to a false impression of better performance. This highlights the importance of robust evaluation metrics and checks.
Conclusion: Evals are crucial for understanding true model performance and identifying potential biases or shortcuts.
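One way to guard against the kind of misleading result described above is to report cost and speed alongside a correctness check rather than on their own. The sketch below does not use the Pydantic Evals API. The Run record, the summarise function, the model names, and every number are placeholders invented for illustration, not results from the talk, and checking the final guess against the secret is only one possible check.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    model: str
    cost_usd: float
    steps: int        # questions asked before the final guess
    final_guess: str


def summarise(runs: list[Run], secret: str) -> dict[str, dict]:
    """Aggregate per-model cost and speed, counting speed only for correct runs."""
    by_model: dict[str, list[Run]] = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)

    report = {}
    for model, model_runs in by_model.items():
        correct = [r for r in model_runs if r.final_guess == secret]
        report[model] = {
            "runs": len(model_runs),
            "accuracy": len(correct) / len(model_runs),
            "avg_cost_usd": mean(r.cost_usd for r in model_runs),
            # A fast run that ends on the wrong answer should not count as fast.
            "avg_steps_when_correct": mean(r.steps for r in correct) if correct else None,
        }
    return report


# Placeholder data only: hypothetical models and made-up numbers.
runs = [
    Run("model-a", 0.02, 12, "potato"),
    Run("model-a", 0.04, 16, "potato"),
    Run("model-b", 0.01, 5, "carrot"),
    Run("model-b", 0.01, 7, "potato"),
]
report = summarise(runs, secret="potato")
for model, stats in report.items():
    print(model, stats)
```

In this made-up data, model-b looks cheaper and faster on averages alone, but its accuracy column shows that half of its runs ended on the wrong answer.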

6. Deep Research Case Study

Scenario: A more complex, real-world application simulating "deep research" to answer a query like "Find me a list of hedge funds that write Python in London."
Agent Architecture:
- Plan Agent: Takes the initial query and generates a structured plan (a Pydantic model) for the research.
- Search Agent(s): Execute web searches based on the plan. Multiple search agents can run in parallel.
- Analysis Agent: Consolidates the search results and performs a final analysis to answer the query.
Implementation Details:
- Uses Pydantic models for structured data.
- Leverages a Python task group (e.g., asyncio.TaskGroup) for parallel execution of search agents.
- Uses format_as_xml to structure data for the analysis agent.
- Cost Tracking: Logfire shows the cost incurred during the run (e.g., 8 cents).
Problem without Durable Execution: If the deep research process is interrupted, all progress is lost, and the entire multi-step process must be restarted.

7. Deep Research with Temporal

Integration: The same deep research logic is adapted for Temporal.
- Agents are wrapped with temporal_agent.
- The analysis agent's activity duration is increased (e.g., to an hour) to prevent timeouts on long tasks.
- Key Benefit: The core workflow logic (parallelism using a task group, sequential steps) remains identical to the non-durable version. Temporal handles the underlying durable execution.
Execution and Resumption:
- When the workflow is executed, Temporal manages the execution.
- If the process is killed, Temporal automatically resumes the workflow.
- Instant Replay: Past completed activities (like the initial plan generation and parallel searches) are replayed instantly from Temporal's cache. Only the unfinished activities (like the final analysis) need to be re-executed.
Outcome: The deep research query is answered, with the system recommending "Pydantic AI with Temporal" as the best Python agent framework for durable execution and type safety. The output includes an executive summary and a breakdown of trade-offs.

8. Future Announcements

Pydantic AI Gateway: A new platform is coming soon that will allow users to:
- Purchase inference from various LLMs (big and open-source).
- Self-host for enterprise needs.
- Utilize observability features.
Early Access: Interested users are encouraged to reach out for early access.

9. Conclusion and Call to Action

Main Takeaway: Temporal, integrated with Pydantic AI, provides a robust solution for building and running reliable, long-running AI applications by ensuring durable execution and seamless resumption.
Observability: Pydantic Logfire is essential for monitoring and understanding the execution of these complex workflows.
Further Information: Users are encouraged to scan QR codes for more information on Pydantic AI, Pydantic AI Gateway, and Pydantic Logfire, and to provide feedback.