Build Hour: Agent Memory Patterns

Key Concepts

Context Engineering: The discipline of systematically managing and optimizing the information provided to Large Language Models (LLMs) to improve their performance. It's described as both an art (requiring judgment) and a science (with repeatable methods).
Agent Memory Patterns: Techniques for enabling AI agents to retain and utilize information across multiple interactions or sessions, crucial for long-running and complex tasks.
Context Window: The finite limit of tokens an LLM can process at any given time.
Short-Term Memory (In-session Techniques): Strategies focused on maximizing the use of the context window during an active interaction.
Long-Term Memory (Cross-session Techniques): Strategies for building continuity and retaining information across multiple, distinct sessions.
Context Burst: A sudden, excessive increase in token usage within one or more components of an agent's context.
Context Conflict: Contradictory instructions or information present within the agent's context.
Context Poisoning: The introduction and propagation of incorrect or hallucinated information into the agent's context.
Context Noise: Redundant, overly similar, or excessive tool definitions or information that can dilute the signal within the context.
Reshape and Fit: Techniques to manage context within the window limits by trimming, compacting, or summarizing.
Isolate and Route: Offloading context and tools to specific sub-agents.
Extract and Retrieve: Methods for extracting, storing, and retrieving key memories.
RAG (Retrieval Augmented Generation): A pattern where retrieved knowledge is used to augment the generation process.
Vector Database: A database optimized for storing and querying high-dimensional vectors, often used for semantic search in RAG.

Introduction to Context Engineering

Context engineering is presented as a foundational element for agent memory. It's defined as both an art, requiring judgment on what information is most critical at any given step, and a science, involving concrete patterns, methods, and measurable impacts for systematic context management. The performance of modern LLMs is heavily dependent on the context provided, not just the model's inherent quality.

Context engineering encompasses various disciplines, including prompt engineering, structured output, RAG (Retrieval Augmented Generation), state management, and history management. Memory, utilizing persistent or semi-persistent storage like files or databases, is a crucial component within this broader sphere.

Core Principles and Strategies:

Why it matters: Long-running and tool-heavy agents can suffer from token limitations, leading to degraded quality through "poisoning," "noise," and "bursting."
Three Core Strategies:
1. Reshape and Fit: Adapting context to fit within the context window.
2. Isolate and Route: Directing the appropriate amount of context to the correct agent.
3. Extract and Retrieve: Obtaining high-quality memories for timely retrieval.
Prompt and Tool Hygiene: Keeping system prompts lean, clear, and well-structured; using a small, canonical set of few-shot examples; minimizing tool overlap; and ensuring effective tool selection.
Northstar Goal: Aiming for the smallest, high-signal context that maximizes the likelihood of the desired outcome.

Challenges in Agent Memory

The finite nature of the LLM context window presents a core bottleneck for agents. Every piece of information added to the prompt (instructions, conversation history, tool outputs) competes for limited token space.

Failure Modes:

Context Burst: A sudden spike in token usage in one or more components due to limited external control or increased calls.
Context Conflict: Contradictory instructions or information within the context.
Context Poisoning: Incorrect information entering and propagating through the context, potentially via summaries or injected memory objects.
Context Noise: An excessive number of redundant or overly similar tool definitions or items in the context.

Examples of Failure Modes:

Context Burst Visualization: A specific turn in a tool-heavy workflow shows a significant increase in tool tokens being injected.
Context Conflict Example: System instructions stating "never issue a refund if warranty status is not active" conflict with a middle-of-turn instruction about eligibility for VIP customers, leading to an agent offering a refund despite potential warranty issues.
Context Poisoning Example: Hallucinations or inaccurate information mixed into the context at any step can propagate across turns, potentially caused by lossy summarization edits or free-form notes that accumulate and contradict newer information.

Solutions and Engineering Techniques

The solution to context management challenges lies in efficiently managing context using techniques like trimming, compaction, state management, and memory. These go beyond basic prompt engineering.

Context Profiles for AI Agents:

RAG-Heavy Assistants: Context dominated by retrieved knowledge and citations (e.g., policy QA agents).
Tool-Heavy Workflows: Context dominated by frequent tool calls and returned payloads.
Conversational Concierges: Context dominated by growing dialogue history, with assistant usage tokens scaling with session length.

Static vs. Dynamic Context:

Static: System instructions, tool definitions, and examples (unless using RAG).
Dynamic: Tool results, retrieved knowledge, memories, and conversation history. These are the areas where context management techniques are applied.

Prompting Best Practices to Avoid Context Conflict and Noise:

Be Explicit and Structured: Use clear, direct language and be specific enough to guide action.
Allow for Planning and Self-Reflection: Increasingly important for reasoning models.
Avoid Conflicts: Keep toolsets small and non-overlapping. Avoid ambiguous definitions.
Favor Targeted Tools: Prefer tools with clear decision boundaries.
Return Meaningful Context: Tools should return high-signal, semantically useful fields and human-readable identifiers.

Reshape and Fit Techniques

These techniques focus on managing context within the available window.

Context Trimming:
- Description: Dropping older conversation turns while retaining the last 'n' turns.
- Benefit: Provides fresh context with better attention, reduces noise, and can increase latency.
- Parameters: Control over the number of turns to keep.
- Heuristics: Analyze sessions, avoid trimming mid-turn, and don't wait to hit context limits (use thresholds like 40-80%).
Context Compaction:
- Description: Dropping tool calls or tool call results from older turns while keeping other messages.
- Benefit: Reduces context size, especially in tool-heavy agents, leading to better attention and faster processing. Tool placeholders are kept intact.
- Heuristics: Similar to trimming, analyze sessions, avoid mid-turn breaks, and use thresholds.
Context Summarization:
- Description: Compressing prior messages into structured summaries and injecting them into the context history as memory objects.
- Benefit: Preserves valuable information by compressing it, leading to fresh context, better attention, and faster processing. Creates a "golden summary" of valuable information.
- Comparison to Trimming:
  - Trimming: Faster, no latency, but might lose information. Best for tool-heavy ops and short workflows.
  - Summarization: Can add latency and cost (due to summarization calls), but preserves all information. Suitable for long-running agents where tasks are dependent and information across turns is crucial.

Isolate and Route Techniques

Tool Offloading to Sub-agents: Assigning specific context and tools to dedicated sub-agents. This minimizes context conflict and poisoning by routing information appropriately.

Extract and Retrieve Techniques

These techniques focus on managing and accessing memories.

Memory Extraction: Using a memory tool to extract memories in live turns, storing them in structured formats (e.g., JSON, markdown).
State Management: Defining a state object with goals and information, which can be injected back into the system prompt across turns or in new sessions.
Memory Retrieval: Performing memory retrieval using a tool, similar to a RAG approach. Memories are stored in a long-term store (e.g., vector DB) and retrieved during live turns through search, filtering, and ranking.

Demo Walkthrough and Examples

The demo showcases an IT troubleshooting agent with two agents running side-by-side.

Initial State: Agents respond to basic greetings and initial issues without memory. Context usage bars show accumulating tokens from system instructions, user input, and agent output.
Context Burst Example:
- User reports an overheating issue.
- User then asks for a refund policy for a specific product.
- The agent makes a tool call (get_refund) and returns a detailed refund policy.
- Observation: A significant spike in token count occurs between turns (e.g., from ~300-400 tokens to over 3000 tokens) due to dumping the entire refund policy into the context. This illustrates context burst. The suggestion is to be more selective about what information is injected.
Reshape and Fit Demo (Trimming):
- Agent P is configured with trimming enabled (max 3 turns).
- User asks about refund policy, then order status.
- User then reports an internet connection issue.
- Observation: At the end of turn six, the context is trimmed, removing older tool outputs and tokens, providing a fresh context for the ongoing internet issue.
Reshape and Fit Demo (Summarization):
- Agent is configured with summarization triggered at turn five, keeping recent three turns.
- User provides extensive details about their MacBook Pro, purchase location, OS updates, and troubleshooting steps (hard reset, FAQ checks).
- Observation: After turn five, a "memory item" (summarized context) appears in orange in the context lifecycle. The agent provides a well-structured response based on the summarized information.
- Summary Prompt: The demo shows a crafted summary prompt for a senior customer support assistant, emphasizing temporal ordering, hallucination control, and including structured factual summaries covering product environment, reported issues, what worked/didn't work, steps tried, identifiers, timelines, tool performance, current status, and next recommended steps.
Long-Term Memory Demo (Cross-Session):
- A summary from a previous session is injected into the system prompt for a new session.
- Observation: The agent on the right responds with a personalized greeting, referencing the previous internet connection issue and the MacBook model. This demonstrates continuity across sessions.
- Memory Instructions: The demo highlights memory instructions that manage the memory object, providing precedence rules, avoiding over-reliance on memory, and adding guardrails against storing secrets.

Memory Shapes and Extraction

Shape of Memory: Start simple (e.g., structured formats, prioritizing what a human would remember) and evolve as needed. The most complex form is a paragraph of memory.
Extraction: Use memory tools to extract memories in live turns, storing them in structured formats like JSON (1-2 sentence notes), using type-safe functions, or markdown.
State Management: Define a state object with goals and information, which can be injected back into the system prompt across turns or in new sessions.
Retrieval: Perform memory retrieval with a tool, similar to RAG, by storing memories in a long-term store (e.g., vector DB) and retrieving them during live turns.

Best Practices and Conclusion

Understand Your Typical Context: Define what is meaningful for your agent.
Decide When and How to Remember and Forget: Promote stable facts to memory, and forget temporarily stale or low-confidence information. Memories should evolve over time; continuously clean, merge, and consolidate them through iterative optimization.
Evals are Crucial: Run your own evaluations to measure improvements with memory enabled versus disabled. Build memory-specific evaluations for long-running tasks and long contexts.

Overall Takeaway: The core idea of agent memory is to better understand what an agent should remember, how it should remember, and how it should forget. Finding the right balance between various techniques is key and depends on the specific use case. This is an evolving field with new features anticipated.

Resources

Context Engineering Cookbook
Context Summarization Cookbook
Agents Python SDK
Full build hour repo on GitHub

Q&A Highlights

Libraries/Packages: OpenAI Agents SDK is recommended as a starting point for implementing context engineering techniques.
Evaluation/Measurement:
- Run regular evals comparing memory on/off to see metric uplifts (completeness, downloads).
- Build memory-specific evals for long-running tasks and long contexts.
- Evaluate summary quality, injection time, and injection prompt.
- Prepare golden datasets and experiment with heuristics for trimming/compaction.
Hierarchical Context (e.g., Project vs. File Edit): Yes, use hierarchical context. Implement "memory scope" (global for user-wide facts, session-based for current interaction). Session memories can graduate to global memories over time.
Keeping Memory Fresh/Pruning:
- Use temporal tags (timestamps) to track when memories were learned, allowing the model to understand what is old vs. new and override stale information.
- Implement decay or window functions to focus on recent memories and downgrade older ones.
- Memory consolidation and override with temporal tags are key.
Scaling Memory Systems (Many Users, Individual/Shared Pools):
- Retrieval/Search-based: Scale vector databases (sharding, embedding model optimization, retrieval process optimization).
- Summarization-based: Scale data management for large amounts of text.
- Pilot Approach: Test new memory techniques with a subgroup of users.
- Understand Memory Evolution: The complexity of memory pools varies by agent type (e.g., travel concierge vs. life coach). Life coach agents require more sophisticated scaling due to the vast amount of personal information.