Memory and State in Claude Agents: Patterns That Scale
Claude agents don't have persistent memory between API calls — each call starts fresh. Adding memory means deciding what to store, where to store it, and how much to bring back into context on the next call. The four patterns that cover 90% of production needs are: conversation history (in-context), summary compression (compressed context), external memory (vector search), and explicit state (structured data). This guide covers when to use each and how to implement them.
The Memory Problem
# Call 1
client.messages.create(messages=[{"role": "user", "content": "My name is Alex"}])
# Claude: "Hi Alex!"
# Call 2 — Claude has no memory of call 1
client.messages.create(messages=[{"role": "user", "content": "What's my name?"}])
# Claude: "I don't know your name."
Every conversation must carry its own context. The question is how much and in what form.
Pattern 1: Full Conversation History (In-Context)
The simplest approach — append every turn to a running messages list.
import anthropic
client = anthropic.Anthropic()
class ConversationAgent:
"""Agent that maintains full conversation history in context."""
def __init__(self, system: str, max_turns: int = 50):
self.system = system
self.messages = []
self.max_turns = max_turns
def chat(self, user_message: str) -> str:
self.messages.append({"role": "user", "content": user_message})
# Trim if approaching limits (rough estimate: 2k tokens per turn)
if len(self.messages) > self.max_turns * 2:
self.messages = self.messages[-self.max_turns * 2:]
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=self.system,
messages=self.messages
)
assistant_message = response.content[0].text
self.messages.append({"role": "assistant", "content": assistant_message})
return assistant_message
def clear(self):
self.messages = []
@property
def token_estimate(self) -> int:
"""Rough estimate of context token usage."""
total_chars = sum(len(m["content"]) for m in self.messages)
return total_chars // 4 # ~4 chars per token
# Usage
agent = ConversationAgent(
system="You are a helpful coding assistant. Remember context from earlier in the conversation."
)
print(agent.chat("I'm building a user authentication system in FastAPI"))
print(agent.chat("What's the best way to handle JWT refresh tokens?"))
print(agent.chat("Apply that pattern to the auth module we discussed"))
# Claude remembers the FastAPI context from earlier
When to use: Sessions under 30 minutes, 10-30 turns, topics that all relate to each other.
Limit: At ~100K tokens of history, quality degrades and costs increase substantially. A 50-turn conversation at 2K tokens/turn = 100K input tokens per API call = ~$0.30 per call at Sonnet pricing.
Pattern 2: Summary Compression
When conversation history grows long, compress older turns into a summary and keep only recent turns verbatim.
from dataclasses import dataclass
@dataclass
class CompressedHistory:
summary: str # Compressed summary of old turns
recent_messages: list # Last N turns verbatim
turns_summarized: int
class SummaryCompressionAgent:
"""Agent that compresses old conversation turns into summaries."""
def __init__(self, system: str, verbatim_turns: int = 10, compress_every: int = 20):
self.system = system
self.messages = []
self.summary = ""
self.verbatim_turns = verbatim_turns # Keep this many turns uncompressed
self.compress_every = compress_every # Compress after this many total turns
def _compress_history(self):
"""Summarize old turns, keep recent ones verbatim."""
turns_to_compress = self.messages[:-self.verbatim_turns * 2]
recent = self.messages[-self.verbatim_turns * 2:]
if not turns_to_compress:
return
# Summarize old turns
turns_text = "\n".join(
f"{m['role'].upper()}: {m['content']}" for m in turns_to_compress
)
response = client.messages.create(
model="claude-haiku-4-5", # Use cheaper model for summarization
max_tokens=500,
messages=[{
"role": "user",
"content": f"""Summarize this conversation context into 3-5 bullet points.
Focus on: decisions made, key facts established, current task state.
CONVERSATION:
{turns_text}
Output bullet points only."""
}]
)
new_summary = response.content[0].text
if self.summary:
self.summary = f"{self.summary}\n\n[Later:] {new_summary}"
else:
self.summary = new_summary
self.messages = recent
def chat(self, user_message: str) -> str:
self.messages.append({"role": "user", "content": user_message})
# Compress if needed
if len(self.messages) > self.compress_every * 2:
self._compress_history()
# Build system with summary context
system_with_context = self.system
if self.summary:
system_with_context += f"\n\n## Earlier conversation summary\n{self.summary}"
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=system_with_context,
messages=self.messages
)
assistant_message = response.content[0].text
self.messages.append({"role": "assistant", "content": assistant_message})
return assistant_message
When to use: Long sessions (1+ hour), 30-100+ turns, ongoing work where you need to reference earlier decisions.
Trade-off: Summary loses detail. Good for "remember we decided to use Postgres" — bad for "repeat my exact code from 40 turns ago."
Pattern 3: External Memory with Vector Search
Store facts and embeddings in a vector database, retrieve relevant memories at query time.
import json
import numpy as np
from pathlib import Path
class VectorMemory:
"""Simple in-memory vector store for agent memories."""
def __init__(self, persist_path: str = None):
self.memories = [] # [{"text": ..., "embedding": [...], "metadata": {...}}]
self.persist_path = persist_path
if persist_path and Path(persist_path).exists():
self._load()
def add(self, text: str, metadata: dict = None):
"""Add a memory with its embedding."""
embedding = self._embed(text)
self.memories.append({
"text": text,
"embedding": embedding,
"metadata": metadata or {}
})
if self.persist_path:
self._save()
def search(self, query: str, top_k: int = 5) -> list[dict]:
"""Retrieve top-k most relevant memories."""
if not self.memories:
return []
query_embedding = self._embed(query)
# Cosine similarity
scores = []
for memory in self.memories:
sim = self._cosine_similarity(query_embedding, memory["embedding"])
scores.append((sim, memory))
scores.sort(key=lambda x: x[0], reverse=True)
return [m for _, m in scores[:top_k]]
def _embed(self, text: str) -> list[float]:
"""Get embedding for text using Claude's embedding approach.
In production: use voyage-3 or text-embedding-3-small."""
# Simplified: use Claude to create a hash-like summary vector
# Real implementation should use an embedding API
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=100,
messages=[{
"role": "user",
"content": f"Summarize in exactly 20 keywords, comma-separated: {text[:500]}"
}]
)
keywords = response.content[0].text.split(',')
# Create simple bag-of-words vector (use real embeddings in production)
return [hash(kw.strip().lower()) % 1000 / 1000.0 for kw in keywords[:20]]
def _cosine_similarity(self, a: list, b: list) -> float:
a_arr = np.array(a)
b_arr = np.array(b)
if np.linalg.norm(a_arr) == 0 or np.linalg.norm(b_arr) == 0:
return 0.0
return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))
def _save(self):
Path(self.persist_path).write_text(json.dumps(self.memories))
def _load(self):
self.memories = json.loads(Path(self.persist_path).read_text())
class MemoryAgent:
"""Agent that stores and retrieves facts from vector memory."""
def __init__(self, system: str, memory_path: str = "/tmp/agent_memory.json"):
self.system = system
self.memory = VectorMemory(persist_path=memory_path)
self.messages = []
def remember(self, fact: str, metadata: dict = None):
"""Explicitly store a fact in memory."""
self.memory.add(fact, metadata)
def chat(self, user_message: str) -> str:
# Retrieve relevant memories
relevant_memories = self.memory.search(user_message, top_k=5)
memory_context = ""
if relevant_memories:
memory_context = "\n\n## Relevant memories from earlier sessions\n" + "\n".join(
f"- {m['text']}" for m in relevant_memories
)
self.messages.append({"role": "user", "content": user_message})
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=self.system + memory_context,
messages=self.messages[-20:] # Keep last 20 turns
)
response_text = response.content[0].text
# Auto-extract facts from conversation to remember
self._auto_remember(user_message, response_text)
self.messages.append({"role": "assistant", "content": response_text})
return response_text
def _auto_remember(self, user_msg: str, assistant_msg: str):
"""Extract and store important facts from the conversation."""
# Use Claude to decide what's worth remembering
extraction = client.messages.create(
model="claude-haiku-4-5",
max_tokens=200,
messages=[{
"role": "user",
"content": f"""From this conversation exchange, extract facts worth remembering for future sessions.
Output as JSON array of strings, or empty array [] if nothing worth saving.
Only save specific, factual information (preferences, decisions, facts about the user's system).
USER: {user_msg[:300]}
ASSISTANT: {assistant_msg[:300]}
JSON array only:"""
}]
)
try:
import re
json_match = re.search(r'\[.*\]', extraction.content[0].text, re.DOTALL)
if json_match:
facts = json.loads(json_match.group())
for fact in facts[:3]: # Max 3 facts per turn
if isinstance(fact, str) and len(fact) > 10:
self.memory.add(fact)
except (json.JSONDecodeError, AttributeError):
pass
When to use: Customer support agents that need to remember user preferences across sessions, personal assistants with long-term context, knowledge management agents.
Cost: Embedding generation adds API calls. For production, use Voyage AI or OpenAI embeddings instead of Claude for embedding generation — 10x cheaper.
Pattern 4: Explicit State (Structured Data)
For task-oriented agents, maintain explicit state as a Python object. Don't rely on the LLM to "remember" task status — track it in code.
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import json
class TaskStatus(Enum):
PENDING = "pending"
IN_PROGRESS = "in_progress"
BLOCKED = "blocked"
DONE = "done"
@dataclass
class Task:
id: str
description: str
status: TaskStatus = TaskStatus.PENDING
result: Optional[str] = None
subtasks: list = field(default_factory=list)
@dataclass
class AgentState:
"""Explicit state for a task-executing agent."""
goal: str
tasks: list[Task] = field(default_factory=list)
facts: dict = field(default_factory=dict) # key facts discovered
turn: int = 0
completed: bool = False
def add_fact(self, key: str, value: str):
self.facts[key] = value
def to_context_string(self) -> str:
"""Serialize state to inject into Claude's context."""
task_lines = []
for t in self.tasks:
prefix = {"pending": "⏳", "in_progress": "🔄", "blocked": "⛔", "done": "✅"}[t.status.value]
task_lines.append(f"{prefix} {t.id}: {t.description}")
if t.result:
task_lines.append(f" Result: {t.result[:100]}")
facts_str = "\n".join(f"- {k}: {v}" for k, v in self.facts.items())
return f"""## Current State (Turn {self.turn})
Goal: {self.goal}
Tasks:
{chr(10).join(task_lines) if task_lines else '(none defined yet)'}
Known facts:
{facts_str if facts_str else '(none yet)'}"""
class StatefulAgent:
"""Agent that maintains explicit structured state."""
def __init__(self, goal: str):
self.state = AgentState(goal=goal)
self.messages = []
def run(self, max_turns: int = 20) -> AgentState:
tools = [
{
"name": "update_task_status",
"description": "Update the status and result of a task",
"input_schema": {
"type": "object",
"properties": {
"task_id": {"type": "string"},
"status": {"type": "string", "enum": ["pending", "in_progress", "blocked", "done"]},
"result": {"type": "string", "description": "Task result or reason for block"}
},
"required": ["task_id", "status"]
}
},
{
"name": "add_fact",
"description": "Store an important fact discovered during task execution",
"input_schema": {
"type": "object",
"properties": {
"key": {"type": "string"},
"value": {"type": "string"}
},
"required": ["key", "value"]
}
},
{
"name": "mark_complete",
"description": "Mark the overall goal as complete",
"input_schema": {
"type": "object",
"properties": {
"summary": {"type": "string"}
},
"required": ["summary"]
}
}
]
# Initial planning message
self.messages.append({
"role": "user",
"content": f"Plan and execute: {self.state.goal}"
})
while self.state.turn < max_turns and not self.state.completed:
self.state.turn += 1
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=2048,
system=f"""You are a task execution agent. Track your state using the provided tools.
{self.state.to_context_string()}""",
tools=tools,
messages=self.messages
)
self.messages.append({"role": "assistant", "content": response.content})
if response.stop_reason == "end_turn":
break
if response.stop_reason == "tool_use":
tool_results = []
for block in response.content:
if block.type == "tool_use":
result = self._handle_tool(block.name, block.input)
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": result
})
self.messages.append({"role": "user", "content": tool_results})
return self.state
def _handle_tool(self, name: str, inputs: dict) -> str:
if name == "update_task_status":
task_id = inputs["task_id"]
task = next((t for t in self.state.tasks if t.id == task_id), None)
if not task:
task = Task(id=task_id, description=task_id)
self.state.tasks.append(task)
task.status = TaskStatus(inputs["status"])
if "result" in inputs:
task.result = inputs["result"]
return f"Task {task_id} updated to {inputs['status']}"
elif name == "add_fact":
self.state.add_fact(inputs["key"], inputs["value"])
return f"Fact stored: {inputs['key']}"
elif name == "mark_complete":
self.state.completed = True
return "Goal marked complete"
return "Unknown tool"
Choosing the Right Pattern
| Scenario | Pattern |
|---|---|
| Single-session chatbot (< 1hr) | Full conversation history |
| Long coding session (1-3hr) | Summary compression |
| Multi-session personal assistant | External memory |
| Task execution agent | Explicit state |
| Production support agent | External memory + explicit state |
Frequently Asked Questions
How do I persist memory across Claude Code sessions? Write memory to disk: a JSON file, SQLite database, or Redis. Load it at the start of each session. Claude Code's auto-memory in CLAUDE.md works for project context; for user-specific memory, you need explicit storage.
What's the cheapest way to add memory to an agent? Summary compression using claude-haiku-4-5 for the summary generation step. It costs ~$0.001 per summary and keeps your main model context small, reducing per-call costs.
Should I use OpenAI embeddings or Voyage AI for external memory? Voyage AI (voyage-3) is purpose-built for retrieval and outperforms both OpenAI ada-002 and OpenAI's newer models on most code and technical text benchmarks. Voyage-3 at $0.06/M tokens is also cheaper than text-embedding-3-large.
How many memories can an agent hold before retrieval quality degrades? Vector search stays effective up to ~10,000 memories. At 50,000+, consider adding metadata filters (by date, topic, user) before the vector search step to narrow the search space.
Related Guides
- Claude Agent SDK: Build Automation Agents — SDK fundamentals
- Claude Agent Observability — Tracking token usage by session
- How to Handle Errors and Retries — Error recovery
Go Deeper
Agent SDK Cookbook — $49 — Full memory implementations: Postgres-backed conversation history with automatic pruning, Voyage AI embedding integration, Redis session state, and multi-user memory isolation patterns.
→ Get the Agent SDK Cookbook — $49
30-day money-back guarantee. Instant download.