Memory and State in Claude Agents: Patterns That Scale

How to add persistent memory to Claude agents: conversation history management, external memory with embeddings, and session state patterns.

Claude agents don't have persistent memory between API calls — each call starts fresh. Adding memory means deciding what to store, where to store it, and how much to bring back into context on the next call. The four patterns that cover 90% of production needs are: conversation history (in-context), summary compression (compressed context), external memory (vector search), and explicit state (structured data). This guide covers when to use each and how to implement them.


The Memory Problem

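import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Call 1 (model and max_tokens are required on every call)
client.messages.create(model="claude-sonnet-4-5", max_tokens=1024,
                       messages=[{"role": "user", "content": "My name is Alex"}])
# Claude: "Hi Alex!"

# Call 2: Claude has no memory of call 1
client.messages.create(model="claude-sonnet-4-5", max_tokens=1024,
                       messages=[{"role": "user", "content": "What's my name?"}])
# Claude: "I don't know your name."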

Every conversation must carry its own context. The question is how much and in what form.
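
To make "carrying context" concrete, here is a minimal sketch of how the messages list for the second call could be assembled so the first exchange travels with it. The helper name `build_request_messages` is hypothetical, and no API call is made; the assistant reply is hard-coded for illustration.

```python
def build_request_messages(history: list[dict], user_message: str) -> list[dict]:
    """Return the messages list for the next call: prior turns plus the new one.

    `history` holds {"role": ..., "content": ...} dicts from earlier calls.
    Hypothetical helper for illustration; it does not call the API.
    """
    return history + [{"role": "user", "content": user_message}]


# Turn 1: nothing to carry yet.
history: list[dict] = []
call_1 = build_request_messages(history, "My name is Alex")

# After the call, store both sides of the exchange (reply hard-coded here).
history = call_1 + [{"role": "assistant", "content": "Hi Alex!"}]

# Turn 2: the earlier exchange is sent along with the new question.
call_2 = build_request_messages(history, "What's my name?")
for message in call_2:
    print(f'{message["role"]}: {message["content"]}')
```

Passing `call_2` as `messages` is what lets the model answer the second question; each of the patterns below is a different way of deciding what goes into that list.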


Pattern 1: Full Conversation History (In-Context)

The simplest approach — append every turn to a running messages list.

import anthropic

client = anthropic.Anthropic()


class ConversationAgent:
    """Agent that maintains full conversation history in context."""

    def __init__(self, system: str, max_turns: int = 50):
        self.system = system
        self.messages = []
        self.max_turns = max_turns

    def chat(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})

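        # Trim if approaching limits (rough estimate: 2k tokens per turn).
        # The list has an odd length here (it ends with the new user message),
        # so keep an odd number of messages: the window then starts on a user
        # turn, which the Messages API expects of the first message.
        if len(self.messages) > self.max_turns * 2:
            self.messages = self.messages[-(self.max_turns * 2 - 1):]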

        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            system=self.system,
            messages=self.messages
        )

        assistant_message = response.content[0].text
        self.messages.append({"role": "assistant", "content": assistant_message})
        return assistant_message

    def clear(self):
        self.messages = []

    @property
    def token_estimate(self) -> int:
        """Rough estimate of context token usage."""
        total_chars = sum(len(m["content"]) for m in self.messages)
        return total_chars // 4  # ~4 chars per token


# Usage
agent = ConversationAgent(
    system="You are a helpful coding assistant. Remember context from earlier in the conversation."
)

print(agent.chat("I'm building a user authentication system in FastAPI"))
print(agent.chat("What's the best way to handle JWT refresh tokens?"))
print(agent.chat("Apply that pattern to the auth module we discussed"))
# Claude remembers the FastAPI context from earlier

When to use: Sessions under 30 minutes, 10-30 turns, topics that all relate to each other.

Limit: At ~100K tokens of history, quality degrades and costs increase substantially. A 50-turn conversation at 2K tokens/turn = 100K input tokens per API call = ~$0.30 per call at Sonnet pricing.
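
The arithmetic behind that figure is easy to check. The sketch below assumes $3 per million input tokens (the rate implied by the $0.30 figure above) and the same 2K tokens per turn; both function names are made up for this example. It also shows the cumulative cost, since every call resends all earlier turns.

```python
def input_cost_per_call(turns: int, tokens_per_turn: int, usd_per_million: float) -> float:
    """Input cost of one call that resends `turns` turns of history."""
    return turns * tokens_per_turn * usd_per_million / 1_000_000


def cumulative_input_cost(total_turns: int, tokens_per_turn: int, usd_per_million: float) -> float:
    """Total input cost of a conversation where call n resends n turns."""
    return sum(
        input_cost_per_call(n, tokens_per_turn, usd_per_million)
        for n in range(1, total_turns + 1)
    )


# Assumed figures: 2K tokens per turn, $3 per million input tokens.
PRICE = 3.0
print(f"call 50 alone: ${input_cost_per_call(50, 2_000, PRICE):.2f}")
print(f"all 50 calls:  ${cumulative_input_cost(50, 2_000, PRICE):.2f}")
```

Under these assumptions the 50th call costs $0.30 in input tokens, and the 50 calls together cost about $7.65, because total input grows roughly with the square of the number of turns. Output tokens are not included.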


Pattern 2: Summary Compression

When conversation history grows long, compress older turns into a summary and keep only recent turns verbatim.

from dataclasses import dataclass


@dataclass
class CompressedHistory:
    summary: str           # Compressed summary of old turns
    recent_messages: list  # Last N turns verbatim
    turns_summarized: int


class SummaryCompressionAgent:
    """Agent that compresses old conversation turns into summaries."""

    def __init__(self, system: str, verbatim_turns: int = 10, compress_every: int = 20):
        self.system = system
        self.messages = []
        self.summary = ""
        self.verbatim_turns = verbatim_turns  # Keep this many turns uncompressed
        self.compress_every = compress_every   # Compress after this many total turns

    def _compress_history(self):
        """Summarize old turns, keep recent ones verbatim."""
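        # Split so the verbatim window starts on a user turn (the API expects
        # the first message to come from the user).
        split = max(len(self.messages) - self.verbatim_turns * 2, 0)
        while split < len(self.messages) and self.messages[split]["role"] != "user":
            split += 1
        turns_to_compress = self.messages[:split]
        recent = self.messages[split:]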

        if not turns_to_compress:
            return

        # Summarize old turns
        turns_text = "\n".join(
            f"{m['role'].upper()}: {m['content']}" for m in turns_to_compress
        )

        response = client.messages.create(
            model="claude-haiku-4-5",  # Use cheaper model for summarization
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"""Summarize this conversation context into 3-5 bullet points.
Focus on: decisions made, key facts established, current task state.

CONVERSATION:
{turns_text}

Output bullet points only."""
            }]
        )

        new_summary = response.content[0].text
        if self.summary:
            self.summary = f"{self.summary}\n\n[Later:] {new_summary}"
        else:
            self.summary = new_summary

        self.messages = recent

    def chat(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})

        # Compress if needed
        if len(self.messages) > self.compress_every * 2:
            self._compress_history()

        # Build system with summary context
        system_with_context = self.system
        if self.summary:
            system_with_context += f"\n\n## Earlier conversation summary\n{self.summary}"

        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            system=system_with_context,
            messages=self.messages
        )

        assistant_message = response.content[0].text
        self.messages.append({"role": "assistant", "content": assistant_message})
        return assistant_message

When to use: Long sessions (1+ hour), 30-100+ turns, ongoing work where you need to reference earlier decisions.

Trade-off: Summary loses detail. Good for "remember we decided to use Postgres" — bad for "repeat my exact code from 40 turns ago."
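
To see which turns survive verbatim and which are reduced to a summary, the split can be exercised offline. This is a sketch under assumptions: `split_history` and `build_system_prompt` are hypothetical helper names, the conversation is fabricated, and no summarization call is made.

```python
def split_history(messages: list[dict], verbatim_turns: int) -> tuple[list[dict], list[dict]]:
    """Split messages into (older turns to summarize, recent turns kept verbatim).

    The kept window is nudged forward so that it begins with a user message.
    """
    split = max(len(messages) - verbatim_turns * 2, 0)
    while split < len(messages) and messages[split]["role"] != "user":
        split += 1
    return messages[:split], messages[split:]


def build_system_prompt(base: str, summary: str) -> str:
    """Append the running summary to the system prompt, if there is one."""
    if not summary:
        return base
    return f"{base}\n\n## Earlier conversation summary\n{summary}"


# Fabricated 6-turn conversation followed by a new user message (13 messages).
messages = []
for i in range(6):
    messages.append({"role": "user", "content": f"question {i}"})
    messages.append({"role": "assistant", "content": f"answer {i}"})
messages.append({"role": "user", "content": "question 6"})

old, recent = split_history(messages, verbatim_turns=2)
print(f"{len(old)} messages to summarize, {len(recent)} kept verbatim")
print("verbatim window starts with:", recent[0])
```

Anything in `old` is only available afterwards in whatever form the summary preserves, which is why exact code or wording from early turns is the first thing lost.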


Pattern 3: External Memory with Vector Search

Store facts and embeddings in a vector database, retrieve relevant memories at query time.

import json
import numpy as np
from pathlib import Path


class VectorMemory:
    """Simple in-memory vector store for agent memories."""

    def __init__(self, persist_path: str = None):
        self.memories = []  # [{"text": ..., "embedding": [...], "metadata": {...}}]
        self.persist_path = persist_path

        if persist_path and Path(persist_path).exists():
            self._load()

    def add(self, text: str, metadata: dict = None):
        """Add a memory with its embedding."""
        embedding = self._embed(text)
        self.memories.append({
            "text": text,
            "embedding": embedding,
            "metadata": metadata or {}
        })
        if self.persist_path:
            self._save()

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Retrieve top-k most relevant memories."""
        if not self.memories:
            return []

        query_embedding = self._embed(query)

        # Cosine similarity
        scores = []
        for memory in self.memories:
            sim = self._cosine_similarity(query_embedding, memory["embedding"])
            scores.append((sim, memory))

        scores.sort(key=lambda x: x[0], reverse=True)
        return [m for _, m in scores[:top_k]]

    def _embed(self, text: str) -> list[float]:
        """Toy embedding: Claude-extracted keywords hashed into a fixed-size vector.
        Claude has no embeddings endpoint, so this is only a placeholder.
        In production: use voyage-3 or text-embedding-3-small."""
        import hashlib

        response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=100,
            messages=[{
                "role": "user",
                "content": f"Summarize in exactly 20 keywords, comma-separated: {text[:500]}"
            }]
        )
        keywords = response.content[0].text.split(',')
        # Hashed bag-of-words: each keyword increments one of 256 buckets.
        # hashlib is stable across runs (the built-in hash() is salted per process),
        # and the fixed length keeps every vector comparable under cosine similarity.
        vector = [0.0] * 256
        for kw in keywords[:20]:
            kw = kw.strip().lower()
            if kw:
                bucket = int(hashlib.md5(kw.encode("utf-8")).hexdigest(), 16) % 256
                vector[bucket] += 1.0
        return vector

    def _cosine_similarity(self, a: list, b: list) -> float:
        a_arr = np.array(a)
        b_arr = np.array(b)
        if np.linalg.norm(a_arr) == 0 or np.linalg.norm(b_arr) == 0:
            return 0.0
        return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

    def _save(self):
        Path(self.persist_path).write_text(json.dumps(self.memories))

    def _load(self):
        self.memories = json.loads(Path(self.persist_path).read_text())


class MemoryAgent:
    """Agent that stores and retrieves facts from vector memory."""

    def __init__(self, system: str, memory_path: str = "/tmp/agent_memory.json"):
        self.system = system
        self.memory = VectorMemory(persist_path=memory_path)
        self.messages = []

    def remember(self, fact: str, metadata: dict = None):
        """Explicitly store a fact in memory."""
        self.memory.add(fact, metadata)

    def chat(self, user_message: str) -> str:
        # Retrieve relevant memories
        relevant_memories = self.memory.search(user_message, top_k=5)

        memory_context = ""
        if relevant_memories:
            memory_context = "\n\n## Relevant memories from earlier sessions\n" + "\n".join(
                f"- {m['text']}" for m in relevant_memories
            )

        self.messages.append({"role": "user", "content": user_message})

        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            system=self.system + memory_context,
            messages=self.messages[-19:]  # Last ~10 turns; an odd count keeps a user message first
        )

        response_text = response.content[0].text

        # Auto-extract facts from conversation to remember
        self._auto_remember(user_message, response_text)

        self.messages.append({"role": "assistant", "content": response_text})
        return response_text

    def _auto_remember(self, user_msg: str, assistant_msg: str):
        """Extract and store important facts from the conversation."""
        # Use Claude to decide what's worth remembering
        extraction = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"""From this conversation exchange, extract facts worth remembering for future sessions.
Output as JSON array of strings, or empty array [] if nothing worth saving.
Only save specific, factual information (preferences, decisions, facts about the user's system).

USER: {user_msg[:300]}
ASSISTANT: {assistant_msg[:300]}

JSON array only:"""
            }]
        )

        try:
            import re
            json_match = re.search(r'\[.*\]', extraction.content[0].text, re.DOTALL)
            if json_match:
                facts = json.loads(json_match.group())
                for fact in facts[:3]:  # Max 3 facts per turn
                    if isinstance(fact, str) and len(fact) > 10:
                        self.memory.add(fact)
        except (json.JSONDecodeError, AttributeError):
            pass

When to use: Customer support agents that need to remember user preferences across sessions, personal assistants with long-term context, knowledge management agents.

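Cost: Embedding generation adds API calls, one per stored memory and one per query. For production, use a dedicated embedding model (Voyage AI or OpenAI embeddings) instead of prompting Claude to produce vectors: dedicated embedding APIs are roughly an order of magnitude cheaper per token and return true semantic vectors.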
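
One way to keep those extra calls down is to cache embeddings so that each distinct text is embedded only once. The sketch below is an assumption-laden illustration, not part of the classes above: `CachedEmbedder` and `toy_embed` are hypothetical names, and `toy_embed` is a non-semantic stand-in for whatever embedding API you would call in production.

```python
import hashlib
from typing import Callable


class CachedEmbedder:
    """Wrap an embedding function so each distinct text is embedded only once."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self.embed_fn = embed_fn  # in production: a call to an embedding API
        self.cache: dict[str, list[float]] = {}
        self.calls = 0            # how many times embed_fn actually ran

    def __call__(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]


def toy_embed(text: str) -> list[float]:
    """Stand-in embedder (NOT semantic): a few character-count features."""
    return [float(len(text)), float(text.count(" ")), float(text.count("e"))]


embed = CachedEmbedder(toy_embed)
embed("User prefers PostgreSQL")
embed("User prefers PostgreSQL")  # served from the cache
embed("Deploys run on Fridays")
print("underlying embedding calls:", embed.calls)
```

Three lookups result in two underlying calls here. Repeated queries and re-added facts are the cases where a cache like this pays off; it does nothing for text that is always new.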


Pattern 4: Explicit State (Structured Data)

For task-oriented agents, maintain explicit state as a Python object. Don't rely on the LLM to "remember" task status — track it in code.

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import json


class TaskStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    BLOCKED = "blocked"
    DONE = "done"


@dataclass
class Task:
    id: str
    description: str
    status: TaskStatus = TaskStatus.PENDING
    result: Optional[str] = None
    subtasks: list = field(default_factory=list)


@dataclass
class AgentState:
    """Explicit state for a task-executing agent."""
    goal: str
    tasks: list[Task] = field(default_factory=list)
    facts: dict = field(default_factory=dict)  # key facts discovered
    turn: int = 0
    completed: bool = False

    def add_fact(self, key: str, value: str):
        self.facts[key] = value

    def to_context_string(self) -> str:
        """Serialize state to inject into Claude's context."""
        task_lines = []
        for t in self.tasks:
            prefix = {"pending": "⏳", "in_progress": "🔄", "blocked": "⛔", "done": "✅"}[t.status.value]
            task_lines.append(f"{prefix} {t.id}: {t.description}")
            if t.result:
                task_lines.append(f"   Result: {t.result[:100]}")

        facts_str = "\n".join(f"- {k}: {v}" for k, v in self.facts.items())

        return f"""## Current State (Turn {self.turn})
Goal: {self.goal}

Tasks:
{chr(10).join(task_lines) if task_lines else '(none defined yet)'}

Known facts:
{facts_str if facts_str else '(none yet)'}"""


class StatefulAgent:
    """Agent that maintains explicit structured state."""

    def __init__(self, goal: str):
        self.state = AgentState(goal=goal)
        self.messages = []

    def run(self, max_turns: int = 20) -> AgentState:
        tools = [
            {
                "name": "update_task_status",
                "description": "Update the status and result of a task",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "task_id": {"type": "string"},
                        "status": {"type": "string", "enum": ["pending", "in_progress", "blocked", "done"]},
                        "result": {"type": "string", "description": "Task result or reason for block"}
                    },
                    "required": ["task_id", "status"]
                }
            },
            {
                "name": "add_fact",
                "description": "Store an important fact discovered during task execution",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "key": {"type": "string"},
                        "value": {"type": "string"}
                    },
                    "required": ["key", "value"]
                }
            },
            {
                "name": "mark_complete",
                "description": "Mark the overall goal as complete",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "summary": {"type": "string"}
                    },
                    "required": ["summary"]
                }
            }
        ]

        # Initial planning message
        self.messages.append({
            "role": "user",
            "content": f"Plan and execute: {self.state.goal}"
        })

        while self.state.turn < max_turns and not self.state.completed:
            self.state.turn += 1

            response = client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=2048,
                system=f"""You are a task execution agent. Track your state using the provided tools.
                
{self.state.to_context_string()}""",
                tools=tools,
                messages=self.messages
            )

            self.messages.append({"role": "assistant", "content": response.content})

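            # Stop on any non-tool stop reason (end_turn, max_tokens, ...) so the
            # next request never carries two assistant messages in a row.
            if response.stop_reason != "tool_use":
                break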

            if response.stop_reason == "tool_use":
                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        result = self._handle_tool(block.name, block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": result
                        })
                self.messages.append({"role": "user", "content": tool_results})

        return self.state

    def _handle_tool(self, name: str, inputs: dict) -> str:
        if name == "update_task_status":
            task_id = inputs["task_id"]
            task = next((t for t in self.state.tasks if t.id == task_id), None)
            if not task:
                task = Task(id=task_id, description=task_id)
                self.state.tasks.append(task)
            task.status = TaskStatus(inputs["status"])
            if "result" in inputs:
                task.result = inputs["result"]
            return f"Task {task_id} updated to {inputs['status']}"

        elif name == "add_fact":
            self.state.add_fact(inputs["key"], inputs["value"])
            return f"Fact stored: {inputs['key']}"

        elif name == "mark_complete":
            self.state.completed = True
            return "Goal marked complete"

        return "Unknown tool"
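
The state-update logic can be checked without calling the API by replaying a sequence of tool calls by hand. The following is a sketch under assumptions: `MiniState` is a cut-down stand-in for `AgentState`, `apply_tool_call` is a hypothetical function mirroring `_handle_tool`, and the tool calls are invented. It also returns an error string for an unknown status rather than raising, so the model would see the problem in its `tool_result`.

```python
from dataclasses import dataclass, field


@dataclass
class MiniState:
    """Cut-down stand-in for AgentState, just enough for a dry run."""
    tasks: dict = field(default_factory=dict)  # task_id -> status
    facts: dict = field(default_factory=dict)
    completed: bool = False


VALID_STATUSES = {"pending", "in_progress", "blocked", "done"}


def apply_tool_call(state: MiniState, name: str, inputs: dict) -> str:
    """Apply one simulated tool call and return the text sent back as tool_result."""
    if name == "update_task_status":
        if inputs["status"] not in VALID_STATUSES:
            return f"Error: unknown status {inputs['status']!r}"
        state.tasks[inputs["task_id"]] = inputs["status"]
        return f"Task {inputs['task_id']} updated to {inputs['status']}"
    if name == "add_fact":
        state.facts[inputs["key"]] = inputs["value"]
        return f"Fact stored: {inputs['key']}"
    if name == "mark_complete":
        state.completed = True
        return "Goal marked complete"
    return "Unknown tool"


# Simulated sequence of tool calls (invented for illustration).
state = MiniState()
calls = [
    ("update_task_status", {"task_id": "audit-deps", "status": "in_progress"}),
    ("add_fact", {"key": "python_version", "value": "3.12"}),
    ("update_task_status", {"task_id": "audit-deps", "status": "done"}),
    ("mark_complete", {"summary": "Dependencies audited"}),
]
for name, inputs in calls:
    print(apply_tool_call(state, name, inputs))
print(state)
```

Because the state lives in ordinary Python objects, it can be unit-tested like this and inspected at any turn, independently of what the model says it has done.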

Choosing the Right Pattern

| Scenario                         | Pattern                          |
|----------------------------------|----------------------------------|
| Single-session chatbot (< 1hr)   | Full conversation history        |
| Long coding session (1-3hr)      | Summary compression              |
| Multi-session personal assistant | External memory                  |
| Task execution agent             | Explicit state                   |
| Production support agent         | External memory + explicit state |

Frequently Asked Questions

How do I persist memory across Claude Code sessions? Write memory to disk: a JSON file, SQLite database, or Redis. Load it at the start of each session. Claude Code's auto-memory in CLAUDE.md works for project context; for user-specific memory, you need explicit storage.
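
As a rough illustration of the explicit-storage option, here is a minimal SQLite sketch using only the standard library. The table layout and the function names (`open_store`, `save_fact`, `load_facts`) are assumptions made for this example; the demo uses an in-memory database, and passing a file path instead would persist across sessions.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the memory store. Pass a file path to persist to disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memory ("
        "user_id TEXT NOT NULL, key TEXT NOT NULL, value TEXT NOT NULL, "
        "PRIMARY KEY (user_id, key))"
    )
    return conn


def save_fact(conn: sqlite3.Connection, user_id: str, key: str, value) -> None:
    """Insert or overwrite one fact; values are stored as JSON text."""
    conn.execute(
        "INSERT OR REPLACE INTO memory (user_id, key, value) VALUES (?, ?, ?)",
        (user_id, key, json.dumps(value)),
    )
    conn.commit()


def load_facts(conn: sqlite3.Connection, user_id: str) -> dict:
    """Load every stored fact for one user, e.g. at the start of a session."""
    rows = conn.execute(
        "SELECT key, value FROM memory WHERE user_id = ? ORDER BY key", (user_id,)
    ).fetchall()
    return {key: json.loads(value) for key, value in rows}


conn = open_store()  # ":memory:" keeps the demo self-contained
save_fact(conn, "alex", "database", "Postgres")
save_fact(conn, "alex", "stack", ["FastAPI", "JWT"])
save_fact(conn, "sam", "database", "SQLite")
print(load_facts(conn, "alex"))
```

Keying rows by `user_id` keeps one user's facts out of another user's context; the loaded dictionary can then be rendered into the system prompt at session start.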

What's the cheapest way to add memory to an agent? Summary compression, using claude-haiku-4-5 for the summary generation step. A summary of a short history costs on the order of $0.001 (the figure scales with how much text you compress), and it keeps your main model context small, reducing per-call costs.

Should I use OpenAI embeddings or Voyage AI for external memory? Voyage AI (voyage-3) is purpose-built for retrieval and, in Voyage AI's published benchmarks, outperforms both OpenAI ada-002 and OpenAI's newer models on most code and technical text. At $0.06/M tokens, voyage-3 is also cheaper than text-embedding-3-large.

How many memories can an agent hold before retrieval quality degrades? Vector search stays effective up to ~10,000 memories. At 50,000+, consider adding metadata filters (by date, topic, user) before the vector search step to narrow the search space.
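
A metadata pre-filter of that kind might look like the sketch below. It assumes the same memory shape as `VectorMemory` (dicts with `text`, `embedding`, and `metadata`); `filtered_search` is a hypothetical function, and the two-dimensional vectors are hand-written stand-ins for real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(memories, query_embedding, top_k=5, **filters):
    """Keep memories whose metadata matches every filter, then rank by similarity."""
    candidates = [
        m for m in memories
        if all(m["metadata"].get(k) == v for k, v in filters.items())
    ]
    candidates.sort(key=lambda m: cosine(query_embedding, m["embedding"]), reverse=True)
    return candidates[:top_k]


# Tiny hand-written vectors stand in for real embeddings.
memories = [
    {"text": "Alex uses Postgres", "embedding": [1.0, 0.0], "metadata": {"user": "alex"}},
    {"text": "Alex deploys on Fly", "embedding": [0.0, 1.0], "metadata": {"user": "alex"}},
    {"text": "Sam uses Postgres", "embedding": [1.0, 0.1], "metadata": {"user": "sam"}},
]
results = filtered_search(memories, [1.0, 0.0], top_k=1, user="alex")
print(results[0]["text"])
```

Filtering first means similarity is only computed over the matching subset, and it also prevents one user's memories from being returned for another user's query.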


AI Disclosure: Written with Claude Code; patterns based on published SDK documentation and recommended best practices.