Memory and State in Claude Agents: Patterns That Scale

How to implement short-term, long-term, and semantic memory in Claude agents using the Anthropic SDK: conversation history, vector stores, and external storage.

Claude agents have three types of memory: in-context (the conversation messages list), external storage (a database), and semantic memory (a vector store for similarity search). Most production agents need all three. In-context memory is managed through the messages list you pass to the API. External storage persists across sessions. Semantic memory enables retrieval of relevant information without stuffing everything into context. This guide covers practical implementations of each. For a broad introduction to building agents with the Anthropic SDK, see the Claude Agent SDK Guide.


Why memory design matters

The Claude API is stateless: each messages.create() call is independent, so continuity across turns requires you to manage state yourself. The design decisions you make here have major implications for cost (every token of history is re-billed on each call), latency, and how much earlier context the agent can actually recall.


Pattern 1: Sliding window (in-context short-term memory)

The simplest memory pattern: keep the last N turns in the messages list. Old turns are discarded.

import anthropic
from collections import deque
from typing import Deque

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_agent_with_sliding_window(
    user_message: str,
    history: Deque[dict],
    max_turns: int = 10,
) -> tuple[str, Deque[dict]]:
    """
    Maintains a sliding window of the last max_turns messages.
    Returns the response and updated history.
    """
    # Add the new user message
    history.append({"role": "user", "content": user_message})
    
    # Trim to max_turns (in pairs: user + assistant = 1 turn)
    while len(history) > max_turns * 2:
        history.popleft()
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=list(history),
    )
    
    assistant_message = response.content[0].text
    history.append({"role": "assistant", "content": assistant_message})
    
    return assistant_message, history

# Usage
history: Deque[dict] = deque()
reply, history = run_agent_with_sliding_window("What is prompt caching?", history)
reply, history = run_agent_with_sliding_window("How does it save money?", history)

When to use: simple chatbots, short task agents.

Limitations: older context is lost. The agent "forgets" the beginning of long conversations.


Pattern 2: Summary compression

Instead of discarding old turns, compress them into a summary:

def compress_history(
    history: list[dict],
    keep_recent: int = 6,
) -> list[dict]:
    """
    Summarise old turns, keep recent turns verbatim.
    Reduces context length while preserving key information.
    """
    if len(history) <= keep_recent:
        return history
    
    old_turns = history[:-keep_recent]
    recent_turns = history[-keep_recent:]
    
    # Summarise the old turns
    summary_prompt = [
        {
            "role": "user",
            "content": f"""Summarize this conversation history in 3-5 bullet points, 
            preserving key facts, decisions, and user preferences:
            
            {_format_turns(old_turns)}
            
            Output format:
            - [key fact or decision]
            - [key fact or decision]
            ...
            """
        }
    ]
    
    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # use Haiku — this is cheap compression
        max_tokens=256,
        messages=summary_prompt,
    )
    
    summary_text = summary_response.content[0].text
    
    # Prepend the summary as a synthetic user/assistant exchange
    compressed = [
        {
            "role": "user",
            "content": f"[Previous conversation summary: {summary_text}]"
        },
        {
            "role": "assistant",
            "content": "Understood. I have context from our previous conversation."
        },
    ] + recent_turns
    
    return compressed

def _format_turns(history: list[dict]) -> str:
    lines = []
    for msg in history:
        prefix = "User" if msg["role"] == "user" else "Assistant"
        lines.append(f"{prefix}: {msg['content']}")
    return "\n".join(lines)
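A typical call site mirrors the sliding-window example: compress before each API request once the history grows. A minimal sketch, reusing the client defined earlier:

# Usage: compress old turns before sending to the main model
history = compress_history(history, keep_recent=6)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=history,
)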

When to use: long-running conversations where context continuity matters more than verbatim recall.

Cost note: Summary compression uses a cheap Haiku call but saves significantly more in reduced main-context token costs for Sonnet.
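A rough back-of-envelope illustrates why (assuming list prices of about $3/MTok for Sonnet input and $1/MTok for Haiku input; check current pricing): compressing 8,000 tokens of old history into a ~200-token summary costs a one-off Haiku call of under a cent, then saves roughly 7,800 Sonnet input tokens, about $0.023, on every subsequent turn.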


Pattern 3: External persistence (cross-session memory)

For agents that serve users across multiple sessions, store conversation state in a database:

import json
import anthropic
from datetime import datetime, timezone

# Plain psycopg2 connection to PostgreSQL (e.g. Neon)
import psycopg2

def save_session(
    db_conn,
    session_id: str,
    user_id: str,
    messages: list[dict],
) -> None:
    """Persist conversation state to PostgreSQL."""
    with db_conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO agent_sessions (session_id, user_id, messages, updated_at)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (session_id) DO UPDATE
            SET messages = EXCLUDED.messages,
                updated_at = EXCLUDED.updated_at
            """,
            (session_id, user_id, json.dumps(messages), datetime.now(timezone.utc)),
        )
    db_conn.commit()

def load_session(db_conn, session_id: str) -> list[dict]:
    """Load conversation state from PostgreSQL."""
    with db_conn.cursor() as cur:
        cur.execute(
            "SELECT messages FROM agent_sessions WHERE session_id = %s",
            (session_id,)
        )
        row = cur.fetchone()
        if row:
            # psycopg2 deserialises JSONB columns to Python objects automatically
            return row[0]
    return []

# Usage pattern for a persistent agent
def chat(session_id: str, user_message: str, db_conn) -> str:
    history = load_session(db_conn, session_id)
    history.append({"role": "user", "content": user_message})
    
    # Compress if too long before sending to API
    if len(history) > 20:
        history = compress_history(history)
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=history,
    )
    
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    save_session(db_conn, session_id, user_id="...", messages=history)
    return reply

Schema for agent_sessions table:

CREATE TABLE agent_sessions (
    session_id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    messages JSONB NOT NULL DEFAULT '[]',
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_agent_sessions_user_id ON agent_sessions(user_id);
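Wiring it together is then one call per user message. A minimal sketch, assuming a DATABASE_URL environment variable holding your Postgres (e.g. Neon) connection string:

import os

conn = psycopg2.connect(os.environ["DATABASE_URL"])
reply = chat(session_id="session-abc123", user_message="Where did we leave off?", db_conn=conn)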

Pattern 4: Semantic memory (vector store retrieval)

For agents that need to access large knowledge bases — documents, past interactions, user preferences — vector storage enables retrieval of relevant information without stuffing the full knowledge base into context.

The pattern: embed content as vectors, retrieve top-K relevant items at query time, inject them into context.

# Using OpenAI embeddings + the pgvector extension on Neon. The Anthropic API
# does not provide an embeddings endpoint, so use a dedicated embedding service
# (OpenAI shown here; any embedding provider works the same way).
from openai import OpenAI
from typing import List

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_text(text: str) -> List[float]:
    """Get an embedding vector for text."""
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-3-small",
    )
    return response.data[0].embedding

def store_memory(db_conn, content: str, metadata: dict) -> None:
    """Store a memory with its embedding."""
    embedding = embed_text(content)
    with db_conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO agent_memories (content, embedding, metadata)
            VALUES (%s, %s::vector, %s)
            """,
            (content, embedding, json.dumps(metadata))
        )
    db_conn.commit()

def retrieve_relevant_memories(
    db_conn,
    query: str,
    top_k: int = 5,
    similarity_threshold: float = 0.7,
) -> List[dict]:
    """Retrieve memories most relevant to the query."""
    query_embedding = embed_text(query)
    
    with db_conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, metadata, 1 - (embedding <=> %s::vector) as similarity
            FROM agent_memories
            WHERE 1 - (embedding <=> %s::vector) > %s
            ORDER BY similarity DESC
            LIMIT %s
            """,
            (query_embedding, query_embedding, similarity_threshold, top_k)
        )
        return [
            {"content": row[0], "metadata": row[1], "similarity": row[2]}
            for row in cur.fetchall()
        ]

def build_context_with_memory(query: str, db_conn) -> str:
    """Build enriched context by retrieving relevant memories."""
    memories = retrieve_relevant_memories(db_conn, query)
    if not memories:
        return ""
    
    memory_text = "\n".join([f"- {m['content']}" for m in memories])
    return f"""Relevant context from memory:
{memory_text}

"""

When to use: agents that serve individual users with personalised context, knowledge-base Q&A agents, research assistants with large document corpora.


Pattern 5: Key-value user preferences

For lighter-weight persistent memory (user preferences, settings, facts), a simple key-value store is often more efficient than a vector store:

from dataclasses import dataclass, asdict

@dataclass
class UserMemory:
    preferred_language: str = "python"
    expertise_level: str = "intermediate"
    project_context: str = ""
    last_topic: str = ""

def load_user_memory(db_conn, user_id: str) -> UserMemory:
    with db_conn.cursor() as cur:
        cur.execute(
            "SELECT memory FROM user_memory WHERE user_id = %s",
            (user_id,)
        )
        row = cur.fetchone()
        if row:
            return UserMemory(**json.loads(row[0]))
    return UserMemory()

def inject_user_memory_into_system(memory: UserMemory) -> str:
    """Build a system prompt suffix from user memory."""
    parts = []
    if memory.preferred_language:
        parts.append(f"Preferred language: {memory.preferred_language}")
    if memory.expertise_level:
        parts.append(f"Expertise level: {memory.expertise_level}")
    if memory.project_context:
        parts.append(f"Working on: {memory.project_context}")
    
    if not parts:
        return ""
    
    return "\n\nUser context:\n" + "\n".join(f"- {p}" for p in parts)
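A matching save helper and call site, sketched assuming a user_memory table with user_id TEXT PRIMARY KEY and a memory TEXT column (BASE_SYSTEM_PROMPT is a placeholder for your own system prompt):

def save_user_memory(db_conn, user_id: str, memory: UserMemory) -> None:
    """Upsert the user's memory as JSON."""
    with db_conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO user_memory (user_id, memory)
            VALUES (%s, %s)
            ON CONFLICT (user_id) DO UPDATE SET memory = EXCLUDED.memory
            """,
            (user_id, json.dumps(asdict(memory))),
        )
    db_conn.commit()

# Usage: fold the user's memory into the system prompt
BASE_SYSTEM_PROMPT = "You are a helpful coding assistant."  # placeholder

memory = load_user_memory(db_conn, user_id="user-42")
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=BASE_SYSTEM_PROMPT + inject_user_memory_into_system(memory),
    messages=history,
)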

Choosing the right pattern

Requirement                              Pattern
---------------------------------------  ------------------------------
Simple chatbot, short sessions           Sliding window
Long conversations, need continuity      Summary compression
Multi-session, user returns              External persistence (DB)
Large knowledge base, personalisation    Semantic memory (vector store)
User preferences, simple facts           Key-value store

Most production agents combine patterns 3 + 5 (DB persistence + KV preferences) and add semantic memory when the knowledge base grows beyond what fits in context.


Frequently Asked Questions

How many tokens of history should I keep in context? Aim for 20,000–40,000 tokens of history for most applications. Models perform well up to their full context window, but costs increase linearly. A sliding window of 10 turns (5k–10k tokens typically) plus a compressed summary of earlier turns is a good balance.
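To measure where you stand, the Python SDK exposes a token-counting endpoint. A quick pre-flight check, reusing compress_history from Pattern 2:

count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    messages=list(history),
)
if count.input_tokens > 40_000:
    history = compress_history(list(history))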

Does Claude have built-in memory? No. The API is stateless. claude.ai has built-in memory features for the consumer product, but these are not available through the API. You manage state entirely through the messages list and external storage.

What's the best vector store for small-scale agents? The pgvector PostgreSQL extension (available on Neon) is sufficient for corpora under 1M documents. Pinecone or Weaviate add operational complexity that only pays off at larger scale. For very simple use cases, a brute-force cosine similarity search over embeddings stored in a JSON column works fine at low scale.
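For that last option, a minimal in-process sketch, assuming the embeddings are already loaded into memory as lists of floats:

import numpy as np

def cosine_top_k(query_emb: list[float], stored: list[dict], k: int = 5) -> list[dict]:
    """Brute-force cosine similarity over in-memory embeddings."""
    q = np.array(query_emb)
    q = q / np.linalg.norm(q)
    scored = []
    for item in stored:  # each item: {"content": ..., "embedding": [...]}
        v = np.array(item["embedding"])
        sim = float(q @ (v / np.linalg.norm(v)))
        scored.append({**item, "similarity": sim})
    return sorted(scored, key=lambda s: s["similarity"], reverse=True)[:k]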

How do I handle memory for concurrent users? Each user has their own session ID, which maps to their own state in the database. There's no shared state between users. Use database transactions to prevent race conditions if a user makes requests in parallel.
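One way to serialise those parallel writes, sketched with a row lock against the agent_sessions table from Pattern 3:

with db_conn:  # psycopg2: commits on success, rolls back on error
    with db_conn.cursor() as cur:
        # Lock this session's row so concurrent requests queue up
        cur.execute(
            "SELECT messages FROM agent_sessions WHERE session_id = %s FOR UPDATE",
            (session_id,),
        )
        row = cur.fetchone()
        messages = row[0] if row else []
        messages.append({"role": "user", "content": user_message})
        cur.execute(
            "UPDATE agent_sessions SET messages = %s, updated_at = NOW() WHERE session_id = %s",
            (json.dumps(messages), session_id),
        )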

When should I use prompt caching instead of (or alongside) memory patterns? Prompt caching and memory management solve different problems. Caching reduces the cost of re-sending the same large prefix on each API call — useful for system prompts, tool schemas, and a static document. Memory patterns manage what you include in that prefix over time. In a long-running agent, combine both: cache the static system prompt and tools, and apply summary compression to keep the dynamic conversation history short. See Prompt Caching in Claude Agent SDK for the implementation.


Take It Further

Claude Agent SDK Cookbook: 40 Production Patterns — Pattern 18 covers the full Memory Architecture: sliding window, compression, vector retrieval, and the decision framework for choosing between them. Complete Python code included.

→ Get the Agent SDK Cookbook — $49

30-day money-back guarantee. Instant download.

AI Disclosure: Drafted with Claude Code; all SDK patterns from official Anthropic documentation as of April 2026.
