
Claude Agent Observability: Logging, Tracing, and Debugging Production Agents

How to add observability to production Claude agents: structured logging, trace IDs, token usage tracking, cost dashboards, and the debugging patterns that make failures diagnosable.


Production Claude agents need three observability layers: structured logging of every LLM call with token counts and latency, trace IDs that connect multi-turn conversations to individual requests, and a cost dashboard that shows per-user API spending before your bill arrives. Without these, debugging agent failures is guesswork and cost surprises are inevitable. This guide covers the full observability stack for production Claude agents, from structured logging to cost alerts.


Why Agent Observability Is Different

Standard web application observability tracks HTTP requests: status codes, latency, errors. That covers the surface of agent behavior but misses the most important signals: per-call token usage and cost, prompt cache hit rate, which tools the model called and whether they succeeded, and how many turns a conversation takes before resolving.

You need purpose-built agent observability on top of standard infrastructure monitoring.


Layer 1: Structured Logging

Every API call should emit a structured log event: not a print statement, but a JSON record you can query and aggregate.

Python logging setup

import logging
import json
import time
import uuid
from dataclasses import dataclass, asdict
from typing import Any
import anthropic

# Configure structured logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("claude_agent")


@dataclass
class LLMCallEvent:
    event_type: str = "llm_call"
    trace_id: str = ""
    session_id: str = ""
    user_id: str = ""
    model: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    latency_ms: float = 0.0
    stop_reason: str = ""
    tool_calls: list = None
    cost_usd: float = 0.0
    error: str = ""

    def __post_init__(self):
        if self.tool_calls is None:
            self.tool_calls = []


def calculate_cost(model: str, input_tokens: int, output_tokens: int,
                   cache_read_tokens: int = 0) -> float:
    """Calculate cost in USD. Prices are per million tokens; verify
    against current published rates before relying on these numbers."""
    pricing = {
        "claude-haiku-4-5":  {"input": 1.00, "output": 5.00, "cache_read": 0.10},
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00, "cache_read": 0.30},
        "claude-opus-4-1":   {"input": 15.00, "output": 75.00, "cache_read": 1.50},
    }
    p = pricing.get(model, pricing["claude-sonnet-4-5"])
    # The API reports input_tokens *excluding* cached tokens, so cache
    # reads are billed separately at the discounted cache-read rate.
    return (
        input_tokens * p["input"] / 1_000_000 +
        cache_read_tokens * p["cache_read"] / 1_000_000 +
        output_tokens * p["output"] / 1_000_000
    )


class ObservableAnthropicClient:
    """Anthropic client wrapper that emits structured logs."""

    def __init__(self, session_id: str = None, user_id: str = None):
        self.client = anthropic.Anthropic()
        self.session_id = session_id or str(uuid.uuid4())
        self.user_id = user_id or "anonymous"

    def create(self, trace_id: str = None, **kwargs) -> Any:
        """Create a message with full observability."""
        trace_id = trace_id or str(uuid.uuid4())
        event = LLMCallEvent(
            trace_id=trace_id,
            session_id=self.session_id,
            user_id=self.user_id,
            model=kwargs.get("model", "unknown"),
        )

        start = time.time()
        try:
            response = self.client.messages.create(**kwargs)

            # Extract usage
            usage = response.usage
            event.input_tokens = usage.input_tokens
            event.output_tokens = usage.output_tokens
            event.cache_read_tokens = getattr(usage, "cache_read_input_tokens", 0)
            event.cache_write_tokens = getattr(usage, "cache_creation_input_tokens", 0)
            event.stop_reason = response.stop_reason or ""

            # Extract tool calls
            event.tool_calls = [
                {"name": b.name, "input_keys": list(b.input.keys())}
                for b in response.content if b.type == "tool_use"
            ]

            # Calculate cost
            event.cost_usd = calculate_cost(
                event.model, event.input_tokens,
                event.output_tokens, event.cache_read_tokens
            )

        except Exception as e:
            event.error = f"{type(e).__name__}: {str(e)}"
            raise
        finally:
            event.latency_ms = (time.time() - start) * 1000
            logger.info(json.dumps(asdict(event)))

        return response

Usage in an agent

client = ObservableAnthropicClient(
    session_id="conv_abc123",
    user_id="user_789"
)

response = client.create(
    trace_id="turn_1",
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this report..."}]
)

Output log:

{
  "event_type": "llm_call",
  "trace_id": "turn_1",
  "session_id": "conv_abc123",
  "user_id": "user_789",
  "model": "claude-sonnet-4-5",
  "input_tokens": 1842,
  "output_tokens": 312,
  "cache_read_tokens": 1200,
  "cache_write_tokens": 0,
  "latency_ms": 2341.5,
  "stop_reason": "end_turn",
  "tool_calls": [],
  "cost_usd": 0.010566,
  "error": ""
}

Layer 2: Conversation Tracing

Link every turn in a conversation with a consistent conversation ID, so you can replay the full context when debugging.

import json
from pathlib import Path


class ConversationTrace:
    """Records a complete conversation for debugging."""

    def __init__(self, conversation_id: str, user_id: str, log_dir: str = "/tmp/traces"):
        self.conversation_id = conversation_id
        self.user_id = user_id
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.turns = []

    def record_turn(self, turn_num: int, messages: list, response_content: list,
                    usage: dict, tool_results: list = None):
        """Record a single conversation turn."""
        self.turns.append({
            "turn": turn_num,
            "messages_sent": len(messages),
            "response_content_types": [b.get("type") if isinstance(b, dict) else b.type
                                        for b in response_content],
            "usage": usage,
            "tool_calls": [b.name for b in response_content
                           if hasattr(b, "type") and b.type == "tool_use"],
            "tool_results_count": len(tool_results) if tool_results else 0,
        })

    def save(self):
        """Persist the trace to disk."""
        trace_path = self.log_dir / f"trace_{self.conversation_id}.json"
        trace_path.write_text(json.dumps({
            "conversation_id": self.conversation_id,
            "user_id": self.user_id,
            "total_turns": len(self.turns),
            "turns": self.turns,
        }, indent=2))
        return str(trace_path)


# Usage in agent loop
trace = ConversationTrace(conversation_id="conv_abc123", user_id="user_789")

for turn_num, step in enumerate(agent_loop):  # your agent loop
    response = client.messages.create(...)
    trace.record_turn(
        turn_num=turn_num,
        messages=messages,
        response_content=response.content,
        usage={"input": response.usage.input_tokens, "output": response.usage.output_tokens}
    )

trace_path = trace.save()
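Once traces are on disk, a small loader turns post-hoc debugging into a one-liner. A minimal sketch (the `summarize_trace` helper and its output format are illustrative additions, not part of the class above):

```python
import json
from pathlib import Path


def summarize_trace(trace_path: str) -> list:
    """Summarize a saved conversation trace, one line per turn."""
    data = json.loads(Path(trace_path).read_text())
    lines = [f"Conversation {data['conversation_id']} "
             f"({data['user_id']}, {data['total_turns']} turns)"]
    for t in data["turns"]:
        tools = ", ".join(t["tool_calls"]) or "none"
        lines.append(f"turn {t['turn']}: {t['messages_sent']} messages sent, "
                     f"tools: {tools}")
    return lines
```

Point it at any `trace_<conversation_id>.json` file to see turn counts and tool usage at a glance.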

Layer 3: Cost Dashboard

Aggregate token usage by user and time window, so you can catch runaway spend before the bill arrives.

Simple daily cost aggregator

import json
import sqlite3
from datetime import date
from pathlib import Path


class CostTracker:
    """SQLite-backed cost tracker for per-user API spend."""

    def __init__(self, db_path: str = "/tmp/agent_costs.db"):
        self.conn = sqlite3.connect(db_path)
        self._init_db()

    def _init_db(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS usage (
                date TEXT,
                user_id TEXT,
                model TEXT,
                input_tokens INTEGER,
                output_tokens INTEGER,
                cache_read_tokens INTEGER,
                cost_usd REAL,
                PRIMARY KEY (date, user_id, model)
            )
        """)
        self.conn.execute("""
            CREATE INDEX IF NOT EXISTS idx_user_date ON usage (user_id, date)
        """)
        self.conn.commit()

    def record(self, user_id: str, model: str, input_tokens: int,
               output_tokens: int, cache_read_tokens: int, cost_usd: float):
        today = date.today().isoformat()
        self.conn.execute("""
            INSERT INTO usage (date, user_id, model, input_tokens, output_tokens,
                               cache_read_tokens, cost_usd)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT (date, user_id, model) DO UPDATE SET
                input_tokens = input_tokens + excluded.input_tokens,
                output_tokens = output_tokens + excluded.output_tokens,
                cache_read_tokens = cache_read_tokens + excluded.cache_read_tokens,
                cost_usd = cost_usd + excluded.cost_usd
        """, (today, user_id, model, input_tokens, output_tokens, cache_read_tokens, cost_usd))
        self.conn.commit()

    def get_daily_cost(self, user_id: str, days: int = 7) -> list:
        rows = self.conn.execute("""
            SELECT date, SUM(cost_usd) as total_cost,
                   SUM(input_tokens) as total_input,
                   SUM(output_tokens) as total_output
            FROM usage
            WHERE user_id = ?
              AND date >= date('now', ? || ' days')
            GROUP BY date
            ORDER BY date DESC
        """, (user_id, f"-{days}")).fetchall()
        return [{"date": r[0], "cost_usd": r[1], "input_tokens": r[2], "output_tokens": r[3]}
                for r in rows]

    def get_top_users_today(self, limit: int = 10) -> list:
        today = date.today().isoformat()
        rows = self.conn.execute("""
            SELECT user_id, SUM(cost_usd) as total_cost
            FROM usage
            WHERE date = ?
            GROUP BY user_id
            ORDER BY total_cost DESC
            LIMIT ?
        """, (today, limit)).fetchall()
        return [{"user_id": r[0], "cost_usd": r[1]} for r in rows]
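With the tracker in place, a periodic alert check is a few lines. A sketch, where the $5/day threshold and the `send_alert` callback are placeholders for your own budget and notifier:

```python
DAILY_LIMIT_USD = 5.00  # per-user daily budget (illustrative)


def check_cost_alerts(tracker, send_alert) -> list:
    """Notify about users whose spend today exceeds the daily budget."""
    # tracker is a CostTracker (or anything with get_top_users_today)
    offenders = [u for u in tracker.get_top_users_today(limit=100)
                 if u["cost_usd"] > DAILY_LIMIT_USD]
    for u in offenders:
        send_alert(f"User {u['user_id']} at ${u['cost_usd']:.2f} today "
                   f"(limit ${DAILY_LIMIT_USD:.2f})")
    return offenders
```

Run it from a cron job or scheduler tick; `send_alert` can be a Slack webhook, an email sender, or just `print` in development.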

Key Metrics to Track

| Metric | Why It Matters | Alert Threshold |
| --- | --- | --- |
| Cache hit rate | Below 70% means caching isn't working | < 60% on stable agents |
| Average turns per conversation | High turns = possible loops or vague prompts | > 15 turns avg |
| P95 latency | User experience | > 10s for sync calls |
| Cost per conversation | Business unit economics | > $0.50 for most use cases |
| Tool call failure rate | Reliability signal | > 5% failures |
| Error rate (429, 529) | Rate limit / overload issues | > 1% |
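The cache hit rate can be computed straight from the Layer 1 log events. A sketch, assuming each event is the JSON record shown earlier and that `input_tokens` counts only non-cached tokens (cache reads are reported separately):

```python
def cache_hit_rate(events: list) -> float:
    """Fraction of prompt tokens served from cache across llm_call events."""
    cached = sum(e.get("cache_read_tokens", 0) for e in events)
    total = sum(e.get("input_tokens", 0) + e.get("cache_read_tokens", 0)
                for e in events)
    return cached / total if total else 0.0
```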

Frequently Asked Questions

What should I log for every Claude API call? At minimum: trace/conversation ID, user ID, model, input tokens, output tokens, cache hit tokens, latency in ms, stop reason, tool calls made, and cost in USD. This gives you everything needed to debug a bad response and understand cost drivers.

How do I debug a bad agent response after the fact? With a conversation trace saved to disk, you can replay the exact messages sent and responses received. Look at: how many turns it took, which tools were called, and whether the context accumulated unexpected content.

What's a healthy prompt cache hit rate? For agents with consistent system prompts, expect 85-95% cache hit rate after the first few requests warm the cache. Below 60% suggests the cache_control isn't configured correctly or the system prompt is varying unexpectedly.

How do I set cost alerts? Track daily spend per user and send an alert when any user exceeds a daily threshold (e.g., $5/day). At the account level, set hard limits in the Anthropic Console under Workspace settings.

Should I use LangSmith or build my own observability? LangSmith is excellent for LangChain users and provides out-of-the-box trace visualization. For native Anthropic SDK users, the lightweight logging patterns above are sufficient for most products and avoid adding another dependency. Use LangSmith if you need advanced evaluation features or already use LangChain.




Go Deeper

Agent SDK Cookbook — $49 — Production observability patterns including the full logging stack, cost dashboard implementation, conversation replay tool, and alerting setup. Python and TypeScript versions included.

→ Get the Agent SDK Cookbook — $49

30-day money-back guarantee. Instant download.

AI Disclosure: Written with Claude Code; patterns tested in production agent deployments.
