# Claude Agent Observability: Logging, Tracing, and Debugging Production Agents
Production Claude agents need three observability layers: structured logging of every LLM call with token counts and latency, trace IDs that connect multi-turn conversations to individual requests, and a cost dashboard that shows per-user API spending before your bill arrives. Without these, debugging agent failures is guesswork and cost surprises are inevitable. This guide covers the full observability stack for production Claude agents, from structured logging to cost alerts.
## Why Agent Observability Is Different
Standard web application observability tracks HTTP requests: status codes, latency, errors. This covers the surface of agent behavior but misses the most important signals:
- What did the agent actually do? (tool calls, reasoning steps)
- Why did it give a bad answer? (context, instructions, model version)
- How much did each user cost? (token usage by conversation)
- Is the agent looping? (turn count anomalies)
- Did prompt caching work? (cache hit rate by conversation type)
You need purpose-built agent observability on top of standard infrastructure monitoring.
## Layer 1: Structured Logging
Every API call should emit a structured log event: not a print statement, but a JSON record you can filter, aggregate, and query.
### Python logging setup

```python
import json
import logging
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any

import anthropic

# Configure structured logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("claude_agent")


@dataclass
class LLMCallEvent:
    event_type: str = "llm_call"
    trace_id: str = ""
    session_id: str = ""
    user_id: str = ""
    model: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    latency_ms: float = 0.0
    stop_reason: str = ""
    tool_calls: list = field(default_factory=list)
    cost_usd: float = 0.0
    error: str = ""


def calculate_cost(model: str, input_tokens: int, output_tokens: int,
                   cache_read_tokens: int = 0) -> float:
    """Calculate cost in USD from per-million-token pricing."""
    pricing = {
        "claude-haiku-4-5": {"input": 0.80, "output": 4.00, "cache_read": 0.08},
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00, "cache_read": 0.30},
        "claude-opus-4-7": {"input": 15.00, "output": 75.00, "cache_read": 1.50},
    }
    p = pricing.get(model, pricing["claude-sonnet-4-5"])
    # The API's usage.input_tokens already excludes cached tokens,
    # so bill it directly rather than subtracting cache reads from it.
    return (
        input_tokens * p["input"] / 1_000_000 +
        cache_read_tokens * p["cache_read"] / 1_000_000 +
        output_tokens * p["output"] / 1_000_000
    )


class ObservableAnthropicClient:
    """Anthropic client wrapper that emits structured logs."""

    def __init__(self, session_id: str = None, user_id: str = None):
        self.client = anthropic.Anthropic()
        self.session_id = session_id or str(uuid.uuid4())
        self.user_id = user_id or "anonymous"

    def create(self, trace_id: str = None, **kwargs) -> Any:
        """Create a message with full observability."""
        trace_id = trace_id or str(uuid.uuid4())
        event = LLMCallEvent(
            trace_id=trace_id,
            session_id=self.session_id,
            user_id=self.user_id,
            model=kwargs.get("model", "unknown"),
        )
        start = time.time()
        try:
            response = self.client.messages.create(**kwargs)
            # Extract usage
            usage = response.usage
            event.input_tokens = usage.input_tokens
            event.output_tokens = usage.output_tokens
            event.cache_read_tokens = getattr(usage, "cache_read_input_tokens", 0) or 0
            event.cache_write_tokens = getattr(usage, "cache_creation_input_tokens", 0) or 0
            event.stop_reason = response.stop_reason or ""
            # Extract tool calls (log names and input keys only, never raw inputs)
            event.tool_calls = [
                {"name": b.name, "input_keys": list(b.input.keys())}
                for b in response.content if b.type == "tool_use"
            ]
            # Calculate cost
            event.cost_usd = calculate_cost(
                event.model, event.input_tokens,
                event.output_tokens, event.cache_read_tokens,
            )
            return response
        except Exception as e:
            event.error = f"{type(e).__name__}: {e}"
            raise
        finally:
            event.latency_ms = (time.time() - start) * 1000
            logger.info(json.dumps(asdict(event)))
```
### Usage in an agent

```python
client = ObservableAnthropicClient(
    session_id="conv_abc123",
    user_id="user_789",
)

response = client.create(
    trace_id="turn_1",
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this report..."}],
)
```
Output log:

```json
{
  "event_type": "llm_call",
  "trace_id": "turn_1",
  "session_id": "conv_abc123",
  "user_id": "user_789",
  "model": "claude-sonnet-4-5",
  "input_tokens": 1842,
  "output_tokens": 312,
  "cache_read_tokens": 1200,
  "cache_write_tokens": 0,
  "latency_ms": 2341.5,
  "stop_reason": "end_turn",
  "tool_calls": [],
  "cost_usd": 0.010566,
  "error": ""
}
```
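Once events land as one JSON object per line, anomaly checks become small scripts. Here is a sketch that flags possibly-looping sessions by turn count, assuming the events above are collected into a JSONL file; the path and the 15-turn threshold are illustrative:

```python
import json
from collections import Counter

def sessions_over_turn_limit(log_path: str, max_turns: int = 15) -> list[str]:
    """Return session IDs whose llm_call count exceeds max_turns (possible loops)."""
    turns = Counter()
    with open(log_path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON log lines (e.g., framework noise)
            if event.get("event_type") == "llm_call":
                turns[event["session_id"]] += 1
    return [sid for sid, count in turns.items() if count > max_turns]
```

Run it from a cron job against the previous hour's logs and page on any non-empty result.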
## Layer 2: Conversation Tracing
Link every turn in a conversation with a consistent conversation ID so you can replay the full context when debugging.
```python
import json
from pathlib import Path


class ConversationTrace:
    """Records a complete conversation for debugging."""

    def __init__(self, conversation_id: str, user_id: str, log_dir: str = "/tmp/traces"):
        self.conversation_id = conversation_id
        self.user_id = user_id
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.turns = []

    def record_turn(self, turn_num: int, messages: list, response_content: list,
                    usage: dict, tool_results: list = None):
        """Record a single conversation turn."""
        self.turns.append({
            "turn": turn_num,
            "messages_sent": len(messages),
            "response_content_types": [b.get("type") if isinstance(b, dict) else b.type
                                       for b in response_content],
            "usage": usage,
            "tool_calls": [b.name for b in response_content
                           if hasattr(b, "type") and b.type == "tool_use"],
            "tool_results_count": len(tool_results) if tool_results else 0,
        })

    def save(self):
        """Persist the trace to disk."""
        trace_path = self.log_dir / f"trace_{self.conversation_id}.json"
        trace_path.write_text(json.dumps({
            "conversation_id": self.conversation_id,
            "user_id": self.user_id,
            "total_turns": len(self.turns),
            "turns": self.turns,
        }, indent=2))
        return str(trace_path)


# Usage in an agent loop; agent_loop is assumed to yield the message
# list sent on each turn
trace = ConversationTrace(conversation_id="conv_abc123", user_id="user_789")
for turn_num, messages in enumerate(agent_loop):
    response = client.messages.create(...)
    trace.record_turn(
        turn_num=turn_num,
        messages=messages,
        response_content=response.content,
        usage={"input": response.usage.input_tokens, "output": response.usage.output_tokens},
    )
trace_path = trace.save()
```
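Reading a saved trace back is a one-liner, but a small summary printer makes post-hoc debugging faster. A sketch that assumes traces were written by the `save()` method above, to the same default directory:

```python
import json
from pathlib import Path

def replay_trace(conversation_id: str, log_dir: str = "/tmp/traces") -> None:
    """Print a turn-by-turn summary of a saved conversation trace."""
    trace_path = Path(log_dir) / f"trace_{conversation_id}.json"
    trace = json.loads(trace_path.read_text())
    print(f"Conversation {trace['conversation_id']} "
          f"(user {trace['user_id']}, {trace['total_turns']} turns)")
    for t in trace["turns"]:
        tools = ", ".join(t["tool_calls"]) or "none"
        print(f"  turn {t['turn']}: {t['messages_sent']} messages sent, "
              f"tools: {tools}, usage: {t['usage']}")
```

A sudden jump in `messages_sent` between turns usually points at oversized tool results accumulating in context.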
## Layer 3: Cost Dashboard
Aggregate token usage by user and time window so you can see spend building up before the bill arrives.
### Simple daily cost aggregator

```python
import sqlite3
from datetime import date


class CostTracker:
    """SQLite-backed cost tracker for per-user API spend."""

    def __init__(self, db_path: str = "/tmp/agent_costs.db"):
        self.conn = sqlite3.connect(db_path)
        self._init_db()

    def _init_db(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS usage (
                date TEXT,
                user_id TEXT,
                model TEXT,
                input_tokens INTEGER,
                output_tokens INTEGER,
                cache_read_tokens INTEGER,
                cost_usd REAL,
                PRIMARY KEY (date, user_id, model)
            )
        """)
        self.conn.execute("""
            CREATE INDEX IF NOT EXISTS idx_user_date ON usage (user_id, date)
        """)
        self.conn.commit()

    def record(self, user_id: str, model: str, input_tokens: int,
               output_tokens: int, cache_read_tokens: int, cost_usd: float):
        today = date.today().isoformat()
        self.conn.execute("""
            INSERT INTO usage (date, user_id, model, input_tokens, output_tokens,
                               cache_read_tokens, cost_usd)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT (date, user_id, model) DO UPDATE SET
                input_tokens = input_tokens + excluded.input_tokens,
                output_tokens = output_tokens + excluded.output_tokens,
                cache_read_tokens = cache_read_tokens + excluded.cache_read_tokens,
                cost_usd = cost_usd + excluded.cost_usd
        """, (today, user_id, model, input_tokens, output_tokens, cache_read_tokens, cost_usd))
        self.conn.commit()

    def get_daily_cost(self, user_id: str, days: int = 7) -> list:
        rows = self.conn.execute("""
            SELECT date, SUM(cost_usd) as total_cost,
                   SUM(input_tokens) as total_input,
                   SUM(output_tokens) as total_output
            FROM usage
            WHERE user_id = ?
              AND date >= date('now', ? || ' days')
            GROUP BY date
            ORDER BY date DESC
        """, (user_id, f"-{days}")).fetchall()
        return [{"date": r[0], "cost_usd": r[1], "input_tokens": r[2], "output_tokens": r[3]}
                for r in rows]

    def get_top_users_today(self, limit: int = 10) -> list:
        today = date.today().isoformat()
        rows = self.conn.execute("""
            SELECT user_id, SUM(cost_usd) as total_cost
            FROM usage
            WHERE date = ?
            GROUP BY user_id
            ORDER BY total_cost DESC
            LIMIT ?
        """, (today, limit)).fetchall()
        return [{"user_id": r[0], "cost_usd": r[1]} for r in rows]
```
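With the tracker in place, a scheduled job can turn `get_top_users_today()` into alerts. A minimal sketch; the $5/day threshold is illustrative and the print statement is a placeholder for your own notification hook (Slack, PagerDuty, email):

```python
def check_cost_alerts(tracker, daily_limit_usd: float = 5.0) -> list[dict]:
    """Return users whose spend today exceeds the limit; wire this to your alerting."""
    over_limit = [
        u for u in tracker.get_top_users_today(limit=100)
        if u["cost_usd"] > daily_limit_usd
    ]
    for user in over_limit:
        # Placeholder notification: replace with your real alerting hook
        print(f"ALERT: {user['user_id']} spent ${user['cost_usd']:.2f} today")
    return over_limit
```

Run it every few minutes rather than daily: a runaway loop can burn through a day's budget in an hour.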
## Key Metrics to Track
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Cache hit rate | A low rate means caching isn't reducing cost | < 60% on stable agents |
| Average turns per conversation | High turns = possible loops or vague prompts | > 15 turns avg |
| P95 latency | User experience | > 10s for sync calls |
| Cost per conversation | Business unit economics | > $0.50 for most use cases |
| Tool call failure rate | Reliability signal | > 5% failures |
| Error rate (429, 529) | Rate limit / overload issues | > 1% |
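Cache hit rate isn't reported as a single field; one way to derive it from the logged token counts, assuming (as the logger above does) that `input_tokens` excludes cached tokens, so the full prompt is input plus cache reads:

```python
def cache_hit_rate(input_tokens: int, cache_read_tokens: int) -> float:
    """Fraction of prompt tokens served from cache for one call.

    Assumes input_tokens counts only non-cached tokens, so the
    total prompt size is input_tokens + cache_read_tokens.
    """
    total = input_tokens + cache_read_tokens
    return cache_read_tokens / total if total else 0.0
```

Average this per conversation type rather than globally: a new conversation's first turn always misses, so cold-start traffic drags the global number down.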
## Frequently Asked Questions
**What should I log for every Claude API call?** At minimum: trace/conversation ID, user ID, model, input tokens, output tokens, cache hit tokens, latency in ms, stop reason, tool calls made, and cost in USD. This gives you everything needed to debug a bad response and understand cost drivers.
**How do I debug a bad agent response after the fact?** With a conversation trace saved to disk, you can replay the exact messages sent and responses received. Look at: how many turns it took, which tools were called, and whether the context accumulated unexpected content.
**What's a healthy prompt cache hit rate?** For agents with consistent system prompts, expect 85-95% cache hit rate after the first few requests warm the cache. Below 60% suggests the cache_control isn't configured correctly or the system prompt is varying unexpectedly.
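If the hit rate is low, check how the system prompt is marked. A sketch of a cacheable system block using the API's `cache_control` field (the prompt text is illustrative):

```python
# A cacheable system block: stable text plus a cache_control marker.
# Everything up to and including the marked block is written to cache on
# the first call and read from cache on later calls with an identical prefix,
# so keep per-request text (dates, user names) out of it.
CACHED_SYSTEM = [{
    "type": "text",
    "text": "You are a support agent. <long, stable instructions here>",
    "cache_control": {"type": "ephemeral"},
}]

# Passed as system=CACHED_SYSTEM to client.messages.create(...)
```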
**How do I set cost alerts?** Track daily spend per user and send an alert when any user exceeds a daily threshold (e.g., $5/day). At the account level, set hard limits in the Anthropic Console under Workspace settings.
**Should I use LangSmith or build my own observability?** LangSmith is excellent for LangChain users and provides out-of-the-box trace visualization. For native Anthropic SDK users, the lightweight logging patterns above are sufficient for most products and avoid adding another dependency. Use LangSmith if you need advanced evaluation features or already use LangChain.
## Related Guides
- How to Handle Errors and Retries in Claude Agent SDK — Error handling patterns
- Claude Agent SDK: Build Automation Agents — Full SDK guide
- Token Counting: Why Your Estimates Are Wrong — Token math
## Go Deeper
Agent SDK Cookbook — $49 — Production observability patterns including the full logging stack, cost dashboard implementation, conversation replay tool, and alerting setup. Python and TypeScript versions included.
→ Get the Agent SDK Cookbook — $49
30-day money-back guarantee. Instant download.