# Claude Agent Observability: Logging, Tracing, and Debugging Production Agents
Production Claude agents need three observability layers: structured logging of every LLM call with token counts and latency, trace IDs that connect multi-turn conversations to individual requests, and a cost dashboard that shows per-user API spending before your bill arrives. Without these, debugging agent failures is guesswork and cost surprises are inevitable. This guide covers the full observability stack for production Claude agents, from structured logging to cost alerts.
## Why Agent Observability Is Different
Standard web application observability tracks HTTP requests: status codes, latency, errors. This covers the surface of agent behavior but misses the most important signals:
- What did the agent actually do? (tool calls, reasoning steps)
- Why did it give a bad answer? (context, instructions, model version)
- How much did each user cost? (token usage by conversation)
- Is the agent looping? (turn count anomalies)
- Did prompt caching work? (cache hit rate by conversation type)
You need purpose-built agent observability on top of standard infrastructure monitoring.
## Layer 1: Structured Logging
Every API call should emit a structured log event: not a print statement, but a JSON record you can filter, aggregate, and query.
### Python logging setup

```python
import json
import logging
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any

import anthropic

# Configure structured logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("claude_agent")


@dataclass
class LLMCallEvent:
    event_type: str = "llm_call"
    trace_id: str = ""
    session_id: str = ""
    user_id: str = ""
    model: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    latency_ms: float = 0.0
    stop_reason: str = ""
    tool_calls: list = field(default_factory=list)
    cost_usd: float = 0.0
    error: str = ""


def calculate_cost(model: str, input_tokens: int, output_tokens: int,
                   cache_read_tokens: int = 0) -> float:
    """Calculate cost in USD from per-million-token pricing."""
    pricing = {
        "claude-haiku-4-5": {"input": 0.80, "output": 4.00, "cache_read": 0.08},
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00, "cache_read": 0.30},
        "claude-opus-4-7": {"input": 15.00, "output": 75.00, "cache_read": 1.50},
    }
    p = pricing.get(model, pricing["claude-sonnet-4-5"])
    # The API's usage.input_tokens already excludes cached tokens,
    # so bill it directly rather than subtracting cache reads from it.
    return (
        input_tokens * p["input"] / 1_000_000 +
        cache_read_tokens * p["cache_read"] / 1_000_000 +
        output_tokens * p["output"] / 1_000_000
    )


class ObservableAnthropicClient:
    """Anthropic client wrapper that emits structured logs."""

    def __init__(self, session_id: str = None, user_id: str = None):
        self.client = anthropic.Anthropic()
        self.session_id = session_id or str(uuid.uuid4())
        self.user_id = user_id or "anonymous"

    def create(self, trace_id: str = None, **kwargs) -> Any:
        """Create a message with full observability."""
        trace_id = trace_id or str(uuid.uuid4())
        event = LLMCallEvent(
            trace_id=trace_id,
            session_id=self.session_id,
            user_id=self.user_id,
            model=kwargs.get("model", "unknown"),
        )
        start = time.time()
        try:
            response = self.client.messages.create(**kwargs)
            # Extract usage
            usage = response.usage
            event.input_tokens = usage.input_tokens
            event.output_tokens = usage.output_tokens
            event.cache_read_tokens = getattr(usage, "cache_read_input_tokens", 0) or 0
            event.cache_write_tokens = getattr(usage, "cache_creation_input_tokens", 0) or 0
            event.stop_reason = response.stop_reason or ""
            # Extract tool calls (log names and input keys only, never raw inputs)
            event.tool_calls = [
                {"name": b.name, "input_keys": list(b.input.keys())}
                for b in response.content if b.type == "tool_use"
            ]
            # Calculate cost
            event.cost_usd = calculate_cost(
                event.model, event.input_tokens,
                event.output_tokens, event.cache_read_tokens,
            )
            return response
        except Exception as e:
            event.error = f"{type(e).__name__}: {e}"
            raise
        finally:
            event.latency_ms = (time.time() - start) * 1000
            logger.info(json.dumps(asdict(event)))
```
### Usage in an agent

```python
client = ObservableAnthropicClient(
    session_id="conv_abc123",
    user_id="user_789",
)

response = client.create(
    trace_id="turn_1",
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this report..."}],
)
```
Output log:

```json
{
  "event_type": "llm_call",
  "trace_id": "turn_1",
  "session_id": "conv_abc123",
  "user_id": "user_789",
  "model": "claude-sonnet-4-5",
  "input_tokens": 1842,
  "output_tokens": 312,
  "cache_read_tokens": 1200,
  "cache_write_tokens": 0,
  "latency_ms": 2341.5,
  "stop_reason": "end_turn",
  "tool_calls": [],
  "cost_usd": 0.010566,
  "error": ""
}
```
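Once events land as one JSON object per line, anomaly checks become small scripts. Here is a sketch that flags possibly-looping sessions by turn count, assuming the events above are collected into a JSONL file; the path and the 15-turn threshold are illustrative:

```python
import json
from collections import Counter

def sessions_over_turn_limit(log_path: str, max_turns: int = 15) -> list[str]:
    """Return session IDs whose llm_call count exceeds max_turns (possible loops)."""
    turns = Counter()
    with open(log_path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON log lines (e.g., framework noise)
            if event.get("event_type") == "llm_call":
                turns[event["session_id"]] += 1
    return [sid for sid, count in turns.items() if count > max_turns]
```

Run it from a cron job against the previous hour's logs and page on any non-empty result.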
## Layer 2: Conversation Tracing
Link every turn in a conversation with a consistent conversation ID so you can replay the full context when debugging.
```python
import json
from pathlib import Path


class ConversationTrace:
    """Records a complete conversation for debugging."""

    def __init__(self, conversation_id: str, user_id: str, log_dir: str = "/tmp/traces"):
        self.conversation_id = conversation_id
        self.user_id = user_id
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.turns = []

    def record_turn(self, turn_num: int, messages: list, response_content: list,
                    usage: dict, tool_results: list = None):
        """Record a single conversation turn."""
        self.turns.append({
            "turn": turn_num,
            "messages_sent": len(messages),
            "response_content_types": [b.get("type") if isinstance(b, dict) else b.type
                                       for b in response_content],
            "usage": usage,
            "tool_calls": [b.name for b in response_content
                           if hasattr(b, "type") and b.type == "tool_use"],
            "tool_results_count": len(tool_results) if tool_results else 0,
        })

    def save(self):
        """Persist the trace to disk."""
        trace_path = self.log_dir / f"trace_{self.conversation_id}.json"
        trace_path.write_text(json.dumps({
            "conversation_id": self.conversation_id,
            "user_id": self.user_id,
            "total_turns": len(self.turns),
            "turns": self.turns,
        }, indent=2))
        return str(trace_path)


# Usage in an agent loop; agent_loop is assumed to yield the message
# list sent on each turn
trace = ConversationTrace(conversation_id="conv_abc123", user_id="user_789")
for turn_num, messages in enumerate(agent_loop):
    response = client.messages.create(...)
    trace.record_turn(
        turn_num=turn_num,
        messages=messages,
        response_content=response.content,
        usage={"input": response.usage.input_tokens, "output": response.usage.output_tokens},
    )
trace_path = trace.save()
```
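Reading a saved trace back is a one-liner, but a small summary printer makes post-hoc debugging faster. A sketch that assumes traces were written by the `save()` method above, to the same default directory:

```python
import json
from pathlib import Path

def replay_trace(conversation_id: str, log_dir: str = "/tmp/traces") -> None:
    """Print a turn-by-turn summary of a saved conversation trace."""
    trace_path = Path(log_dir) / f"trace_{conversation_id}.json"
    trace = json.loads(trace_path.read_text())
    print(f"Conversation {trace['conversation_id']} "
          f"(user {trace['user_id']}, {trace['total_turns']} turns)")
    for t in trace["turns"]:
        tools = ", ".join(t["tool_calls"]) or "none"
        print(f"  turn {t['turn']}: {t['messages_sent']} messages sent, "
              f"tools: {tools}, usage: {t['usage']}")
```

A sudden jump in `messages_sent` between turns usually points at oversized tool results accumulating in context.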
## Layer 3: Cost Dashboard
Aggregate token usage by user and time window so you can see spend building up before the bill arrives.
### Simple daily cost aggregator

```python
import sqlite3
from datetime import date


class CostTracker:
    """SQLite-backed cost tracker for per-user API spend."""

    def __init__(self, db_path: str = "/tmp/agent_costs.db"):
        self.conn = sqlite3.connect(db_path)
        self._init_db()

    def _init_db(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS usage (
                date TEXT,
                user_id TEXT,
                model TEXT,
                input_tokens INTEGER,
                output_tokens INTEGER,
                cache_read_tokens INTEGER,
                cost_usd REAL,
                PRIMARY KEY (date, user_id, model)
            )
        """)
        self.conn.execute("""
            CREATE INDEX IF NOT EXISTS idx_user_date ON usage (user_id, date)
        """)
        self.conn.commit()

    def record(self, user_id: str, model: str, input_tokens: int,
               output_tokens: int, cache_read_tokens: int, cost_usd: float):
        today = date.today().isoformat()
        self.conn.execute("""
            INSERT INTO usage (date, user_id, model, input_tokens, output_tokens,
                               cache_read_tokens, cost_usd)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT (date, user_id, model) DO UPDATE SET
                input_tokens = input_tokens + excluded.input_tokens,
                output_tokens = output_tokens + excluded.output_tokens,
                cache_read_tokens = cache_read_tokens + excluded.cache_read_tokens,
                cost_usd = cost_usd + excluded.cost_usd
        """, (today, user_id, model, input_tokens, output_tokens, cache_read_tokens, cost_usd))
        self.conn.commit()

    def get_daily_cost(self, user_id: str, days: int = 7) -> list:
        rows = self.conn.execute("""
            SELECT date, SUM(cost_usd) as total_cost,
                   SUM(input_tokens) as total_input,
                   SUM(output_tokens) as total_output
            FROM usage
            WHERE user_id = ?
              AND date >= date('now', ? || ' days')
            GROUP BY date
            ORDER BY date DESC
        """, (user_id, f"-{days}")).fetchall()
        return [{"date": r[0], "cost_usd": r[1], "input_tokens": r[2], "output_tokens": r[3]}
                for r in rows]

    def get_top_users_today(self, limit: int = 10) -> list:
        today = date.today().isoformat()
        rows = self.conn.execute("""
            SELECT user_id, SUM(cost_usd) as total_cost
            FROM usage
            WHERE date = ?
            GROUP BY user_id
            ORDER BY total_cost DESC
            LIMIT ?
        """, (today, limit)).fetchall()
        return [{"user_id": r[0], "cost_usd": r[1]} for r in rows]
```
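With the tracker in place, a scheduled job can turn `get_top_users_today()` into alerts. A minimal sketch; the $5/day threshold is illustrative and the print statement is a placeholder for your own notification hook (Slack, PagerDuty, email):

```python
def check_cost_alerts(tracker, daily_limit_usd: float = 5.0) -> list[dict]:
    """Return users whose spend today exceeds the limit; wire this to your alerting."""
    over_limit = [
        u for u in tracker.get_top_users_today(limit=100)
        if u["cost_usd"] > daily_limit_usd
    ]
    for user in over_limit:
        # Placeholder notification: replace with your real alerting hook
        print(f"ALERT: {user['user_id']} spent ${user['cost_usd']:.2f} today")
    return over_limit
```

Run it every few minutes rather than daily: a runaway loop can burn through a day's budget in an hour.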
## Key Metrics to Track
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Cache hit rate | A low rate means caching isn't reducing cost | < 60% on stable agents |
| Average turns per conversation | High turns = possible loops or vague prompts | > 15 turns avg |
| P95 latency | User experience | > 10s for sync calls |
| Cost per conversation | Business unit economics | > $0.50 for most use cases |
| Tool call failure rate | Reliability signal | > 5% failures |
| Error rate (429, 529) | Rate limit / overload issues | > 1% |
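Cache hit rate isn't reported as a single field; one way to derive it from the logged token counts, assuming (as the logger above does) that `input_tokens` excludes cached tokens, so the full prompt is input plus cache reads:

```python
def cache_hit_rate(input_tokens: int, cache_read_tokens: int) -> float:
    """Fraction of prompt tokens served from cache for one call.

    Assumes input_tokens counts only non-cached tokens, so the
    total prompt size is input_tokens + cache_read_tokens.
    """
    total = input_tokens + cache_read_tokens
    return cache_read_tokens / total if total else 0.0
```

Average this per conversation type rather than globally: a new conversation's first turn always misses, so cold-start traffic drags the global number down.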
## Frequently Asked Questions
**What should I log for every Claude API call?** At minimum: trace/conversation ID, user ID, model, input tokens, output tokens, cache hit tokens, latency in ms, stop reason, tool calls made, and cost in USD. This gives you everything needed to debug a bad response and understand cost drivers.
**How do I debug a bad agent response after the fact?** With a conversation trace saved to disk, you can replay the exact messages sent and responses received. Look at: how many turns it took, which tools were called, and whether the context accumulated unexpected content.
**What's a healthy prompt cache hit rate?** For agents with consistent system prompts, expect 85-95% cache hit rate after the first few requests warm the cache. Below 60% suggests the cache_control isn't configured correctly or the system prompt is varying unexpectedly.
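If the hit rate is low, check how the system prompt is marked. A sketch of a cacheable system block using the API's `cache_control` field (the prompt text is illustrative):

```python
# A cacheable system block: stable text plus a cache_control marker.
# Everything up to and including the marked block is written to cache on
# the first call and read from cache on later calls with an identical prefix,
# so keep per-request text (dates, user names) out of it.
CACHED_SYSTEM = [{
    "type": "text",
    "text": "You are a support agent. <long, stable instructions here>",
    "cache_control": {"type": "ephemeral"},
}]

# Passed as system=CACHED_SYSTEM to client.messages.create(...)
```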
**How do I set cost alerts?** Track daily spend per user and send an alert when any user exceeds a daily threshold (e.g., $5/day). At the account level, set hard limits in the Anthropic Console under Workspace settings.
**Should I use LangSmith or build my own observability?** LangSmith is excellent for LangChain users and provides out-of-the-box trace visualization. For native Anthropic SDK users, the lightweight logging patterns above are sufficient for most products and avoid adding another dependency. Use LangSmith if you need advanced evaluation features or already use LangChain.
## Related Guides
- How to Handle Errors and Retries in Claude Agent SDK — Error handling patterns
- Claude Agent SDK: Build Automation Agents — Full SDK guide
- Token Counting: Why Your Estimates Are Wrong — Token math
## Go Deeper
Agent SDK Cookbook — $49 — Production observability patterns including the full logging stack, cost dashboard implementation, conversation replay tool, and alerting setup. Python and TypeScript versions included.
→ Get the Agent SDK Cookbook — $49
30-day money-back guarantee. Instant download.