Testing and Evaluating Claude Agents: A Production Guide
Many Claude agents ship without automated tests, and teams usually regret it the first time a prompt change silently breaks a production workflow. A complete agent testing strategy has four layers: unit tests for tool call logic, integration tests for multi-turn conversation flows, an eval harness that measures output quality on a fixed dataset, and regression runs of that eval before every deploy. This guide covers the full testing stack for production Claude agents.
Why Agent Testing Is Different
Standard software testing verifies deterministic behavior: input A always produces output B. Agent testing has a different challenge: LLM outputs are probabilistic. You can't assert exact string equality — you need to assert properties of the output.
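For example, rather than comparing a response to a golden string, assert the properties that matter. The sketch below assumes a hypothetical summarize_ticket agent function; the exact wording of its output will differ between runs, but these properties should hold:

# Sketch: property-based assertions against a hypothetical summarize_ticket agent.
def test_summary_has_required_properties():
    output = summarize_ticket("Customer reports login failures since the 4.2 release")
    assert "login" in output.lower()           # mentions the core issue
    assert len(output) < 500                   # stays concise
    assert "as an ai" not in output.lower()    # no boilerplate disclaimers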
The testing hierarchy for agents:
- Unit tests: Test your tool implementations independently (deterministic, easy)
- Integration tests: Test the full agent loop with mocked or real API calls (semi-deterministic)
- Eval harness: Measure output quality on a representative dataset (probabilistic, scored)
- Regression tests: Run the eval before every deploy, alert on quality drops (ongoing)
Layer 1: Unit Testing Tool Implementations
Tool implementations are regular functions — test them like any other code.
import pytest
from unittest.mock import patch, MagicMock
from your_agent.tools import search_database, format_invoice, validate_input
class TestSearchDatabaseTool:
"""Test the tool implementation independently of Claude."""
def test_returns_results_for_valid_query(self):
with patch("your_agent.tools.db") as mock_db:
mock_db.execute.return_value = [
{"id": 1, "name": "Test User", "email": "test@example.com"}
]
result = search_database(query="test", limit=10)
assert len(result) == 1
assert result[0]["name"] == "Test User"
def test_returns_empty_list_for_no_results(self):
with patch("your_agent.tools.db") as mock_db:
mock_db.execute.return_value = []
result = search_database(query="nonexistent", limit=10)
assert result == []
def test_raises_on_invalid_limit(self):
with pytest.raises(ValueError, match="limit must be positive"):
search_database(query="test", limit=-1)
    def test_sanitizes_sql_injection_attempt(self):
        """Tool should handle malicious input gracefully."""
        with patch("your_agent.tools.db") as mock_db:
            mock_db.execute.return_value = []
            result = search_database(query="'; DROP TABLE users; --", limit=10)
            # Should not raise, should return empty or sanitized results
            assert isinstance(result, list)
class TestFormatInvoiceTool:
def test_formats_standard_invoice(self):
invoice_data = {
"vendor": "Acme Corp",
"amount": 1500.00,
"date": "2026-04-28",
"items": [{"description": "Consulting", "qty": 10, "price": 150.0}]
}
result = format_invoice(invoice_data)
assert "Acme Corp" in result
assert "$1,500.00" in result or "1500" in result
def test_handles_missing_optional_fields(self):
minimal_invoice = {"vendor": "Test", "amount": 100.0, "date": "2026-04-28"}
# Should not raise
result = format_invoice(minimal_invoice)
assert result is not None
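The imports at the top of this test module also pull in validate_input, which deserves the same coverage. The sketch below continues the same module; the accepted and rejected inputs are assumptions about your validator's rules, not documented behavior:

class TestValidateInputTool:
    """Sketch only: adjust the example inputs to match your actual validation rules."""
    @pytest.mark.parametrize("value", ["hello", "user@example.com", "order 1234"])
    def test_accepts_reasonable_input(self, value):
        # Assumes validate_input returns a truthy cleaned value for valid input
        assert validate_input(value)
    @pytest.mark.parametrize("value", ["", None, "x" * 100_000])
    def test_rejects_empty_or_oversized_input(self, value):
        # Assumes invalid input raises rather than passing through silently
        with pytest.raises((ValueError, TypeError)):
            validate_input(value)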
Layer 2: Integration Tests for the Agent Loop
Integration tests verify that the agent orchestrates tools correctly across a multi-turn conversation. Use recorded responses or a mock client to make tests deterministic.
Approach A: Mock the Anthropic client
import anthropic
from unittest.mock import MagicMock, patch
from your_agent.agent import run_agent
def make_mock_response(text=None, tool_name=None, tool_input=None, stop_reason="end_turn"):
"""Build a mock anthropic.Message object."""
response = MagicMock()
response.stop_reason = stop_reason
response.usage = MagicMock(input_tokens=100, output_tokens=50)
if tool_name:
tool_block = MagicMock()
tool_block.type = "tool_use"
tool_block.name = tool_name
tool_block.id = "tool_abc123"
tool_block.input = tool_input or {}
response.content = [tool_block]
else:
text_block = MagicMock()
text_block.type = "text"
text_block.text = text or "Done."
response.content = [text_block]
return response
class TestAgentOrchestration:
@patch("your_agent.agent.anthropic.Anthropic")
def test_agent_calls_search_tool_when_asked(self, mock_anthropic_class):
"""Agent should call search_database tool for search requests."""
mock_client = MagicMock()
mock_anthropic_class.return_value = mock_client
# First call: Claude decides to use search tool
# Second call: Claude synthesizes the result
mock_client.messages.create.side_effect = [
make_mock_response(
tool_name="search_database",
tool_input={"query": "active users", "limit": 10},
stop_reason="tool_use"
),
make_mock_response(text="I found 3 active users matching your query."),
]
with patch("your_agent.agent.search_database") as mock_search:
mock_search.return_value = [
{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Carol"}
]
result = run_agent("Find active users")
# Verify search was called
mock_search.assert_called_once_with(query="active users", limit=10)
assert "found" in result.lower() or "3" in result
@patch("your_agent.agent.anthropic.Anthropic")
def test_agent_handles_tool_error_gracefully(self, mock_anthropic_class):
"""Agent should recover when a tool raises an exception."""
mock_client = MagicMock()
mock_anthropic_class.return_value = mock_client
mock_client.messages.create.side_effect = [
make_mock_response(
tool_name="search_database",
tool_input={"query": "test"},
stop_reason="tool_use"
),
make_mock_response(text="I wasn't able to search the database. Please try again."),
]
with patch("your_agent.agent.search_database") as mock_search:
mock_search.side_effect = Exception("Database connection failed")
result = run_agent("Search for test")
# Agent should respond gracefully, not crash
assert result is not None
assert isinstance(result, str)
@patch("your_agent.agent.anthropic.Anthropic")
    def test_agent_respects_turn_limit(self, mock_anthropic_class):
        """Agent should stop at max_turns instead of looping indefinitely."""
mock_client = MagicMock()
mock_anthropic_class.return_value = mock_client
# Return tool_use indefinitely
mock_client.messages.create.return_value = make_mock_response(
tool_name="search_database",
tool_input={"query": "loop"},
stop_reason="tool_use"
)
with patch("your_agent.agent.search_database", return_value=[]):
result = run_agent("Keep searching", max_turns=5)
# Should stop at max_turns, not loop forever
assert mock_client.messages.create.call_count <= 5
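Each test above rebuilds the same mocked client. Once the suite grows, a shared pytest fixture keeps that setup in one place. A sketch for a conftest.py, assuming the same your_agent.agent module path used above:

# conftest.py: shared fixture for the mocked Anthropic client (sketch)
import pytest
from unittest.mock import MagicMock, patch
@pytest.fixture
def mock_claude():
    """Patch the Anthropic client inside your_agent.agent and yield the mock."""
    with patch("your_agent.agent.anthropic.Anthropic") as mock_class:
        mock_client = MagicMock()
        mock_class.return_value = mock_client
        yield mock_client

Tests then accept mock_claude as an argument and set messages.create.side_effect directly, without repeating the patch decorator in every test.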
Approach B: Record and replay real API responses
import json
from pathlib import Path

import anthropic
class RecordedAnthropicClient:
"""Client that records real API calls and can replay them."""
def __init__(self, record_path: str, mode: str = "replay"):
self.record_path = Path(record_path)
self.mode = mode # "record" or "replay"
self._calls = []
self._index = 0
if mode == "replay" and self.record_path.exists():
self._calls = json.loads(self.record_path.read_text())
def messages_create(self, **kwargs):
if self.mode == "record":
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(**kwargs)
self._calls.append(response.model_dump())
return response
else:
if self._index >= len(self._calls):
raise RuntimeError(f"No more recorded responses. Index {self._index}, have {len(self._calls)}")
call_data = self._calls[self._index]
self._index += 1
# Reconstruct response object from recorded data
return anthropic.types.Message(**call_data)
def save_recording(self):
self.record_path.write_text(json.dumps(self._calls, indent=2))
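To use the recorded client in a test, the agent needs to accept an injected client and call its messages_create method; both are assumptions about your run_agent signature rather than something the class above guarantees. A sketch:

# Sketch: replay a recorded conversation deterministically in CI.
def test_agent_replays_recorded_search_flow():
    client = RecordedAnthropicClient("tests/recordings/search_flow.json", mode="replay")
    result = run_agent("Find active users", client=client)
    assert "active" in result.lower()
# To (re)create the recording, run the same flow once against the real API:
#   client = RecordedAnthropicClient("tests/recordings/search_flow.json", mode="record")
#   run_agent("Find active users", client=client)
#   client.save_recording()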
Layer 3: Eval Harness for Output Quality
An eval harness runs your agent against a fixed set of test cases with known correct answers, scores each output, and tracks quality over time.
import json
import re
from dataclasses import dataclass, field
from typing import Callable
from pathlib import Path
@dataclass
class EvalCase:
"""A single evaluation test case."""
id: str
input: str
expected_properties: dict # Properties the output should have
context: dict = field(default_factory=dict) # Optional extra context
@dataclass
class EvalResult:
case_id: str
output: str
scores: dict
passed: bool
failure_reasons: list
class AgentEvalHarness:
"""Run eval cases and score agent outputs."""
def __init__(self, agent_fn: Callable[[str], str]):
self.agent_fn = agent_fn
self.results = []
def run(self, cases: list[EvalCase]) -> dict:
"""Run all eval cases and return aggregate scores."""
self.results = []
for case in cases:
try:
output = self.agent_fn(case.input)
result = self._score(case, output)
except Exception as e:
result = EvalResult(
case_id=case.id,
output=f"ERROR: {e}",
scores={},
passed=False,
failure_reasons=[f"Agent raised exception: {e}"]
)
self.results.append(result)
return self._aggregate()
def _score(self, case: EvalCase, output: str) -> EvalResult:
scores = {}
failures = []
for check_name, check_value in case.expected_properties.items():
            if check_name == "contains":
                # Output must contain every specified string
                missing = [p for p in check_value if p.lower() not in output.lower()]
                for phrase in missing:
                    failures.append(f"Missing '{phrase}' in output")
                scores["contains"] = not missing
elif check_name == "not_contains":
for phrase in check_value:
if phrase.lower() in output.lower():
failures.append(f"Output contains forbidden phrase: '{phrase}'")
scores["not_contains"] = not any(
p.lower() in output.lower() for p in check_value
)
elif check_name == "min_length":
passes = len(output) >= check_value
if not passes:
failures.append(f"Output too short: {len(output)} < {check_value}")
scores["min_length"] = passes
elif check_name == "max_length":
passes = len(output) <= check_value
if not passes:
failures.append(f"Output too long: {len(output)} > {check_value}")
scores["max_length"] = passes
elif check_name == "json_valid":
try:
json.loads(output)
scores["json_valid"] = True
except json.JSONDecodeError:
failures.append("Output is not valid JSON")
scores["json_valid"] = False
elif check_name == "regex":
matches = bool(re.search(check_value, output, re.IGNORECASE))
if not matches:
failures.append(f"Output does not match regex: {check_value}")
scores["regex"] = matches
passed = len(failures) == 0
return EvalResult(
case_id=case.id,
output=output,
scores=scores,
passed=passed,
failure_reasons=failures
)
def _aggregate(self) -> dict:
total = len(self.results)
passed = sum(1 for r in self.results if r.passed)
return {
"total": total,
"passed": passed,
"failed": total - passed,
"pass_rate": passed / total if total > 0 else 0,
"results": [
{
"id": r.case_id,
"passed": r.passed,
"failures": r.failure_reasons
}
for r in self.results
]
}
# Example eval dataset
INVOICE_AGENT_EVALS = [
EvalCase(
id="basic_extraction",
input="Extract data from: Invoice #1234 from Acme Corp, $500 due 2026-05-01",
expected_properties={
"contains": ["Acme Corp", "500", "1234"],
"json_valid": True,
}
),
EvalCase(
id="missing_amount",
input="Extract: Invoice from Beta LLC dated today",
expected_properties={
"contains": ["Beta LLC"],
"not_contains": ["$0", "amount: null"],
}
),
EvalCase(
id="multi_item_invoice",
input="Parse: Invoice #9999, Items: Widget A x2 @$25, Widget B x1 @$50. Total $100.",
expected_properties={
"contains": ["100", "9999"],
"min_length": 100,
}
),
]
Layer 4: Regression Testing in CI
Run the eval harness before every deploy to catch prompt regressions.
# scripts/run_evals.py — run in CI before deployment
import sys
import json
from your_agent.agent import run_agent
from your_agent.evals import INVOICE_AGENT_EVALS, AgentEvalHarness
PASS_THRESHOLD = 0.85 # Fail CI if pass rate drops below 85%
def main():
harness = AgentEvalHarness(agent_fn=run_agent)
results = harness.run(INVOICE_AGENT_EVALS)
print(f"\nEval Results: {results['passed']}/{results['total']} passed ({results['pass_rate']:.1%})")
for r in results["results"]:
status = "✅" if r["passed"] else "❌"
print(f" {status} {r['id']}")
if r["failures"]:
for f in r["failures"]:
print(f" ↳ {f}")
if results["pass_rate"] < PASS_THRESHOLD:
print(f"\nFAILED: pass rate {results['pass_rate']:.1%} < threshold {PASS_THRESHOLD:.1%}")
sys.exit(1)
else:
print(f"\nPASSED: pass rate {results['pass_rate']:.1%} >= threshold {PASS_THRESHOLD:.1%}")
sys.exit(0)
if __name__ == "__main__":
main()
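Layer 3 described tracking quality over time. A lightweight option is to append each run's aggregate numbers to a history file before the script exits; the sketch below assumes an evals/history.jsonl path and can be called with results at the end of main():

# Sketch: append each eval run to a JSONL history file for pass-rate trends.
import json
import time
from pathlib import Path
def record_eval_history(results: dict, history_file: str = "evals/history.jsonl") -> None:
    path = Path(history_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "timestamp": time.time(),
        "total": results["total"],
        "passed": results["passed"],
        "pass_rate": results["pass_rate"],
    }
    with path.open("a") as f:
        f.write(json.dumps(entry) + "\n")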
# .github/workflows/eval.yml
name: Agent Eval
on:
pull_request:
paths:
- 'your_agent/**'
- 'prompts/**'
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.12'
- run: pip install -r requirements.txt
- run: python scripts/run_evals.py
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Eval Dataset Management
Your eval dataset is a first-class artifact — version it, grow it, and review it.
Rules for a good eval dataset:
- 20-50 cases for MVP: Enough to detect regressions without being expensive to run
- Represent failure modes: Include edge cases your agent has failed on in production
- Cover the distribution: Easy, medium, and hard cases in proportion to production traffic
- Freeze, don't tweak: When a case fails, fix the agent — not the eval
- Grow from production: When a user reports a bad response, add it as an eval case
# Add a production failure as an eval case
import json
from pathlib import Path

def add_eval_from_production_failure(
user_input: str,
bad_output: str,
expected_properties: dict,
case_id: str,
eval_file: str = "evals/cases.json"
):
"""Record a production failure as a regression test."""
new_case = {
"id": case_id,
"input": user_input,
"expected_properties": expected_properties,
"added_from": "production",
"example_bad_output": bad_output[:200] # For reference
}
path = Path(eval_file)
cases = json.loads(path.read_text()) if path.exists() else []
cases.append(new_case)
path.write_text(json.dumps(cases, indent=2))
print(f"Added eval case: {case_id}")
Frequently Asked Questions
Do I need to test Claude agents if I'm not changing the underlying model? Yes — prompts change, tools change, and system prompts drift. Any change to the agent's inputs can change outputs. Regression tests catch prompt regressions that would otherwise only surface in production.
Should I use real API calls in my eval harness or mock responses? Both. Unit tests and integration tests should use mocks for speed and reliability. The eval harness should use real API calls to measure actual output quality — mocking defeats the purpose of evaluating model behavior.
How many eval cases do I need? Start with 20-30 cases covering your main use cases and known failure modes. Add cases from production failures as they occur. At 50+ cases with 85%+ pass rate, you have solid regression coverage.
What's the right pass threshold? Start at 80%. After your first month of running evals, raise it toward 90% as you understand where the model reliably succeeds. Don't set it at 100% — LLM outputs are probabilistic and minor variation is expected.
How do I evaluate subjective output quality (tone, helpfulness)? Use a grader — a second Claude call that rates the output on a 1-5 scale with structured criteria. This is called LLM-as-judge evaluation. It adds cost but scales to subjective quality checks.
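A minimal grader sketch follows; the rubric, model name, and score parsing are placeholders to adapt to your own quality criteria:

# Sketch: LLM-as-judge grading for subjective qualities like tone and helpfulness.
import re
import anthropic
def grade_output(user_input: str, agent_output: str) -> int:
    """Ask Claude to rate the reply from 1 (poor) to 5 (excellent)."""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model choice
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": (
                "Rate the assistant reply below from 1 (poor) to 5 (excellent) for "
                "helpfulness, accuracy, and professional tone. Respond with only the number.\n\n"
                f"User request:\n{user_input}\n\nAssistant reply:\n{agent_output}"
            ),
        }],
    )
    match = re.search(r"[1-5]", response.content[0].text)
    return int(match.group()) if match else 1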
Related Guides
- How to Handle Errors and Retries in Claude Agent SDK — Error handling
- Claude Agent Observability — Logging and tracing
- Claude Agent SDK: Build Automation Agents — Full SDK guide
Go Deeper
Agent SDK Cookbook — $49 — Full testing infrastructure: eval harness with LLM-as-judge scoring, CI integration templates, recorded/replay client for deterministic integration tests, and 30 production eval datasets across common agent types.
→ Get the Agent SDK Cookbook — $49
30-day money-back guarantee. Instant download.