Testing and Evaluating Claude Agents: A Production Guide

How to test and evaluate Claude agents before shipping to production — unit tests for tool calls, integration tests for multi-turn conversations, and an eval harness for output quality.

Most Claude agents ship without any automated tests — and most teams regret it after a prompt change silently breaks a production workflow. A complete agent testing strategy has three layers: unit tests for tool call logic, integration tests for multi-turn conversation flows, and an eval harness that measures output quality on a fixed dataset before every deploy. This guide covers the full testing stack for production Claude agents.


Why Agent Testing Is Different

Standard software testing verifies deterministic behavior: input A always produces output B. Agent testing has a different challenge: LLM outputs are probabilistic. You can't assert exact string equality — you need to assert properties of the output.
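For example, a property-style assertion helper — a minimal sketch; the specific properties checked here are illustrative, not prescriptive:

```python
def assert_output_properties(output: str) -> None:
    """Assert properties of an agent response instead of exact text.

    Exact-match assertions (output == "I found 3 active users.") break on
    harmless rephrasing; property checks survive it.
    """
    assert isinstance(output, str)
    assert len(output) > 20, "response should be substantive, not empty"
    assert "user" in output.lower(), "response should be on topic"
    assert "traceback" not in output.lower(), "no leaked stack traces"
```

The same idea generalizes to length bounds, required keywords, forbidden phrases, and valid JSON — which an eval harness can formalize into scored checks.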

The testing hierarchy for agents:

  1. Unit tests: Test your tool implementations independently (deterministic, easy)
  2. Integration tests: Test the full agent loop with mocked or real API calls (semi-deterministic)
  3. Eval harness: Measure output quality on a representative dataset (probabilistic, scored)
  4. Regression tests: Run the eval before every deploy, alert on quality drops (ongoing)
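One way to wire these layers into a test runner is pytest markers — a sketch assuming a pytest-based suite; the marker names are illustrative:

```ini
# pytest.ini — fast layers run on every commit, evals only pre-deploy
[pytest]
markers =
    unit: fast, deterministic tool tests (every commit)
    integration: agent-loop tests with mocked clients (every commit)
    eval: probabilistic quality evals against the real API (pre-deploy only)
```

CI can then run `pytest -m "unit or integration"` on every push and `pytest -m eval` as a deploy gate.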

Layer 1: Unit Testing Tool Implementations

Tool implementations are regular functions — test them like any other code.

import pytest
from unittest.mock import patch, MagicMock
from your_agent.tools import search_database, format_invoice, validate_input


class TestSearchDatabaseTool:
    """Test the tool implementation independently of Claude."""

    def test_returns_results_for_valid_query(self):
        with patch("your_agent.tools.db") as mock_db:
            mock_db.execute.return_value = [
                {"id": 1, "name": "Test User", "email": "test@example.com"}
            ]
            result = search_database(query="test", limit=10)

        assert len(result) == 1
        assert result[0]["name"] == "Test User"

    def test_returns_empty_list_for_no_results(self):
        with patch("your_agent.tools.db") as mock_db:
            mock_db.execute.return_value = []
            result = search_database(query="nonexistent", limit=10)

        assert result == []

    def test_raises_on_invalid_limit(self):
        with pytest.raises(ValueError, match="limit must be positive"):
            search_database(query="test", limit=-1)

    def test_sanitizes_sql_injection_attempt(self):
        """Tool should handle malicious input gracefully."""
        result = search_database(query="'; DROP TABLE users; --", limit=10)
        # Should not raise, should return empty or sanitized results
        assert isinstance(result, list)


class TestFormatInvoiceTool:
    def test_formats_standard_invoice(self):
        invoice_data = {
            "vendor": "Acme Corp",
            "amount": 1500.00,
            "date": "2026-04-28",
            "items": [{"description": "Consulting", "qty": 10, "price": 150.0}]
        }
        result = format_invoice(invoice_data)

        assert "Acme Corp" in result
        assert "$1,500.00" in result or "1500" in result

    def test_handles_missing_optional_fields(self):
        minimal_invoice = {"vendor": "Test", "amount": 100.0, "date": "2026-04-28"}
        # Should not raise
        result = format_invoice(minimal_invoice)
        assert result is not None

Layer 2: Integration Tests for the Agent Loop

Integration tests verify that the agent orchestrates tools correctly across a multi-turn conversation. Use recorded responses or a mock client to make tests deterministic.

Approach A: Mock the Anthropic client

import anthropic
from unittest.mock import MagicMock, patch
from your_agent.agent import run_agent


def make_mock_response(text=None, tool_name=None, tool_input=None, stop_reason="end_turn"):
    """Build a mock anthropic.Message object."""
    response = MagicMock()
    response.stop_reason = stop_reason
    response.usage = MagicMock(input_tokens=100, output_tokens=50)

    if tool_name:
        tool_block = MagicMock()
        tool_block.type = "tool_use"
        tool_block.name = tool_name
        tool_block.id = "tool_abc123"
        tool_block.input = tool_input or {}
        response.content = [tool_block]
    else:
        text_block = MagicMock()
        text_block.type = "text"
        text_block.text = text or "Done."
        response.content = [text_block]

    return response


class TestAgentOrchestration:

    @patch("your_agent.agent.anthropic.Anthropic")
    def test_agent_calls_search_tool_when_asked(self, mock_anthropic_class):
        """Agent should call search_database tool for search requests."""
        mock_client = MagicMock()
        mock_anthropic_class.return_value = mock_client

        # First call: Claude decides to use search tool
        # Second call: Claude synthesizes the result
        mock_client.messages.create.side_effect = [
            make_mock_response(
                tool_name="search_database",
                tool_input={"query": "active users", "limit": 10},
                stop_reason="tool_use"
            ),
            make_mock_response(text="I found 3 active users matching your query."),
        ]

        with patch("your_agent.agent.search_database") as mock_search:
            mock_search.return_value = [
                {"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Carol"}
            ]
            result = run_agent("Find active users")

        # Verify search was called
        mock_search.assert_called_once_with(query="active users", limit=10)
        assert "found" in result.lower() or "3" in result

    @patch("your_agent.agent.anthropic.Anthropic")
    def test_agent_handles_tool_error_gracefully(self, mock_anthropic_class):
        """Agent should recover when a tool raises an exception."""
        mock_client = MagicMock()
        mock_anthropic_class.return_value = mock_client

        mock_client.messages.create.side_effect = [
            make_mock_response(
                tool_name="search_database",
                tool_input={"query": "test"},
                stop_reason="tool_use"
            ),
            make_mock_response(text="I wasn't able to search the database. Please try again."),
        ]

        with patch("your_agent.agent.search_database") as mock_search:
            mock_search.side_effect = Exception("Database connection failed")
            result = run_agent("Search for test")

        # Agent should respond gracefully, not crash
        assert result is not None
        assert isinstance(result, str)

    @patch("your_agent.agent.anthropic.Anthropic")
    def test_agent_stops_before_turn_limit(self, mock_anthropic_class):
        """Agent should not loop indefinitely."""
        mock_client = MagicMock()
        mock_anthropic_class.return_value = mock_client

        # Return tool_use indefinitely
        mock_client.messages.create.return_value = make_mock_response(
            tool_name="search_database",
            tool_input={"query": "loop"},
            stop_reason="tool_use"
        )

        with patch("your_agent.agent.search_database", return_value=[]):
            result = run_agent("Keep searching", max_turns=5)

        # Should stop at max_turns, not loop forever
        assert mock_client.messages.create.call_count <= 5

Approach B: Record and replay real API responses

import json
from pathlib import Path


class RecordedAnthropicClient:
    """Client that records real API calls and can replay them."""

    def __init__(self, record_path: str, mode: str = "replay"):
        self.record_path = Path(record_path)
        self.mode = mode  # "record" or "replay"
        self._calls = []
        self._index = 0

        if mode == "replay" and self.record_path.exists():
            self._calls = json.loads(self.record_path.read_text())

    def messages_create(self, **kwargs):
        import anthropic  # needed in both modes; local so module import stays cheap

        if self.mode == "record":
            client = anthropic.Anthropic()
            response = client.messages.create(**kwargs)
            self._calls.append(response.model_dump())
            return response
        else:
            if self._index >= len(self._calls):
                raise RuntimeError(
                    f"No more recorded responses. Index {self._index}, have {len(self._calls)}"
                )
            call_data = self._calls[self._index]
            self._index += 1
            # Reconstruct a Message object from the recorded data
            return anthropic.types.Message(**call_data)

    def save_recording(self):
        self.record_path.write_text(json.dumps(self._calls, indent=2))

Layer 3: Eval Harness for Output Quality

An eval harness runs your agent against a fixed set of test cases with known correct answers, scores each output, and tracks quality over time.

import json
import re
from dataclasses import dataclass, field
from typing import Callable
from pathlib import Path


@dataclass
class EvalCase:
    """A single evaluation test case."""
    id: str
    input: str
    expected_properties: dict  # Properties the output should have
    context: dict = field(default_factory=dict)  # Optional extra context


@dataclass
class EvalResult:
    case_id: str
    output: str
    scores: dict
    passed: bool
    failure_reasons: list


class AgentEvalHarness:
    """Run eval cases and score agent outputs."""

    def __init__(self, agent_fn: Callable[[str], str]):
        self.agent_fn = agent_fn
        self.results = []

    def run(self, cases: list[EvalCase]) -> dict:
        """Run all eval cases and return aggregate scores."""
        self.results = []
        for case in cases:
            try:
                output = self.agent_fn(case.input)
                result = self._score(case, output)
            except Exception as e:
                result = EvalResult(
                    case_id=case.id,
                    output=f"ERROR: {e}",
                    scores={},
                    passed=False,
                    failure_reasons=[f"Agent raised exception: {e}"]
                )
            self.results.append(result)

        return self._aggregate()

    def _score(self, case: EvalCase, output: str) -> EvalResult:
        scores = {}
        failures = []

        for check_name, check_value in case.expected_properties.items():
            if check_name == "contains":
                # Output must contain all specified strings
                missing = [p for p in check_value if p.lower() not in output.lower()]
                for phrase in missing:
                    failures.append(f"Missing '{phrase}' in output")
                scores["contains"] = not missing

            elif check_name == "not_contains":
                forbidden = [p for p in check_value if p.lower() in output.lower()]
                for phrase in forbidden:
                    failures.append(f"Output contains forbidden phrase: '{phrase}'")
                scores["not_contains"] = not forbidden

            elif check_name == "min_length":
                passes = len(output) >= check_value
                if not passes:
                    failures.append(f"Output too short: {len(output)} < {check_value}")
                scores["min_length"] = passes

            elif check_name == "max_length":
                passes = len(output) <= check_value
                if not passes:
                    failures.append(f"Output too long: {len(output)} > {check_value}")
                scores["max_length"] = passes

            elif check_name == "json_valid":
                try:
                    json.loads(output)
                    scores["json_valid"] = True
                except json.JSONDecodeError:
                    failures.append("Output is not valid JSON")
                    scores["json_valid"] = False

            elif check_name == "regex":
                matches = bool(re.search(check_value, output, re.IGNORECASE))
                if not matches:
                    failures.append(f"Output does not match regex: {check_value}")
                scores["regex"] = matches

        passed = len(failures) == 0
        return EvalResult(
            case_id=case.id,
            output=output,
            scores=scores,
            passed=passed,
            failure_reasons=failures
        )

    def _aggregate(self) -> dict:
        total = len(self.results)
        passed = sum(1 for r in self.results if r.passed)
        return {
            "total": total,
            "passed": passed,
            "failed": total - passed,
            "pass_rate": passed / total if total > 0 else 0,
            "results": [
                {
                    "id": r.case_id,
                    "passed": r.passed,
                    "failures": r.failure_reasons
                }
                for r in self.results
            ]
        }


# Example eval dataset
INVOICE_AGENT_EVALS = [
    EvalCase(
        id="basic_extraction",
        input="Extract data from: Invoice #1234 from Acme Corp, $500 due 2026-05-01",
        expected_properties={
            "contains": ["Acme Corp", "500", "1234"],
            "json_valid": True,
        }
    ),
    EvalCase(
        id="missing_amount",
        input="Extract: Invoice from Beta LLC dated today",
        expected_properties={
            "contains": ["Beta LLC"],
            "not_contains": ["$0", "amount: null"],
        }
    ),
    EvalCase(
        id="multi_item_invoice",
        input="Parse: Invoice #9999, Items: Widget A x2 @$25, Widget B x1 @$50. Total $100.",
        expected_properties={
            "contains": ["100", "9999"],
            "min_length": 100,
        }
    ),
]

Layer 4: Regression Testing in CI

Run the eval harness before every deploy to catch prompt regressions.

# scripts/run_evals.py — run in CI before deployment
import sys
import json
from your_agent.agent import run_agent
from your_agent.evals import INVOICE_AGENT_EVALS, AgentEvalHarness

PASS_THRESHOLD = 0.85  # Fail CI if pass rate drops below 85%


def main():
    harness = AgentEvalHarness(agent_fn=run_agent)
    results = harness.run(INVOICE_AGENT_EVALS)

    print(f"\nEval Results: {results['passed']}/{results['total']} passed ({results['pass_rate']:.1%})")

    for r in results["results"]:
        status = "✅" if r["passed"] else "❌"
        print(f"  {status} {r['id']}")
        if r["failures"]:
            for f in r["failures"]:
                print(f"       ↳ {f}")

    if results["pass_rate"] < PASS_THRESHOLD:
        print(f"\nFAILED: pass rate {results['pass_rate']:.1%} < threshold {PASS_THRESHOLD:.1%}")
        sys.exit(1)
    else:
        print(f"\nPASSED: pass rate {results['pass_rate']:.1%} >= threshold {PASS_THRESHOLD:.1%}")
        sys.exit(0)


if __name__ == "__main__":
    main()

# .github/workflows/eval.yml
name: Agent Eval
on:
  pull_request:
    paths:
      - 'your_agent/**'
      - 'prompts/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - run: python scripts/run_evals.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Eval Dataset Management

Your eval dataset is a first-class artifact — version it, grow it, and review it.
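A version-controlled cases file might look like this — the format mirrors the EvalCase dataclass above; any field beyond those is an assumption:

```json
[
  {
    "id": "basic_extraction",
    "input": "Extract data from: Invoice #1234 from Acme Corp, $500 due 2026-05-01",
    "expected_properties": {
      "contains": ["Acme Corp", "500", "1234"],
      "json_valid": true
    },
    "added_from": "initial_dataset"
  }
]
```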

Rules for a good eval dataset:

  1. Version it with your code — a prompt change and its eval changes ship in the same commit.
  2. Grow it from production — every real failure becomes a regression case.
  3. Review it — retire cases that no longer reflect real usage.

# Add a production failure as an eval case
import json
from pathlib import Path


def add_eval_from_production_failure(
    user_input: str,
    bad_output: str,
    expected_properties: dict,
    case_id: str,
    eval_file: str = "evals/cases.json"
):
    """Record a production failure as a regression test."""
    new_case = {
        "id": case_id,
        "input": user_input,
        "expected_properties": expected_properties,
        "added_from": "production",
        "example_bad_output": bad_output[:200]  # For reference
    }

    path = Path(eval_file)
    cases = json.loads(path.read_text()) if path.exists() else []
    cases.append(new_case)
    path.write_text(json.dumps(cases, indent=2))
    print(f"Added eval case: {case_id}")

Frequently Asked Questions

Do I need to test Claude agents if I'm not changing the underlying model? Yes — prompts change, tools change, and system prompts drift. Any change to the agent's inputs can change outputs. Regression tests catch prompt regressions that would otherwise only surface in production.

Should I use real API calls in my eval harness or mock responses? Both. Unit tests and integration tests should use mocks for speed and reliability. The eval harness should use real API calls to measure actual output quality — mocking defeats the purpose of evaluating model behavior.
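To keep both in one suite, gate the real-API tests on credentials — a sketch assuming pytest; the marker and test names are hypothetical:

```python
import os

import pytest

# Skip real-API eval tests when no key is configured (e.g. on forks).
requires_api_key = pytest.mark.skipif(
    not os.environ.get("ANTHROPIC_API_KEY"),
    reason="eval tests require a real ANTHROPIC_API_KEY",
)


@requires_api_key
def test_eval_pass_rate():
    # Hypothetical: run the real-API eval harness here.
    ...
```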

How many eval cases do I need? Start with 20-30 cases covering your main use cases and known failure modes. Add cases from production failures as they occur. At 50+ cases with 85%+ pass rate, you have solid regression coverage.

What's the right pass threshold? Start at 80%. After your first month of running evals, raise it toward 90% as you understand where the model reliably succeeds. Don't set it at 100% — LLM outputs are probabilistic and minor variation is expected.

How do I evaluate subjective output quality (tone, helpfulness)? Use a grader — a second Claude call that rates the output on a 1-5 scale with structured criteria. This is called LLM-as-judge evaluation. It adds cost but scales to subjective quality checks.
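The grading call itself hits the API, but the deterministic halves — the rubric prompt and the score parser — can be sketched and unit-tested on their own (the prompt wording and 1-5 scale here are assumptions):

```python
import re

GRADER_PROMPT = """Rate the following agent response on a 1-5 scale for
helpfulness and tone. Reply with exactly one line: "Rating: <n>".

Response to grade:
{output}
"""


def parse_rating(grader_reply: str) -> int:
    """Extract the 1-5 rating from a grader reply; raise if it is absent."""
    match = re.search(r"Rating:\s*([1-5])\b", grader_reply)
    if not match:
        raise ValueError(f"No rating found in grader reply: {grader_reply!r}")
    return int(match.group(1))
```

In production, `GRADER_PROMPT.format(output=...)` goes into a second messages.create call and `parse_rating` scores the reply.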



Go Deeper

Agent SDK Cookbook — $49 — Full testing infrastructure: eval harness with LLM-as-judge scoring, CI integration templates, recorded/replay client for deterministic integration tests, and 30 production eval datasets across common agent types.

→ Get the Agent SDK Cookbook — $49

30-day money-back guarantee. Instant download.

AI Disclosure: Written with Claude Code; patterns tested in production agent deployments.