Claude API Evaluation: LLM-as-Judge, Golden Sets, Regression (2026)

Q: How do I eval agentic workflows?

Eval the trajectory not just the final output. Did Claude use the right tools? In the right order? With reasonable inputs? See Claude Agent Testing & Eval for multi-turn eval patterns.

You ship Claude features blind without evals. Production eval has 3 layers: a golden test set (50 prompts with expected outputs), LLM-as-judge for grading (Claude Sonnet evaluates Haiku/Sonnet outputs against rubric), and regression alerts (run on every prompt change). A 50-prompt eval suite costs ~$0.30 per run with Sonnet judge, catches ~90% of regressions before production. Without evals you're playing whack-a-mole — fixing one bug while breaking three. This guide is the framework that scales from solo dev to production team.

For Claude API basics see Python SDK Quickstart. For agent-specific testing see Claude Agent Testing & Eval.

Why Most Teams Skip Evals (And Why That's Wrong)

Common excuses:

"I'll test in production"
"Manual review is enough"
"Output is non-deterministic anyway"
"LLM-as-judge is biased"

Reality:

Production debugging costs 10x more than catching pre-deploy
Manual review misses 70%+ of regressions on outputs >50 tokens
Non-determinism is exactly why you need statistical eval (not single runs)
LLM-as-judge with Sonnet has 92% agreement with human raters on most tasks

The investment: 4 hours to set up, $1-5/month to run. The savings: catching one production bug pays for years of evals.

Layer 1: Golden Test Set

Curate 30-100 prompts that cover your distribution:

# evals/golden_set.json
[
  {
    "id": "extract-001",
    "category": "extraction",
    "input": "Total: $1,247.50 from invoice INV-2024-091",
    "expected": {"total": 1247.50, "currency": "USD", "invoice": "INV-2024-091"},
    "rubric": "JSON must match exactly. Total as number not string."
  },
  {
    "id": "summary-001",
    "category": "summary",
    "input": "<3-paragraph customer feedback>",
    "expected": null,  # no exact match — use rubric grading
    "rubric": "Summary captures: (1) main issue, (2) sentiment (positive/neutral/negative), (3) requested action. Under 50 words."
  },
  # ... 50 more
]

Distribution rule: golden set must match production traffic distribution. If 60% of real queries are extraction, 60% of golden set is extraction.

Layer 2: LLM-as-Judge

Use Claude Sonnet to evaluate outputs (Haiku is cheaper but less reliable as judge):

JUDGE_SYSTEM = """You are an evaluation judge. Score each response 0-10 against the rubric.

Output JSON only:
{
  "score": <0-10>,
  "passes": <true if score >= 7 else false>,
  "reasoning": "<one sentence>"
}

Be strict. A score of 7 means 'acceptable for production'. 9+ means 'excellent'."""

def grade_response(prompt: str, output: str, rubric: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=300,
        temperature=0,  # deterministic grading
        system=[{
            "type": "text",
            "text": JUDGE_SYSTEM,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{
            "role": "user",
            "content": f"""Prompt: {prompt}

Response to grade:
{output}

Rubric: {rubric}

Score this response."""
        }]
    )
    return json.loads(response.content[0].text)

Cost per grading: ~$0.003 (Sonnet input + 100 output tokens). 50-prompt suite = $0.15 for outputs + $0.15 for judging = $0.30 total.

Layer 3: Eval Runner

Tie golden set + LLM-as-judge into a runnable suite:

async def run_eval_suite(model: str, system_prompt: str) -> dict:
    results = []
    for test in golden_set:
        # Generate response with the model under test
        response = await client.messages.create(
            model=model,
            max_tokens=1024,
            temperature=0,
            system=system_prompt,
            messages=[{"role": "user", "content": test["input"]}]
        )
        output = response.content[0].text

        # Exact match check (if expected is set)
        if test["expected"] is not None:
            passes = output_matches_expected(output, test["expected"])
            grade = {"score": 10 if passes else 0, "passes": passes,
                     "reasoning": "exact match check"}
        else:
            # Rubric grading via Sonnet judge
            grade = grade_response(test["input"], output, test["rubric"])

        results.append({
            "id": test["id"],
            "category": test["category"],
            "passes": grade["passes"],
            "score": grade["score"],
            "output": output[:200],  # truncate for log
        })

    return {
        "model": model,
        "pass_rate": sum(r["passes"] for r in results) / len(results),
        "avg_score": sum(r["score"] for r in results) / len(results),
        "by_category": group_pass_rates_by_category(results),
        "failures": [r for r in results if not r["passes"]]
    }

Run on every prompt or system prompt change:

# CI command
python evals/run.py --model claude-sonnet-4-5 --threshold 0.85
# Exits non-zero if pass_rate < 0.85

Layer 4: Regression Detection

Save eval results to a log and compare runs:

# evals/history.jsonl (one line per run)
{"date": "2026-05-22T10:00", "commit": "abc123", "pass_rate": 0.92, "avg_score": 8.4}
{"date": "2026-05-22T14:00", "commit": "def456", "pass_rate": 0.88, "avg_score": 8.1}
{"date": "2026-05-23T09:00", "commit": "ghi789", "pass_rate": 0.74, "avg_score": 7.2}  # regression!

def alert_on_regression(threshold=0.05):
    history = load_history()
    if len(history) < 2: return
    prev, curr = history[-2], history[-1]
    if curr["pass_rate"] < prev["pass_rate"] - threshold:
        send_slack_alert(
            f"⚠️ Eval regression: {prev['pass_rate']:.2%} → "
            f"{curr['pass_rate']:.2%} (commit {curr['commit']})"
        )

For deeper monitoring patterns see Claude API Production Checklist.

Common Eval Categories (Production-Tested)

extraction       — structured data from unstructured text
classification   — category assignment
summarization    — condensed restatement
translation      — language conversion
tool_use         — correct function calling
multi_step       — agent task completion
safety           — refuses harmful requests
adversarial      — handles prompt injection (see /claude-prompt-injection-defense)
edge_cases       — empty input, very long input, mixed language

Coverage target: every production prompt type has 5+ test cases.

Mistake Patterns to Avoid

1. Golden set drifts from production traffic

Re-sample golden set quarterly from real production queries. Distributions shift.

2. Single-run evaluation

Run each test 3x and average. Catches non-determinism noise.

3. Judge model = generator model

Use Sonnet to judge Haiku outputs. Don't use Haiku to judge Haiku (correlated errors).

4. No category breakdown

Overall pass rate hides issues. "92% pass" might be 100% on easy + 60% on hard. Always report per-category.

5. Evals only at dev time

Run weekly even without code changes. Catches Anthropic-side model updates (e.g., Sonnet 4.5 → 4.6 retirement).

Tooling Options

Tool	Best for
Custom scripts (above)	Most teams — full control, no vendor lock-in
Anthropic Workbench	Quick manual evals, no automation
Braintrust	Larger teams, dashboards, A/B testing
LangSmith	If already on LangChain stack
Weights & Biases	If already on W&B

Start with custom scripts. Move to vendor when team size > 5.

Frequently Asked Questions

Why not use accuracy metrics like BLEU?

BLEU/ROUGE measure n-gram overlap — useless for open-ended LLM output. LLM-as-judge captures semantic correctness which is what matters in production. BLEU is from the pre-LLM translation era.

How big should the golden set be?

50 prompts is the minimum for statistical signal. 200 is comfortable. 1000+ adds diminishing returns and runs slowly. Quality > quantity — 50 well-curated tests beat 1000 random ones.

Should I A/B test prompts in production?

For high-traffic features, yes — but use eval suite as the pre-filter. Don't A/B every prompt change; only promote candidates that pass evals. A/B is for final fine-tuning, not exploration.

Can I eval streaming outputs?

Yes — concatenate stream chunks before evaluation. The judging is identical. For streaming-specific issues (partial output, malformed JSON mid-stream) add stream-specific tests.

How do I eval agentic workflows?

Eval the trajectory not just the final output. Did Claude use the right tools? In the right order? With reasonable inputs? See Claude Agent Testing & Eval for multi-turn eval patterns.

Master Claude API Production Quality

Claude Agent SDK Cookbook ($79) — production eval suite templates for 12 task types (extraction, classification, agents, RAG, tool use). Plug-and-play with custom golden sets.