Claude API Evaluation: LLM-as-Judge, Golden Sets, Regression (2026)
You ship Claude features blind without evals. Production eval has 3 layers: a golden test set (50 prompts with expected outputs), LLM-as-judge for grading (Claude Sonnet evaluates Haiku/Sonnet outputs against rubric), and regression alerts (run on every prompt change). A 50-prompt eval suite costs ~$0.30 per run with Sonnet judge, catches ~90% of regressions before production. Without evals you're playing whack-a-mole β fixing one bug while breaking three. This guide is the framework that scales from solo dev to production team.
For Claude API basics see Python SDK Quickstart. For agent-specific testing see Claude Agent Testing & Eval.
Why Most Teams Skip Evals (And Why That's Wrong)
Common excuses:
- "I'll test in production"
- "Manual review is enough"
- "Output is non-deterministic anyway"
- "LLM-as-judge is biased"
Reality:
- Production debugging costs 10x more than catching pre-deploy
- Manual review misses 70%+ of regressions on outputs >50 tokens
- Non-determinism is exactly why you need statistical eval (not single runs)
- LLM-as-judge with Sonnet has 92% agreement with human raters on most tasks
The investment: 4 hours to set up, $1-5/month to run. The savings: catching one production bug pays for years of evals.
Layer 1: Golden Test Set
Curate 30-100 prompts that cover your distribution:
# evals/golden_set.json
[
{
"id": "extract-001",
"category": "extraction",
"input": "Total: $1,247.50 from invoice INV-2024-091",
"expected": {"total": 1247.50, "currency": "USD", "invoice": "INV-2024-091"},
"rubric": "JSON must match exactly. Total as number not string."
},
{
"id": "summary-001",
"category": "summary",
"input": "<3-paragraph customer feedback>",
"expected": null, # no exact match β use rubric grading
"rubric": "Summary captures: (1) main issue, (2) sentiment (positive/neutral/negative), (3) requested action. Under 50 words."
},
# ... 50 more
]
Distribution rule: golden set must match production traffic distribution. If 60% of real queries are extraction, 60% of golden set is extraction.
Layer 2: LLM-as-Judge
Use Claude Sonnet to evaluate outputs (Haiku is cheaper but less reliable as judge):
JUDGE_SYSTEM = """You are an evaluation judge. Score each response 0-10 against the rubric.
Output JSON only:
{
"score": <0-10>,
"passes": <true if score >= 7 else false>,
"reasoning": "<one sentence>"
}
Be strict. A score of 7 means 'acceptable for production'. 9+ means 'excellent'."""
def grade_response(prompt: str, output: str, rubric: str) -> dict:
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=300,
temperature=0, # deterministic grading
system=[{
"type": "text",
"text": JUDGE_SYSTEM,
"cache_control": {"type": "ephemeral"}
}],
messages=[{
"role": "user",
"content": f"""Prompt: {prompt}
Response to grade:
{output}
Rubric: {rubric}
Score this response."""
}]
)
return json.loads(response.content[0].text)
Cost per grading: ~$0.003 (Sonnet input + 100 output tokens). 50-prompt suite = $0.15 for outputs + $0.15 for judging = $0.30 total.
Layer 3: Eval Runner
Tie golden set + LLM-as-judge into a runnable suite:
async def run_eval_suite(model: str, system_prompt: str) -> dict:
results = []
for test in golden_set:
# Generate response with the model under test
response = await client.messages.create(
model=model,
max_tokens=1024,
temperature=0,
system=system_prompt,
messages=[{"role": "user", "content": test["input"]}]
)
output = response.content[0].text
# Exact match check (if expected is set)
if test["expected"] is not None:
passes = output_matches_expected(output, test["expected"])
grade = {"score": 10 if passes else 0, "passes": passes,
"reasoning": "exact match check"}
else:
# Rubric grading via Sonnet judge
grade = grade_response(test["input"], output, test["rubric"])
results.append({
"id": test["id"],
"category": test["category"],
"passes": grade["passes"],
"score": grade["score"],
"output": output[:200], # truncate for log
})
return {
"model": model,
"pass_rate": sum(r["passes"] for r in results) / len(results),
"avg_score": sum(r["score"] for r in results) / len(results),
"by_category": group_pass_rates_by_category(results),
"failures": [r for r in results if not r["passes"]]
}
Run on every prompt or system prompt change:
# CI command
python evals/run.py --model claude-sonnet-4-5 --threshold 0.85
# Exits non-zero if pass_rate < 0.85
Layer 4: Regression Detection
Save eval results to a log and compare runs:
# evals/history.jsonl (one line per run)
{"date": "2026-05-22T10:00", "commit": "abc123", "pass_rate": 0.92, "avg_score": 8.4}
{"date": "2026-05-22T14:00", "commit": "def456", "pass_rate": 0.88, "avg_score": 8.1}
{"date": "2026-05-23T09:00", "commit": "ghi789", "pass_rate": 0.74, "avg_score": 7.2} # regression!
def alert_on_regression(threshold=0.05):
history = load_history()
if len(history) < 2: return
prev, curr = history[-2], history[-1]
if curr["pass_rate"] < prev["pass_rate"] - threshold:
send_slack_alert(
f"β οΈ Eval regression: {prev['pass_rate']:.2%} β "
f"{curr['pass_rate']:.2%} (commit {curr['commit']})"
)
For deeper monitoring patterns see Claude API Production Checklist.
Common Eval Categories (Production-Tested)
extraction β structured data from unstructured text
classification β category assignment
summarization β condensed restatement
translation β language conversion
tool_use β correct function calling
multi_step β agent task completion
safety β refuses harmful requests
adversarial β handles prompt injection (see /claude-prompt-injection-defense)
edge_cases β empty input, very long input, mixed language
Coverage target: every production prompt type has 5+ test cases.
Mistake Patterns to Avoid
1. Golden set drifts from production traffic
Re-sample golden set quarterly from real production queries. Distributions shift.
2. Single-run evaluation
Run each test 3x and average. Catches non-determinism noise.
3. Judge model = generator model
Use Sonnet to judge Haiku outputs. Don't use Haiku to judge Haiku (correlated errors).
4. No category breakdown
Overall pass rate hides issues. "92% pass" might be 100% on easy + 60% on hard. Always report per-category.
5. Evals only at dev time
Run weekly even without code changes. Catches Anthropic-side model updates (e.g., Sonnet 4.5 β 4.6 retirement).
Tooling Options
| Tool | Best for |
|---|---|
| Custom scripts (above) | Most teams β full control, no vendor lock-in |
| Anthropic Workbench | Quick manual evals, no automation |
| Braintrust | Larger teams, dashboards, A/B testing |
| LangSmith | If already on LangChain stack |
| Weights & Biases | If already on W&B |
Start with custom scripts. Move to vendor when team size > 5.
Frequently Asked Questions
Why not use accuracy metrics like BLEU?
BLEU/ROUGE measure n-gram overlap β useless for open-ended LLM output. LLM-as-judge captures semantic correctness which is what matters in production. BLEU is from the pre-LLM translation era.
How big should the golden set be?
50 prompts is the minimum for statistical signal. 200 is comfortable. 1000+ adds diminishing returns and runs slowly. Quality > quantity β 50 well-curated tests beat 1000 random ones.
Should I A/B test prompts in production?
For high-traffic features, yes β but use eval suite as the pre-filter. Don't A/B every prompt change; only promote candidates that pass evals. A/B is for final fine-tuning, not exploration.
Can I eval streaming outputs?
Yes β concatenate stream chunks before evaluation. The judging is identical. For streaming-specific issues (partial output, malformed JSON mid-stream) add stream-specific tests.
How do I eval agentic workflows?
Eval the trajectory not just the final output. Did Claude use the right tools? In the right order? With reasonable inputs? See Claude Agent Testing & Eval for multi-turn eval patterns.
Master Claude API Production Quality
Claude Agent SDK Cookbook ($79) β production eval suite templates for 12 task types (extraction, classification, agents, RAG, tool use). Plug-and-play with custom golden sets.