Claude API Model Fallback & Circuit Breaker: Production Resilience (2026)
Anthropic's API has 99.9% uptime, but real-world disruption is more frequent than that figure suggests: 429 rate limits, 529 overloaded responses, model-specific outages, and transient network errors. Production agents need five resilience patterns: model fallback (Sonnet → Haiku → cached response), circuit breaker (stop hammering a failing endpoint), exponential backoff with jitter, bulkhead isolation (rate limits per feature), and graceful degradation paths. Without these, a 5-minute Anthropic blip can become a 5-hour customer-facing outage. This guide collects patterns from 12 production retrospectives: what worked and what didn't.
For Claude API basics see Rate Limits & 429 Recovery. For error handling fundamentals see Claude API Error Handling.
The 5 Patterns at a Glance
| Pattern | Prevents | Cost |
|---|---|---|
| Model fallback | Single-model outage | Quality drop on fallback |
| Circuit breaker | Cascading failure | 30-60s blackout window |
| Exp backoff + jitter | Thundering herd | Latency spike during retry |
| Bulkhead | Feature A killing feature B | Lower effective rate limit |
| Graceful degradation | Total feature loss | Reduced functionality |
Pattern 1: Model Fallback Chain
Sonnet down? Fall to Haiku. Haiku down? Fall to cached response. Always have an answer.
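import logging

import anthropic
from anthropic import APIConnectionError, APIStatusError

logger = logging.getLogger(__name__)
client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

RETRYABLE_STATUS = (429, 500, 502, 503, 529)

FALLBACK_CHAIN = [
    {"model": "claude-sonnet-4-5", "max_tokens": 1024},
    {"model": "claude-3-5-haiku-latest", "max_tokens": 1024},
]

async def call_with_fallback(messages, system=None):
    # Only send `system` when one was provided
    extra = {"system": system} if system else {}
    for config in FALLBACK_CHAIN:
        try:
            return await client.messages.create(
                **config,
                **extra,
                messages=messages,
                timeout=15,
            )
        except APIStatusError as e:
            # Only HTTP errors carry a status code
            if e.status_code in RETRYABLE_STATUS:
                logger.warning(f"Model {config['model']} failed: {e}; trying next")
                continue
            raise
        except APIConnectionError:
            # Covers network failures and the SDK's APITimeoutError
            logger.warning(f"Model {config['model']} timed out or was unreachable")
            continue
    # All models failed: return cached or canned response.
    # get_cached_response is an application-specific helper (not shown).
    return get_cached_response(messages) or {
        "content": [{"text": "Service temporarily unavailable. Please retry."}]
    }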
Quality trade-off: Haiku is 70-85% as capable as Sonnet for most tasks. Fallback degrades quality, not features. Document this in your SLA. See Haiku vs Sonnet vs Opus for capability comparison.
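The chain above ends with `get_cached_response`, which this guide does not define. A minimal sketch of what such a helper might look like, assuming an in-memory dict keyed by a hash of the message list. The names `store_cached_response` and `_cache_key` are hypothetical, and a production system would more likely use a shared store with a TTL:

```python
import hashlib
import json
from typing import Dict, List, Optional

# Hypothetical in-memory cache; swap for a shared store in production.
_CACHE: Dict[str, dict] = {}

def _cache_key(messages: List[dict]) -> str:
    """Stable key: SHA-256 of the canonical JSON form of the messages."""
    payload = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def store_cached_response(messages: List[dict], text: str) -> None:
    """Remember the last good answer for this exact conversation."""
    _CACHE[_cache_key(messages)] = {"content": [{"text": text}], "_cached": True}

def get_cached_response(messages: List[dict]) -> Optional[dict]:
    """Return the cached answer, or None so the caller can fall through."""
    return _CACHE.get(_cache_key(messages))
```

Calling `store_cached_response` after every successful API call keeps the cache warm, so the last step of the chain has something to serve. Exact-match keys only help with repeated prompts; similarity lookup is a separate design choice.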
Pattern 2: Circuit Breaker
Stop pounding on a failing service. After N failures, open the circuit and skip calls for a cooldown period.
import time
from enum import Enum

from anthropic import APIError

class CircuitState(Enum):
    CLOSED = "closed"        # Normal: requests pass through
    OPEN = "open"            # Failing: block requests
    HALF_OPEN = "half_open"  # Probing: let trial requests through

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=60):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown

    def can_attempt(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.cooldown:
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        # HALF_OPEN: allow probes until one succeeds (close) or fails (reopen)
        return True

    def record_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        # The count is only reset by a success, so a failed probe reopens the circuit
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Per-model circuit
breakers = {
    "claude-sonnet-4-5": CircuitBreaker(),
    "claude-3-5-haiku-latest": CircuitBreaker(),
}

async def call_with_circuit(model, messages):
    breaker = breakers[model]
    if not breaker.can_attempt():
        raise RuntimeError(f"Circuit OPEN for {model}")
    try:
        # `client` is the AsyncAnthropic client created in Pattern 1
        result = await client.messages.create(
            model=model, max_tokens=1024, messages=messages
        )
        breaker.record_success()
        return result
    except APIError:
        breaker.record_failure()
        raise
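Effect: 5 consecutive failures → circuit opens for a 60s cooldown → a probe request is allowed → if it succeeds, the circuit closes and traffic resumes; if it fails, the cooldown restarts. This prevents 1000 retries from hammering a service that is down.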
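To watch the state transitions without waiting out a real cooldown, here is a compact sketch of the same breaker logic with an injectable clock. The `clock` parameter, the `ClockedBreaker` name, and the `FakeClock` class are assumptions added for illustration; they are not part of the code above.

```python
import time
from typing import Callable

class ClockedBreaker:
    """Same logic as the CircuitBreaker above, with an injectable clock."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 60,
                 clock: Callable[[], float] = time.monotonic):
        self.state = "closed"
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock

    def can_attempt(self) -> bool:
        if self.state == "open":
            if self.clock() - self.last_failure_time > self.cooldown:
                self.state = "half_open"  # cooldown over: let a probe through
                return True
            return False
        return True  # closed or half_open

    def record_success(self) -> None:
        self.failure_count = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = self.clock()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"

class FakeClock:
    """Manually advanced clock so the walkthrough runs instantly."""
    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

clock = FakeClock()
breaker = ClockedBreaker(failure_threshold=5, cooldown=60, clock=clock)

for _ in range(5):                   # five consecutive failures trip the breaker
    breaker.record_failure()
states = [breaker.state]
blocked = not breaker.can_attempt()  # still inside the cooldown

clock.now = 61                       # cooldown elapsed
probe_allowed = breaker.can_attempt()
states.append(breaker.state)

breaker.record_success()             # the probe succeeded
states.append(breaker.state)
```

Injecting the clock is a testing convenience; production code can keep the default and read real time.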
Pattern 3: Exponential Backoff with Jitter
Retry, but exponentially increase delay AND add randomness to avoid thundering herd:
import asyncio
import random

from anthropic import APIStatusError

# `client` is the AsyncAnthropic client created in Pattern 1. The SDK also
# retries on its own (max_retries defaults to 2); pass max_retries=0 when
# creating the client if this loop should be the only retry layer.
async def call_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return await client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,
                messages=messages,
            )
        except APIStatusError as e:
            if attempt == max_retries - 1:
                raise
            if e.status_code not in (429, 500, 503, 529):
                raise
            # Honor retry-after header if present
            retry_after = e.response.headers.get("retry-after")
            if retry_after:
                base_delay = float(retry_after)
            else:
                # Exponential: 1s, 2s, 4s, 8s (the final attempt raises instead of sleeping)
                base_delay = 2 ** attempt
            # Add jitter (±25%)
            delay = base_delay * (0.75 + random.random() * 0.5)
            await asyncio.sleep(delay)
Why jitter matters: 1000 clients all retrying at exactly 2s = thundering herd. Jitter spreads retries across 1.5-2.5s window.
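The delay arithmetic can be pulled out into a pure function, which makes the spread easy to inspect. This is a sketch under the same ±25% assumption as the code above; the function name `backoff_delay` and its parameters are hypothetical.

```python
import random
from typing import Optional

def backoff_delay(attempt: int, rng: Optional[random.Random] = None,
                  base: float = 1.0, jitter: float = 0.25) -> float:
    """Delay before retrying after 0-based `attempt`: base * 2**attempt, +/- jitter."""
    rng = rng if rng is not None else random.Random()
    exponential = base * (2 ** attempt)
    # Uniform factor in [1 - jitter, 1 + jitter)
    factor = (1 - jitter) + rng.random() * 2 * jitter
    return exponential * factor

# 1000 simulated clients, all on their second retry (nominal delay 2s)
rng = random.Random(42)
delays = [backoff_delay(1, rng) for _ in range(1000)]
spread = (min(delays), max(delays))
```

Every value in `delays` falls between 1.5s and 2.5s, so the clients no longer retry at the same instant.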
Pattern 4: Bulkhead (Rate Limit Per Feature)
Don't let one runaway feature exhaust your entire API quota.
import asyncio

# Separate semaphores per feature
SEMAPHORES = {
    "chat": asyncio.Semaphore(20),            # 20 concurrent chat calls
    "summary": asyncio.Semaphore(10),         # 10 concurrent summaries
    "extraction": asyncio.Semaphore(30),      # 30 concurrent extractions
    "background_jobs": asyncio.Semaphore(5),  # 5 background only
}

async def call_claude_bulkhead(feature: str, messages):
    sem = SEMAPHORES[feature]
    async with sem:
        return await client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=messages,
        )
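Effect: a runaway background job can't occupy the slots reserved for chat. Note that semaphores cap concurrent calls per feature, not requests per minute, so size the limits so that their combined throughput stays within your account's rate limit.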
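The isolation can be checked without touching the API. In this sketch a short `asyncio.sleep` stands in for the network round trip; the `demo_bulkhead` function and its parameters are hypothetical, introduced only for the demonstration.

```python
import asyncio

async def demo_bulkhead(limit: int = 3, jobs: int = 10) -> int:
    """Run `jobs` fake calls behind a semaphore; return the peak concurrency seen."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_call() -> None:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the API round trip
            in_flight -= 1

    await asyncio.gather(*(fake_call() for _ in range(jobs)))
    return peak

peak = asyncio.run(demo_bulkhead(limit=3, jobs=10))
```

Ten jobs are submitted at once, yet `peak` never exceeds the limit of 3: the remaining jobs wait for a slot instead of competing with other features.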
Pattern 5: Graceful Degradation
If everything fails, what does the user see?
async def get_user_response(user_msg: str, context: dict):
    # Try full Claude flow
    try:
        return await call_with_fallback(
            messages=[{"role": "user", "content": user_msg}],
            system=context.get("system_prompt"),
        )
    except Exception as e:
        logger.error(f"Claude completely unavailable: {e}")
        # Degraded path: serve cached similar response
        cached = await find_similar_cached(user_msg)
        if cached:
            return {
                "content": [{"text": cached}],
                "_degraded": True,
            }
        # Last resort: canned message with retry guidance
        return {
            "content": [{"text":
                "I'm experiencing issues right now. "
                "Please try again in a minute, or contact support@example.com."
            }],
            "_degraded": True,
        }
UX rule: never show a 500 page when you can show degraded but functional content.
Putting It Together: Production Wrapper
class ResilientClaude:
    def __init__(self):
        # One breaker per model in the fallback chain (Pattern 1)
        self.breakers = {c["model"]: CircuitBreaker() for c in FALLBACK_CHAIN}
        # Per-feature concurrency limits (Pattern 4)
        self.semaphores = SEMAPHORES

    # _call_with_backoff and _degraded_response are methods adapted from
    # Patterns 3 and 5; they are omitted here for brevity.
    async def call(self, messages, feature: str, system=None):
        # Bulkhead
        async with self.semaphores[feature]:
            # Fallback chain with circuit breakers
            for config in FALLBACK_CHAIN:
                model = config["model"]
                if not self.breakers[model].can_attempt():
                    continue
                try:
                    # Exponential backoff
                    result = await self._call_with_backoff(
                        model=model, messages=messages, system=system
                    )
                    self.breakers[model].record_success()
                    return result
                except APIError:
                    self.breakers[model].record_failure()
                    continue
            # All paths exhausted → graceful degradation
            return await self._degraded_response(messages)
This is what real production code looks like — boring, resilient, observable.
Monitoring + Alerting
import logging
from collections import defaultdict

metrics = {
    "calls_total": defaultdict(int),
    "calls_failed": defaultdict(int),
    "fallback_used": defaultdict(int),
    "circuit_opened": defaultdict(int),
}

def log_call(model: str, status: str):
    metrics["calls_total"][model] += 1
    if status == "failed":
        metrics["calls_failed"][model] += 1

# Alert when a model's failure rate exceeds 5% (after 100+ calls).
# The counters are cumulative; reset them periodically for an hourly view.
# send_alert is your own alerting hook (not shown).
def check_health():
    for model in metrics["calls_total"]:
        total = metrics["calls_total"][model]
        failed = metrics["calls_failed"][model]
        if total > 100 and failed / total > 0.05:
            send_alert(f"{model} failure rate: {failed/total:.1%}")
Integrate with Datadog, Sentry, or PostHog. See Claude API Production Checklist for the full monitoring stack.
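The counters above accumulate from process start, so an old burst of failures keeps influencing the rate. A sliding window makes the alert react to recent behavior only. A minimal sketch, assuming a one-hour window kept in memory; the `RollingFailureRate` class and its method names are hypothetical:

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple

class RollingFailureRate:
    """Tracks call outcomes over a sliding time window (default: one hour)."""

    def __init__(self, window_seconds: float = 3600.0):
        self.window = window_seconds
        self.events: Deque[Tuple[float, bool]] = deque()  # (timestamp, failed)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def record(self, failed: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, failed))
        self._evict(now)

    def rate(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self.events:
            return 0.0
        failures = sum(1 for _, failed in self.events if failed)
        return failures / len(self.events)

    def should_alert(self, threshold: float = 0.05, min_calls: int = 100,
                     now: Optional[float] = None) -> bool:
        current = self.rate(now)
        return len(self.events) > min_calls and current > threshold
```

One tracker per model mirrors the per-model counters above. The `now` argument exists so the window can be exercised in tests without waiting an hour.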
Frequently Asked Questions
Should I always fall back from Sonnet to Haiku?
Depends on the task. For chat, summarization, and classification: yes, Haiku is good enough. For code generation and complex reasoning, serving a degraded response or asking the user to retry later is better than returning wrong Haiku output. Make the fallback decision per feature.
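One way to express a per-feature decision is a small policy table consulted before building the chain. This is a sketch only: the feature names, the `FEATURE_FALLBACKS` table, and the `chain_for` helper are hypothetical, and the model IDs are placeholders for whatever IDs your account exposes.

```python
from typing import Dict, List

PRIMARY = "claude-sonnet-4-5"         # placeholder model ID
FALLBACK = "claude-3-5-haiku-latest"  # placeholder model ID

# Hypothetical policy: which models each feature is allowed to fall back to.
FEATURE_FALLBACKS: Dict[str, List[str]] = {
    "chat": [PRIMARY, FALLBACK],
    "summary": [PRIMARY, FALLBACK],
    "classification": [PRIMARY, FALLBACK],
    "code_generation": [PRIMARY],     # no Haiku: degrade or retry later instead
    "complex_reasoning": [PRIMARY],
}

def chain_for(feature: str) -> List[str]:
    """Models to try, in order. Unknown features get the primary model only."""
    return list(FEATURE_FALLBACKS.get(feature, [PRIMARY]))
```

Defaulting unknown features to the primary model alone is the conservative choice: a new feature does not silently inherit a lower-quality fallback.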
How long should circuit breaker cooldown be?
Start at 60s, then match it to the recovery times you observe from Anthropic. Too short (10s) → probes keep hitting a service that is still down and the circuit flaps. Too long (5min) → unnecessary downtime after the service has recovered. Tune based on observed recovery patterns.
Why not just use exponential backoff and skip the circuit breaker?
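Backoff only paces the retries of a single request; every new request still starts its own retry cycle against the failing service. A circuit breaker stops trying entirely once the failure threshold is reached. Both are needed: backoff for transient errors, circuit breaker for sustained outages. Without the circuit breaker, a 5-minute outage = thousands of wasted retry calls.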
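The "thousands of wasted calls" figure can be sanity-checked with a rough count. The sketch below assumes illustrative numbers (5 requests per second, 5 attempts per request, every call failing for the whole outage) and ignores the time spent sleeping between retries, so it is an estimate of scale rather than a measurement. The function name `wasted_calls` is hypothetical.

```python
from typing import Optional

def wasted_calls(outage_seconds: int, requests_per_second: int,
                 attempts_per_request: int, use_breaker: bool,
                 failure_threshold: int = 5, cooldown: int = 60) -> int:
    """Count API calls sent into an outage during which every call fails."""
    calls = 0
    consecutive_failures = 0
    opened_at: Optional[int] = None  # second at which the breaker last opened
    for second in range(outage_seconds):
        for _ in range(requests_per_second):
            for _ in range(attempts_per_request):
                if use_breaker and opened_at is not None:
                    if second - opened_at <= cooldown:
                        break  # circuit open: fail fast, no API call
                    # cooldown elapsed: this attempt acts as the probe
                calls += 1  # the call is sent and fails
                consecutive_failures += 1
                if use_breaker and consecutive_failures >= failure_threshold:
                    opened_at = second  # open (or reopen) the circuit
    return calls

without_breaker = wasted_calls(300, 5, 5, use_breaker=False)
with_breaker = wasted_calls(300, 5, 5, use_breaker=True)
```

Under these assumptions the version without a breaker sends 7500 failing calls during a 5-minute outage, while the version with a breaker sends 9: the five that trip it plus one probe per elapsed cooldown.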
Does Anthropic provide an SLA?
Anthropic publishes 99.9% uptime targets but no formal SLA on standard tier. Enterprise customers get contractual SLAs. Either way: design for outages because they happen.
Can I implement these in n8n or Zapier?
Partially. n8n's Wait + IF nodes can do basic backoff and circuit logic but lack the sophistication of code-level implementations. For mission-critical Claude features, write a thin proxy service in Python/TypeScript and call THAT from n8n. See Claude API + n8n.