Claude API Model Fallback & Circuit Breaker: Production Resilience (2026)

Anthropic's API has 99.9% uptime, but real-world disruption is more frequent than that figure suggests: 429 rate limits, 529 overloaded responses, model-specific outages, and transient network errors. Production agents need 5 resilience patterns: model fallback (Sonnet → Haiku → cached response), circuit breaker (stop hammering a failing endpoint), exponential backoff with jitter, bulkhead isolation (rate limits per feature), and graceful degradation paths. Without these, a 5-minute Anthropic blip can become a 5-hour customer-facing outage. This guide distills patterns from 12 production retrospectives: what worked and what didn't.

For Claude API basics see Rate Limits & 429 Recovery. For error handling fundamentals see Claude API Error Handling.

The 5 Patterns at a Glance

Pattern	Prevents	Cost
Model fallback	Single-model outage	Quality drop on fallback
Circuit breaker	Cascading failure	30-60s blackout window
Exp backoff + jitter	Thundering herd	Latency spike during retry
Bulkhead	Feature A killing feature B	Lower effective rate limit
Graceful degradation	Total feature loss	Reduced functionality

Pattern 1: Model Fallback Chain

Sonnet down? Fall to Haiku. Haiku down? Fall to cached response. Always have an answer.

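import logging

import anthropic
from anthropic import APIConnectionError, APIStatusError

logger = logging.getLogger(__name__)
client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# Model IDs change over time; confirm these against Anthropic's current models list
FALLBACK_CHAIN = [
    {"model": "claude-sonnet-4-5", "max_tokens": 1024},
    {"model": "claude-haiku-3-5", "max_tokens": 1024},
]

RETRYABLE_STATUS = (429, 500, 502, 503, 529)

async def call_with_fallback(messages, system=None):
    # Only send `system` when one was provided
    extra = {"system": system} if system else {}
    for config in FALLBACK_CHAIN:
        try:
            return await client.messages.create(
                **config,
                **extra,
                messages=messages,
                timeout=15,
            )
        except APIStatusError as e:
            # APIStatusError carries the HTTP status; the APIError base class does not
            if e.status_code in RETRYABLE_STATUS:
                logger.warning(f"Model {config['model']} failed: {e}; trying next")
                continue
            raise
        except APIConnectionError as e:
            # Network failures and SDK timeouts (APITimeoutError is a subclass)
            logger.warning(f"Model {config['model']} unreachable: {e}; trying next")
            continue

    # All models failed: return cached or canned response
    # (get_cached_response is an application-defined helper)
    return get_cached_response(messages) or {
        "content": [{"text": "Service temporarily unavailable. Please retry."}]
    }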

Quality trade-off: Haiku is 70-85% as capable as Sonnet for most tasks. Fallback degrades quality, not features. Document this in your SLA. See Haiku vs Sonnet vs Opus for capability comparison.

Pattern 2: Circuit Breaker

Stop pounding on a failing service. After N failures, open the circuit and skip calls for a cooldown period.

import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"        # Normal: requests pass through
    OPEN = "open"            # Tripped: requests are blocked
    HALF_OPEN = "half_open"  # Probing: requests pass until one result settles the state

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=60):
        self.state = CircuitState.CLOSED
        self.failure_count = 0        # consecutive failures since the last success
        self.last_failure_time = 0.0  # time.monotonic() reading
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown      # seconds to stay OPEN before probing

    def can_attempt(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            # monotonic() is unaffected by wall-clock adjustments
            if time.monotonic() - self.last_failure_time > self.cooldown:
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        # HALF_OPEN: allow the probe
        return True

    def record_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        # A failed probe reopens the circuit immediately
        if (self.state == CircuitState.HALF_OPEN
                or self.failure_count >= self.failure_threshold):
            self.state = CircuitState.OPEN

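# Per-model circuit
breakers = {
    "claude-sonnet-4-5": CircuitBreaker(),
    "claude-haiku-3-5": CircuitBreaker(),
}

async def call_with_circuit(model, messages, max_tokens=1024):
    breaker = breakers[model]
    if not breaker.can_attempt():
        raise RuntimeError(f"Circuit OPEN for {model}")

    try:
        result = await client.messages.create(
            model=model, max_tokens=max_tokens, messages=messages
        )
        breaker.record_success()
        return result
    except APIError as e:
        # Client-side errors (400, 401, ...) say nothing about service health,
        # so only connection errors, 429s, and 5xx responses count as failures
        status = getattr(e, "status_code", None)
        if status is None or status == 429 or status >= 500:
            breaker.record_failure()
        raise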

Effect: 5 consecutive failures → circuit opens for a 60s cooldown → probe → on success the circuit closes again; on failure it reopens for another cooldown. This prevents 1000 retries from hammering a down service.
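The transition sequence can be checked without waiting out a real cooldown. The sketch below is a condensed copy of the breaker with an injectable clock; the `clock` parameter, the `FakeClock` helper, and `walk_through` are assumptions added for testing and are not part of the class above. It also assumes the threshold counts consecutive failures, as the code above does.

```python
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class FakeClock:
    """Hypothetical test helper: a clock that only moves when told to."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now
    def advance(self, seconds):
        self.now += seconds

class CircuitBreaker:
    def __init__(self, clock, failure_threshold=5, cooldown=60):
        self.clock = clock  # callable returning seconds (assumption for testability)
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown

    def can_attempt(self):
        if self.state == CircuitState.OPEN:
            if self.clock() - self.last_failure_time > self.cooldown:
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        return True  # CLOSED or HALF_OPEN

    def record_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = self.clock()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

def walk_through():
    clock = FakeClock()
    breaker = CircuitBreaker(clock)
    seen = []
    for _ in range(5):                  # five failures in a row trip the breaker
        breaker.record_failure()
    seen.append(breaker.state)
    seen.append(breaker.can_attempt())  # still inside the cooldown
    clock.advance(61)                   # cooldown elapsed
    seen.append(breaker.can_attempt())  # this call becomes the probe
    seen.append(breaker.state)
    breaker.record_success()            # probe succeeded
    seen.append(breaker.state)
    return seen
```

Calling `walk_through()` yields OPEN, False, True, HALF_OPEN, CLOSED in that order, which is a convenient regression test when tuning thresholds.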

Pattern 3: Exponential Backoff with Jitter

Retry, but exponentially increase delay AND add randomness to avoid thundering herd:

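import asyncio
import random

from anthropic import APIStatusError

RETRYABLE_STATUS = (429, 500, 502, 503, 529)

# Note: the SDK client retries on its own (max_retries defaults to 2).
# Create the client with max_retries=0 if this loop should own all retries.
async def call_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return await client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,
                messages=messages
            )
        except APIStatusError as e:
            # APIStatusError exposes status_code and response; plain APIError does not
            if attempt == max_retries - 1:
                raise
            if e.status_code not in RETRYABLE_STATUS:
                raise

            # Honor retry-after header if present (assumed to be in seconds)
            retry_after = e.response.headers.get("retry-after")
            try:
                base_delay = float(retry_after)
            except (TypeError, ValueError):
                # Header missing or not numeric. Exponential: 1s, 2s, 4s, 8s
                base_delay = 2 ** attempt

            # Add jitter (±25%)
            delay = base_delay * (0.75 + random.random() * 0.5)
            await asyncio.sleep(delay)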

Why jitter matters: 1000 clients all retrying at exactly 2s = thundering herd. Jitter spreads retries across 1.5-2.5s window.
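The arithmetic behind that window can be written out directly. This is a minimal sketch of the ±25% jitter formula used above; the `cap` parameter and both function names are assumptions added here to keep late retries bounded, not something the original snippet defines.

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.25, rng=random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    raw = min(cap, base * (2 ** attempt))
    # Scale by a random factor in [1 - jitter, 1 + jitter)
    factor = (1 - jitter) + rng.random() * (2 * jitter)
    return raw * factor

def delay_bounds(attempt, base=1.0, cap=30.0, jitter=0.25):
    """Smallest and largest delay backoff_delay can return for this attempt."""
    raw = min(cap, base * (2 ** attempt))
    return raw * (1 - jitter), raw * (1 + jitter)
```

For the second attempt, `delay_bounds(1)` returns `(1.5, 2.5)`, matching the 1.5-2.5s window described above. Wider jitter spreads clients further apart at the cost of less predictable latency.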

Pattern 4: Bulkhead (Rate Limit Per Feature)

Don't let one runaway feature exhaust your entire API quota.

import asyncio

# Separate semaphores per feature
SEMAPHORES = {
    "chat": asyncio.Semaphore(20),         # 20 concurrent chat calls
    "summary": asyncio.Semaphore(10),      # 10 concurrent summaries
    "extraction": asyncio.Semaphore(30),   # 30 concurrent extractions
    "background_jobs": asyncio.Semaphore(5)  # 5 background only
}

async def call_claude_bulkhead(feature: str, messages):
    sem = SEMAPHORES[feature]
    async with sem:
        return await client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=messages
        )

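Effect: a runaway background job can occupy at most its own 5 slots, so it cannot starve chat. Note that semaphores cap concurrent requests, not requests per minute, so keep the combined limits within your account's rate limit.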
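The isolation can be observed without calling the API. In this sketch, `fake_call` is a stand-in that sleeps instead of calling `client.messages.create`, and the small limits (2 and 1) are illustrative assumptions chosen to make the effect easy to see.

```python
import asyncio

async def run_bulkhead_demo():
    limits = {"chat": 2, "background_jobs": 1}
    semaphores = {name: asyncio.Semaphore(n) for name, n in limits.items()}
    in_flight = {name: 0 for name in limits}
    peak = {name: 0 for name in limits}

    async def fake_call(feature):
        # Stand-in for the real API call: holds a slot for a short time
        async with semaphores[feature]:
            in_flight[feature] += 1
            peak[feature] = max(peak[feature], in_flight[feature])
            await asyncio.sleep(0.01)
            in_flight[feature] -= 1

    # A flood of background work alongside a few chat requests
    jobs = [fake_call("background_jobs") for _ in range(10)]
    jobs += [fake_call("chat") for _ in range(4)]
    await asyncio.gather(*jobs)
    return peak
```

`asyncio.run(run_bulkhead_demo())` returns `{"chat": 2, "background_jobs": 1}`: ten queued background jobs never hold more than one slot, and chat still reaches its own limit.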

Pattern 5: Graceful Degradation

If everything fails, what does the user see?

async def get_user_response(user_msg: str, context: dict):
    # Try full Claude flow
    try:
        return await call_with_fallback(
            messages=[{"role": "user", "content": user_msg}],
            system=context.get("system_prompt")
        )
    except Exception as e:
        logger.error(f"Claude completely unavailable: {e}")

    # Degraded path: serve cached similar response
    cached = await find_similar_cached(user_msg)
    if cached:
        return {
            "content": [{"text": cached}],
            "_degraded": True
        }

    # Last resort: canned message with retry guidance
    return {
        "content": [{"text":
            "I'm experiencing issues right now. "
            "Please try again in a minute, or contact support@example.com."
        }],
        "_degraded": True
    }

UX rule: never show a 500 page when you can show degraded but functional content.
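The degraded path above depends on `find_similar_cached`, which is left undefined. One possible shape, sketched here with the standard library's `difflib`, is an in-memory cache keyed by normalized message text. The class name, the 0.8 cutoff, and the synchronous interface are all assumptions; a production version would more likely use embeddings or a shared store, and would be wrapped in an async function to match the call above.

```python
import difflib

class SimilarResponseCache:
    """Hypothetical in-memory cache keyed by normalized user message."""

    def __init__(self, cutoff=0.8):
        self.cutoff = cutoff  # minimum similarity ratio (0..1) to accept a match
        self._store = {}

    @staticmethod
    def _normalize(text):
        # Lowercase and collapse whitespace so trivial differences still match
        return " ".join(text.lower().split())

    def put(self, user_msg, response_text):
        self._store[self._normalize(user_msg)] = response_text

    def find_similar(self, user_msg):
        key = self._normalize(user_msg)
        matches = difflib.get_close_matches(
            key, list(self._store), n=1, cutoff=self.cutoff
        )
        return self._store[matches[0]] if matches else None
```

String similarity is a crude proxy for meaning, so a high cutoff is the safer default: serving a cached answer to a different question is worse than falling through to the canned message.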

Putting It Together: Production Wrapper

class ResilientClaude:
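    def __init__(self):
        # One breaker per model in the chain, so the two can never drift apart
        self.breakers = {c["model"]: CircuitBreaker() for c in FALLBACK_CHAIN}
        # Per-feature semaphores, e.g. the SEMAPHORES dict from Pattern 4
        self.semaphores = SEMAPHORES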

    async def call(self, messages, feature: str, system=None):
        # Bulkhead
        async with self.semaphores[feature]:
            # Fallback chain with circuit breakers
            for config in FALLBACK_CHAIN:
                model = config["model"]
                if not self.breakers[model].can_attempt():
                    continue

                try:
                    # Exponential backoff
                    result = await self._call_with_backoff(
                        model=model, messages=messages, system=system
                    )
                    self.breakers[model].record_success()
                    return result
                except APIError:
                    self.breakers[model].record_failure()
                    continue

            # All paths exhausted → graceful degradation
            return await self._degraded_response(messages)

This is what real production code looks like — boring, resilient, observable.

Monitoring + Alerting

import logging
from collections import defaultdict

metrics = {
    "calls_total": defaultdict(int),
    "calls_failed": defaultdict(int),
    "fallback_used": defaultdict(int),
    "circuit_opened": defaultdict(int)
}

def log_call(model: str, status: str):
    metrics["calls_total"][model] += 1
    if status == "failed":
        metrics["calls_failed"][model] += 1

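# Alert if a model's failure rate exceeds 5% (after at least 100 calls).
# These counters are cumulative; reset them periodically (e.g. hourly) for a windowed rate.
# send_alert is an application-defined hook (pager, Slack, etc.)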
def check_health():
    for model in metrics["calls_total"]:
        total = metrics["calls_total"][model]
        failed = metrics["calls_failed"][model]
        if total > 100 and failed / total > 0.05:
            send_alert(f"{model} failure rate: {failed/total:.1%}")

Integrate with Datadog, Sentry, or PostHog. See Claude API Production Checklist for the full monitoring stack.
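The counters above are cumulative, so a failure rate "in the last hour" needs a time window. A minimal sketch of one way to do that follows, assuming a single process and an in-memory deque; the class name, the `clock` parameter, and the `min_calls` floor are illustrative choices, and a metrics backend would normally do this aggregation for you.

```python
import time
from collections import deque

class RollingFailureRate:
    def __init__(self, window_seconds=3600, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock      # injectable so tests can control time
        self.events = deque()   # (timestamp, failed) pairs, oldest first

    def record(self, failed):
        self.events.append((self.clock(), failed))

    def _evict(self):
        # Drop events that have aged out of the window
        cutoff = self.clock() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def rate(self, min_calls=100):
        """Failure rate over the window, or None when there is too little data."""
        self._evict()
        total = len(self.events)
        if total < min_calls:
            return None
        failed = sum(1 for _, was_failure in self.events if was_failure)
        return failed / total
```

Keeping one instance per model and alerting when `rate()` exceeds 0.05 reproduces the 5% rule while letting old incidents age out.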

Frequently Asked Questions

Should I always fall back from Sonnet to Haiku?

Depends on the task. For chat, summarization, and classification: yes, Haiku is good enough. For code generation and complex reasoning, serving a degraded response or asking the user to retry later is better than returning wrong Haiku output. Make fallback a per-feature decision.
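A per-feature decision can be expressed as a small policy table consulted before walking the fallback chain. The sketch below is one possible layout; the feature names, the `FallbackPolicy` fields, and `chain_for` are hypothetical, and the split between features simply mirrors the guidance in this answer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FallbackPolicy:
    allow_smaller_model: bool  # may this feature drop to the smaller model?
    on_exhausted: str          # "cached" or "retry_later" once the chain is spent

# Hypothetical feature names
POLICIES = {
    "chat": FallbackPolicy(True, "cached"),
    "summarization": FallbackPolicy(True, "cached"),
    "classification": FallbackPolicy(True, "cached"),
    "code_generation": FallbackPolicy(False, "retry_later"),
    "complex_reasoning": FallbackPolicy(False, "retry_later"),
}

# Unknown features get the conservative policy: no silent quality drop
DEFAULT_POLICY = FallbackPolicy(False, "retry_later")

def chain_for(feature, full_chain):
    """Return the part of the fallback chain this feature is allowed to use."""
    policy = POLICIES.get(feature, DEFAULT_POLICY)
    if policy.allow_smaller_model:
        return list(full_chain)
    return list(full_chain[:1])  # primary model only
```

Passing the chain in as an argument keeps the policy independent of specific model IDs, so the table does not need to change when models are swapped.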

How long should circuit breaker cooldown be?

Start at 60s and match it to Anthropic's typical recovery time. Too short (10s) → probes fire before the service has recovered, so users keep hitting failures. Too long (5min) → unnecessary downtime after the service is back. Tune based on observed recovery patterns.

Why not just use exponential backoff and skip the circuit breaker?

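Backoff is scoped to a single request: each new request starts its own retry sequence, so a sustained outage still draws retries from every caller. A circuit breaker shares state across requests and stops trying entirely after the threshold. Both are needed: backoff for transient errors, circuit for sustained outages. Without the circuit, a 5-minute outage = thousands of wasted retry calls.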

Does Anthropic provide an SLA?

Anthropic publishes 99.9% uptime targets but no formal SLA on standard tier. Enterprise customers get contractual SLAs. Either way: design for outages because they happen.

Can I implement these in n8n or Zapier?

Partially. n8n's Wait + IF nodes can do basic backoff and circuit logic but lack the sophistication of code-level implementations. For mission-critical Claude features, write a thin proxy service in Python/TypeScript and call THAT from n8n. See Claude API + n8n.
