Claude API Production Architecture: Patterns for Scalable Applications
A production Claude API architecture needs four layers: a request queue for rate limit management, a caching layer for repeated prompts, a cost monitor with circuit breakers, and a fallback chain for reliability. Each layer addresses a distinct failure mode — 429 errors, redundant API spend, runaway costs, and downstream outages. This guide covers the implementation patterns for all four, with Python code you can drop into an existing application.
Architecture overview
Before writing a single line of code, it helps to see how the layers connect. Every production Claude API application should pass requests through this sequence:
Client Request
↓
[Rate Limiter + Queue] ← prevents 429s
↓
[Cache Check] ← skip API for repeated prompts
↓ (cache miss)
[Model Router] ← Haiku/Sonnet/Opus decision
↓
[Claude API]
↓
[Response Cache] ← store for future cache hits
↓
[Cost Monitor] ← track tokens, alert on spikes
↓
Client Response
Each layer is independently testable. The rate limiter does not need to know about the cache. The cost monitor does not need to know about the model router. This separation matters when you need to swap one component — for example, replacing an in-process queue with Redis without touching your caching logic.
Layer 1: Request queue with rate limit management
Claude API rate limits operate on two axes: requests per minute (RPM) and tokens per minute (TPM). Exceeding either returns a 429 RateLimitError. The naive fix is exponential backoff on the error, but that adds latency after the fact. A proper queue prevents the 429 from occurring.
The implementation below uses asyncio to queue requests and enforce both limits in-process. For multi-process or multi-host deployments, replace the in-memory deques with Redis sorted sets.
```python
import asyncio
import time
from collections import deque


class RateLimitedQueue:
    def __init__(self, requests_per_minute: int = 50, tokens_per_minute: int = 40_000):
        self.rpm_limit = requests_per_minute
        self.tpm_limit = tokens_per_minute
        self.request_times = deque()
        self.token_counts = deque()
        self.semaphore = asyncio.Semaphore(10)  # max concurrent requests

    def _evict_expired(self, now: float):
        # Drop entries that have fallen out of the trailing 60-second window
        minute_ago = now - 60
        while self.request_times and self.request_times[0] < minute_ago:
            self.request_times.popleft()
            self.token_counts.popleft()

    async def acquire(self, estimated_tokens: int = 1000):
        async with self.semaphore:
            now = time.time()
            self._evict_expired(now)
            # Wait until both the RPM and TPM windows have room
            while (
                len(self.request_times) >= self.rpm_limit
                or sum(self.token_counts) + estimated_tokens > self.tpm_limit
            ):
                oldest = self.request_times[0] if self.request_times else now
                await asyncio.sleep(max(60 - (now - oldest), 0.1))
                now = time.time()
                self._evict_expired(now)
            self.request_times.append(now)
            self.token_counts.append(estimated_tokens)
```
The semaphore caps concurrent requests at 10. The deque-based sliding window tracks both request counts and token estimates over the trailing 60 seconds. Callers pass an estimated_tokens value — use your p90 token count as the default. If you consistently underestimate, the queue will allow too many requests; if you overestimate, you leave throughput on the table.
Wire the queue into your API call wrapper:
```python
import anthropic

# AsyncAnthropic reads ANTHROPIC_API_KEY from the environment
async_client = anthropic.AsyncAnthropic()
queue = RateLimitedQueue(requests_per_minute=50, tokens_per_minute=40_000)


async def call_claude(prompt: str, model: str = "claude-sonnet-4-5") -> str:
    # Rough input-token estimate: ~4 characters per token
    await queue.acquire(estimated_tokens=len(prompt) // 4)
    response = await async_client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```
Layer 2: Semantic caching
Exact-match caching uses a hash of the prompt as the cache key. It works for identical repeated requests — scheduled reports, templated queries, test suites. It misses the larger opportunity: many prompts ask the same question with slightly different phrasing.
Semantic caching embeds the prompt and checks cosine similarity against stored embeddings. If similarity exceeds 0.95, return the cached response. The threshold is the tunable variable — lower it to get more cache hits with slightly less precision, raise it to require near-identical phrasing.
Implementation approach:
- Hash the prompt. Check an in-process dict or Redis for an exact hit first — this is free and fast.
- On exact miss, embed the prompt using a lightweight embedding model (all-MiniLM-L6-v2 runs locally in under 50ms on CPU).
- Query a vector store (pgvector, Pinecone, or Chroma) for the nearest stored embedding. If cosine similarity > 0.95, return the cached response.
- On cache miss, call the Claude API, store the response, and write the embedding to the vector store.
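A minimal sketch of both tiers, assuming sentence-transformers for the local embedding model and a plain in-memory list standing in for the vector store (swap in pgvector, Pinecone, or Chroma in production). cache_lookup and cache_store are illustrative helper names, not part of any SDK:

```python
import hashlib
import time

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
exact_cache: dict[str, tuple[str, float]] = {}            # prompt hash -> (response, expiry)
semantic_cache: list[tuple[np.ndarray, str, float]] = []  # (embedding, response, expiry)


def cache_lookup(prompt: str, threshold: float = 0.95) -> str | None:
    now = time.time()
    key = hashlib.sha256(prompt.encode()).hexdigest()
    # Tier 1: exact match is free and fast
    hit = exact_cache.get(key)
    if hit and hit[1] > now:
        return hit[0]
    # Tier 2: semantic match via cosine similarity of normalized embeddings
    query = embedder.encode(prompt, normalize_embeddings=True)
    for embedding, response, expiry in semantic_cache:
        if expiry > now and float(np.dot(query, embedding)) >= threshold:
            return response
    return None


def cache_store(prompt: str, response: str, ttl_seconds: int = 3600) -> None:
    expiry = time.time() + ttl_seconds
    key = hashlib.sha256(prompt.encode()).hexdigest()
    exact_cache[key] = (response, expiry)
    embedding = embedder.encode(prompt, normalize_embeddings=True)
    semantic_cache.append((embedding, response, expiry))
```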
TTL by query type:
| Query type | TTL |
|---|---|
| Factual lookups (definitions, documentation) | 1 hour |
| Code generation | 24 hours |
| Analysis with static inputs | 6 hours |
| Real-time data (prices, availability, live status) | Do not cache |
For code generation, the output is deterministic enough that a 24-hour TTL rarely causes stale responses. For real-time queries, skip the cache layer entirely — detect these by checking for keywords like "current", "today", "latest", or "live" in the prompt before the cache lookup.
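A simple gate along those lines; should_skip_cache is an illustrative helper, and the keyword list should be tuned to your domain:

```python
REALTIME_KEYWORDS = {"current", "today", "latest", "live"}


def should_skip_cache(prompt: str) -> bool:
    # Bypass both cache tiers when the prompt references real-time data
    words = set(prompt.lower().split())
    return bool(REALTIME_KEYWORDS & words)
```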
Layer 3: Fallback chain
A fallback chain retries failed requests on alternative models. The pattern is most useful when Sonnet is rate-limited but Haiku has capacity, or when you want automatic cost reduction for requests that a smaller model can handle.
```python
import anthropic


async def call_with_fallback(prompt: str) -> str:
    models = ["claude-sonnet-4-5", "claude-haiku-4-5"]
    for model in models:
        try:
            return await call_claude(prompt, model=model)
        except anthropic.RateLimitError:
            if model == models[-1]:
                raise
            await asyncio.sleep(2)
    raise RuntimeError("All models failed")
```
Keep the sleep between retries short — 2 seconds is enough for the rate limit window to shift slightly. Longer sleeps increase tail latency without meaningfully improving the chance of success.
Extend the fallback chain to handle other error types differently. anthropic.APIStatusError with a 529 status (overloaded) warrants a retry on the same model. anthropic.BadRequestError indicates a prompt problem that retrying will not fix — log it and surface the error to the caller immediately.
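A sketch of that branching, assuming the exception classes exposed by the anthropic Python SDK; call_with_retries is an illustrative wrapper around the call_claude helper above:

```python
import logging

import anthropic


async def call_with_retries(prompt: str, model: str, max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        try:
            return await call_claude(prompt, model=model)
        except anthropic.BadRequestError:
            # Prompt problem: retrying will not fix it, so surface it immediately.
            # Caught before APIStatusError because it is a subclass of it.
            logging.exception("bad request for model %s", model)
            raise
        except anthropic.APIStatusError as err:
            # 529 means the API is overloaded; retry the same model with backoff
            if err.status_code == 529 and attempt < max_attempts - 1:
                await asyncio.sleep(2 ** attempt)
                continue
            raise
    raise RuntimeError("exhausted retries")  # only reached if max_attempts < 1
```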
Layer 4: Cost monitoring
Cost monitoring has two jobs: tracking spend per request and alerting when something is wrong. Both require logging structured data on every API response.
Log this on every request:
```python
import logging
import time

# USD per million tokens
COST_PER_TOKEN = {
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00, "cache_read": 0.30},
    "claude-haiku-4-5": {"input": 0.80, "output": 4.00, "cache_read": 0.08},
}


def log_request(response, model: str, user_id: str, start_time: float) -> float:
    usage = response.usage
    rates = COST_PER_TOKEN[model]
    cached_tokens = getattr(usage, "cache_read_input_tokens", 0) or 0
    cost_usd = (
        usage.input_tokens * rates["input"] / 1_000_000
        + usage.output_tokens * rates["output"] / 1_000_000
        + cached_tokens * rates["cache_read"] / 1_000_000
    )
    logging.info({
        "event": "claude_request",
        "model": model,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "cached_tokens": cached_tokens,
        "cost_usd": round(cost_usd, 6),
        "latency_ms": int((time.time() - start_time) * 1000),
        "user_id": user_id,
        "timestamp": time.time(),
    })
    return cost_usd
```
Alert thresholds:
- Single request exceeds $0.10 — likely a prompt injection or runaway context window
- Daily total exceeds $50 — send a Slack or PagerDuty alert before the bill compounds
- Error rate exceeds 5% over a 10-minute window — indicates rate limiting or model availability issues
Use PostHog or Datadog to track cost trends — send the structured log as a custom event and build a dashboard on cost_usd grouped by model and user_id. Anomalies become visible within hours.
Circuit breaker: If the daily total crosses 80% of your budget threshold, route all non-critical requests to Haiku automatically. This prevents a single runaway job from consuming the monthly budget in an afternoon.
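A minimal in-process sketch of that breaker; choose_model and record_spend are illustrative helpers, and a multi-host deployment would keep the running total in Redis or the metrics store rather than a module-level variable:

```python
DAILY_BUDGET_USD = 50.0
daily_spend_usd = 0.0  # reset by a scheduled job at midnight UTC


def record_spend(cost_usd: float) -> None:
    # Feed this from log_request's return value
    global daily_spend_usd
    daily_spend_usd += cost_usd


def choose_model(requested_model: str, critical: bool = False) -> str:
    # Past 80% of the daily budget, downgrade non-critical traffic to Haiku
    if not critical and daily_spend_usd >= 0.8 * DAILY_BUDGET_USD:
        return "claude-haiku-4-5"
    return requested_model
```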
Deployment checklist
- Rate limit queue implemented (in-process for single host, Redis for multi-host)
- Prompt caching enabled for system prompts with `cache_control: ephemeral`
- Cost monitoring + alerts configured (single request, daily total, error rate)
- Fallback to Haiku on `RateLimitError` from Sonnet
- Response logging for every request with model, tokens, cost, latency
- Separate API keys per environment (development, staging, production)
- Budget circuit breaker set at 80% of monthly limit
Separate API keys per environment are non-negotiable. A staging environment pointed at the production key consumes production rate limit quota, inflates the cost dashboard, and makes key rotation painful.
Sizing the queue for your traffic
New Anthropic accounts start at 50 RPM and 40,000 TPM on Sonnet. Usage tier 2 (after spending $500) raises these substantially. Size the queue to your current tier, not a future one.
For multi-host deployments, move request_times and token_counts to a Redis sorted set keyed by timestamp. Each host increments the same counters, giving you a global view of rate limit consumption rather than per-host silos that collectively exceed the limit.
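A sketch of that sorted-set window using redis-py's asyncio client; the key name and helper are illustrative, and a production version would wrap the check-and-add in a Lua script so it is atomic across hosts:

```python
import time
import uuid

import redis.asyncio as redis

r = redis.Redis()


async def acquire_global(rpm_limit: int = 50) -> bool:
    key = "claude:request_times"
    now = time.time()
    async with r.pipeline(transaction=True) as pipe:
        pipe.zremrangebyscore(key, 0, now - 60)  # evict entries older than 60s
        pipe.zcard(key)                          # count requests still in the window
        _, in_window = await pipe.execute()
    if in_window >= rpm_limit:
        return False  # caller should sleep briefly and retry
    # Unique member per request, scored by timestamp so eviction stays cheap
    await r.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    return True
```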
Frequently asked questions
What is the most common cause of 429 errors in production Claude API applications?
Burst traffic without a queue. Most applications work fine at average load but send bursts of requests simultaneously — page loads, batch jobs starting at the same time, or a single user triggering a multi-step agent. A sliding window queue absorbs these bursts and smooths them within the rate limit, eliminating most 429s without adding meaningful latency under normal conditions.
Should I use Haiku or Sonnet as the primary model for a production application?
Route by task complexity. Haiku handles classification, summarization of short documents, extraction, and simple Q&A at roughly a quarter of the cost of Sonnet. Sonnet handles multi-step reasoning, code generation, and analysis of long documents. The model router in the architecture above is the right place to implement this logic: inspect the prompt for keywords or token count to decide which model to invoke.
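A keyword-and-length heuristic along those lines; the marker list and token cutoff below are illustrative assumptions to tune against your own traffic:

```python
def route_model(prompt: str) -> str:
    # Long or reasoning-heavy prompts go to Sonnet, everything else to Haiku
    complex_markers = ("analyze", "refactor", "implement", "debug", "step by step")
    estimated_tokens = len(prompt) // 4
    if estimated_tokens > 2_000 or any(m in prompt.lower() for m in complex_markers):
        return "claude-sonnet-4-5"
    return "claude-haiku-4-5"
```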
How do I prevent a single user from consuming disproportionate API budget?
Add per-user rate limiting in the queue layer alongside the global limit. A per-user cap of 10 RPM and 5,000 TPM prevents any single account from starving others. Log cost_usd per user_id to identify top spenders for billing or throttling decisions.
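One way to bolt the per-user check onto the front of the global queue; the names and limits are illustrative:

```python
import time
from collections import defaultdict, deque

USER_RPM_LIMIT = 10
user_windows: dict[str, deque] = defaultdict(deque)


def user_at_limit(user_id: str) -> bool:
    # Per-user sliding window, checked before the global RateLimitedQueue
    now = time.time()
    window = user_windows[user_id]
    while window and window[0] < now - 60:
        window.popleft()
    if len(window) >= USER_RPM_LIMIT:
        return True
    window.append(now)
    return False
```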
Is the in-process queue safe across multiple concurrent web workers?
No. asyncio.Semaphore is single-process. Multiple Gunicorn workers or Kubernetes replicas each maintain independent queue state, and their combined traffic can exceed the rate limit. Replace the in-memory deques with Redis using ZRANGEBYSCORE and ZADD for the sliding window — the logic is identical, only the storage backend changes.
When should I use the Batch API instead of the real-time queue?
Use the Batch API for workloads that do not need a real-time response — nightly processing, bulk analysis, evaluation runs, scheduled reports. It returns results within 24 hours at 50% lower cost. The queue in this guide is for interactive, user-facing requests. The correct production setup uses both: the real-time queue for live requests, and the Batch API for async jobs.
Take It Further
Claude Agent SDK Cookbook: 40 Production Patterns — Pattern 5 covers the complete Production Architecture: the full queue implementation with Redis backend, semantic caching with pgvector, the PostHog cost monitoring integration, and the Kubernetes deployment manifests for scaling Claude workloads horizontally.
→ Get the Agent SDK Cookbook — $49
30-day money-back guarantee. Instant download.