Claude API Production Checklist: 25 Things Before You Ship
Before going to production with Claude API, you need six things locked down: API keys stored in environment variables (never hardcoded), a budget alert so a spike doesn't surprise you, retry logic with exponential backoff so transient 529s don't crash your app, token-level logging for cost visibility, input validation to prevent prompt injection, and prompt caching enabled on any system prompt longer than 1,024 tokens. That is the short answer. The long answer is 25 checklist items across six categories — security, cost controls, reliability, observability, content safety, and performance — each with the reasoning you need to implement them correctly.
The items below are ordered within each category by failure frequency: the ones teams skip most often come first.
Category 1: Security (5 items)
Leaked API keys are the fastest way to a five-figure bill you didn't plan. Security issues in this section take minutes to introduce and hours to discover.
1. API key in environment variables, not source code. Never write `api_key="sk-ant-..."` in your application code. Store it in `.env` locally (gitignored) and in your platform's secret manager in production. A key committed to GitHub, even briefly, will be scraped by bots within seconds. Rotate immediately if it leaks.
2. Separate keys per environment (dev / staging / prod). A staging key that hits a per-key budget cap won't take down production, and a prod key that leaks from a staging repo is still a prod leak. Create one key per environment in the Anthropic Console and name them clearly (`prod-api-key`, `staging-api-key`).
3. Key rotation policy documented and tested. Key rotation is a fire drill — if you've never done it, it will go wrong at 2 AM. Document the steps: create new key → update secret manager → redeploy → verify → delete old key. Test the rotation on staging before you need it in production. For high-value applications, rotate every 90 days.
4. API key never logged, never in query strings. Check every logging statement and every HTTP client for accidental key exposure. Query-string parameters land in server logs, CDN logs, and browser history. Send the key only in the `x-api-key` request header, which is how the Anthropic API authenticates. Grep for `sk-ant` in your codebase and CI pipeline before every deploy.
5. Principle of least privilege for service accounts. If your application only calls the Messages API, the service account or IAM role running it should have no other permissions. A compromised application should not be able to access your database credentials, your DNS settings, or your billing console via a lateral move. Scope secrets explicitly.
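Items 1 and 4 can be enforced with a few lines at application startup. This is a minimal sketch, assuming the SDK's default environment variable name `ANTHROPIC_API_KEY`; the prefix check is an illustrative heuristic, not an official validation rule:

```python
import os

def load_api_key() -> str:
    """Fail fast at startup with a clear error instead of sending an empty key."""
    key = os.environ.get("ANTHROPIC_API_KEY", "")
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; load it from your .env or secret manager"
        )
    if not key.startswith("sk-ant-"):
        # Heuristic sanity check: catches a placeholder or a wrong variable.
        raise RuntimeError("ANTHROPIC_API_KEY does not look like an Anthropic key")
    return key
```

Calling this once at boot means a misconfigured environment fails loudly at deploy time, not silently as 401s in production logs.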
Category 2: Cost Controls (5 items)
Claude API cost spikes are real and they happen fast. A single runaway loop calling Opus can burn through a monthly budget in an hour. These five items put hard and soft limits in place before that happens.
6. Budget alert configured in Anthropic Console. Go to Console → Settings → Billing → Budget and Alerts. Set a monthly alert at 80% of your expected spend and a hard limit at 120%. This is a five-minute task that has saved many teams from four-figure surprises. Without it, you have no visibility until the invoice arrives.
7. Model selection justified per task, not defaulted to Opus. Every API call in your codebase should have an explicit model choice with a written reason. Opus for document summarization is almost always wrong — Haiku handles it at a fraction of the cost. See the Haiku vs Sonnet vs Opus decision guide for a task-by-task breakdown. Start with the 80/15/5 rule: 80% Haiku, 15% Sonnet, 5% Opus.
8. Prompt caching enabled on system prompts ≥ 1,024 tokens. If your system prompt is long and static — product context, instructions, few-shot examples — add `"cache_control": {"type": "ephemeral"}` to the last static block. Cache reads are priced roughly 90% below regular input tokens. A 2,000-token system prompt repeated across 10,000 requests is 20M input tokens a month; caching removes roughly 90% of that input cost. See the architecture guide for implementation.
9. Batch API used for all non-interactive workloads. The Batch API is 50% off all model prices. If a job doesn't need a real-time response — nightly document processing, bulk classification, report generation — use `client.messages.batches.create()` instead of the synchronous API. The only cost is async handling logic, which is straightforward with a polling loop or a webhook.
10. Max tokens capped per request. The Messages API requires `max_tokens`, but many codebases set one blanket high value everywhere. Cap it per task instead: 50 for classification, 500 for summaries, 2,048 for chat. Otherwise a prompt that encourages long output can generate 4,096+ token responses when you expected 200. A single over-generous request is usually harmless; a loop of them is expensive.
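Items 8 and 10 come together in how the request is built. A hedged sketch of the request shape, following Anthropic's documented prompt-caching pattern: `cache_control` goes on the last static system block, and `max_tokens` is set per task. The model string and prompt text here are placeholder assumptions:

```python
# Placeholder for ~2,000 tokens of static product docs and instructions.
PRODUCT_CONTEXT = "...long, static system prompt content..."

def build_request(user_message: str, model: str = "claude-sonnet-4-5") -> dict:
    """Build Messages API kwargs with caching on the static system block."""
    return {
        "model": model,
        "max_tokens": 500,  # capped for a summary-style task
        "system": [
            {
                "type": "text",
                "text": PRODUCT_CONTEXT,
                # Everything up to and including this block is cached;
                # the varying user turn below is not.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

In use, the dict is unpacked into the SDK call, e.g. `client.messages.create(**build_request(text))`, so every call site inherits the same cap and cache placement.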
Spending more than $200/month on Claude API? The P5 Cost Optimization Masterclass covers advanced caching strategies, model routing decision trees, Batch API patterns, and a worked case study showing how one team cut their bill by 85% in 6 weeks. Get it here →
Category 3: Reliability (5 items)
The Claude API is highly available, but it is not infallible. Overload errors (529), rate limit errors (429), and transient network failures happen in every production system. These items ensure your application handles them gracefully rather than surfacing raw errors to users.
11. Retry logic with exponential backoff on 429 and 529 errors. The Anthropic Python and Node SDKs include automatic retries with backoff by default (up to 2 retries). Verify this is enabled and raise the retry count to 3–5 for user-facing paths. For background jobs, use longer backoff: start at 1s, double each attempt, cap at 60s. Never retry immediately — it makes overload worse.
12. Timeout set on every API call. Default HTTP timeouts vary by client and are often too long for user-facing applications. Set an explicit 30s timeout on interactive calls and 120s on batch-style calls. A request that hangs for 5 minutes while a user waits is worse than a clean error that triggers a retry. Use the SDK's `timeout` option, or pass a pre-configured `http_client` for finer control.
13. Circuit breaker pattern for dependent services. If your application calls Claude API as part of a larger workflow, a Claude outage should not cascade into a full application outage. Implement a simple circuit breaker: after 5 consecutive failures, stop sending requests for 60 seconds and serve a cached or degraded response. Open-circuit behavior should be tested explicitly before launch. See the production architecture patterns for a Python implementation.
14. Fallback model defined for critical paths. For user-facing features, define a fallback: if Sonnet is unavailable, try Haiku. If Haiku returns a 529, serve a cached response or a static message. This requires your model selection logic to be parameterized (no hardcoded model strings) and your cache to have a TTL that covers brief outages. Document the fallback behavior in your runbook.
15. Idempotency keys on payment-adjacent or state-changing calls. If an API call triggers a side effect (records a database entry, sends an email, charges a user), network failures can cause duplicate calls. Use idempotency keys or check for existing records before processing. This is not specific to Claude API, but it is commonly overlooked when adding AI features to existing systems. For agent workloads where tool calls create side effects, see the production agent deployment guide for a complete treatment of this pattern.
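If you bypass the SDK's built-in retries (raw HTTP, or a custom transport), item 11's schedule is small enough to write by hand. A sketch with full jitter; the retryable status set and delay constants are illustrative defaults, not Anthropic recommendations:

```python
import random
import time

RETRYABLE = {429, 500, 529}  # rate limit, server error, overloaded

def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Call send() -> (status, body); retry retryable statuses with jittered backoff."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts, return the last error
        # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay,
        # with full jitter so retrying clients don't stampede in sync.
        delay = min(max_delay, base_delay * 2 ** attempt)
        time.sleep(random.uniform(0, delay))
    return status, body
```

The jitter matters: without it, every client that failed at the same moment retries at the same moment, which prolongs the overload the 529 was signaling.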
Category 4: Observability (5 items)
You cannot optimize what you cannot measure. These five items give you the visibility needed to debug issues, control costs, and improve quality over time. See also: cost optimization case study.
16. Token usage logged per request (input, output, cache hit/miss). The API response includes `usage.input_tokens`, `usage.output_tokens`, `usage.cache_read_input_tokens`, and `usage.cache_creation_input_tokens`. Log all four with a request ID, timestamp, model name, and feature name. This is the minimum needed to track cost attribution and catch runaway token usage. Without it, you're flying blind on your bill.
17. Latency tracked end-to-end, not just API round-trip. Measure time from user request to response rendered, not just the `client.messages.create()` call. Prompt construction, post-processing, and database writes add latency that can make a fast API call feel slow. Use structured logging or an APM tool (Datadog, Sentry, Honeycomb) to see the full picture. P95 and P99 latency matter more than averages for user experience.
18. Error rate monitored with alerting on sustained degradation. Set an alert if your error rate on Claude API calls exceeds 1% over a 5-minute window. A brief spike is normal; sustained errors at that rate indicate a configuration problem, a rate limit being hit continuously, or a platform incident. Include the HTTP status code and error type in your logs so you can distinguish 429s (your traffic) from 529s (platform load) from 4xx errors (your requests).
19. Per-feature cost attribution in your metrics. Aggregate token usage by feature name (e.g., `summarization`, `chat`, `extraction`). This is the only way to know which feature is driving your bill and which optimization will have the most impact. Add a `feature` tag to every API call's metadata. Teams that skip this spend weeks guessing before cutting costs.
20. Sampling strategy for long-term cost vs. logging volume. At high volume, logging every token count is cheap. Logging full prompts and responses is not — it generates enormous data volumes and can create compliance issues. Decide upfront: log full request/response for errors only, or for a 5% sample. Log token counts and latency for 100% of requests. Document this decision in your runbook.
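Items 16 and 19 together amount to one structured log line per request. A minimal sketch: the four token fields mirror the API response's `usage` object, while `request_id` and `feature` are tags you supply yourself:

```python
import json
import time

def usage_log_record(request_id: str, feature: str, model: str, usage) -> str:
    """Flatten a response's usage object into one JSON log line for cost attribution."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "feature": feature,  # e.g. "summarization", "chat", "extraction"
        "model": model,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        # Cache fields can be absent when caching isn't in play; default to 0.
        "cache_read_input_tokens": getattr(usage, "cache_read_input_tokens", 0) or 0,
        "cache_creation_input_tokens": getattr(usage, "cache_creation_input_tokens", 0) or 0,
    }
    return json.dumps(record)
```

One JSON line per request is enough to answer "which feature drove last week's bill" with a single aggregation query, which is exactly the question item 19 exists to answer.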
Category 5: Content Safety (3 items)
These items apply to any application where end users can provide input that reaches the Claude API. They matter more for consumer-facing products and less for internal tooling with trusted users, but skipping them entirely is rarely appropriate.
21. Input validation and length limits before the API call. Validate user input before it reaches your prompt construction layer. Enforce a maximum input length (e.g., 10,000 characters for a chat message). Check for obvious injection patterns if your application has sensitive system prompt logic. This is not a replacement for Claude's built-in safety, but it reduces unnecessary API spend and prevents trivially abusive inputs.
22. Output filtering for your application's specific risk surface. Claude has strong built-in safety filters, but your application may have domain-specific requirements: no competitor names, no legal advice, no medical diagnoses. Implement a post-processing step that checks responses against your specific rules before returning them to users. A simple regex list is often enough for clear cases; a secondary classifier call (Haiku is cheap) handles nuanced ones.
23. Per-user rate limiting to prevent abuse. Implement request rate limiting at the user level, not just the application level. A single user who sends 500 requests in a minute can hit your API-level rate limits and degrade service for everyone else. Use a Redis counter or a simple token bucket. Typical limits: 20 requests/minute per free-tier user, 60 for paid. Adjust based on your observed p99 usage.
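Item 23's token bucket is a few lines of stdlib Python. A single-process sketch under the item's example limits; for multi-instance deployments you would back the same logic with Redis, as the item suggests:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-user rate limiter: `rate_per_minute` requests/min, bursting up to that size."""

    def __init__(self, rate_per_minute: int):
        self.rate = rate_per_minute
        # Each user starts with a full bucket the first time they are seen.
        self.tokens = defaultdict(lambda: float(rate_per_minute))
        self.last = defaultdict(time.monotonic)

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[user_id]
        self.last[user_id] = now
        # Refill proportionally to elapsed time, capped at the bucket size.
        self.tokens[user_id] = min(
            self.rate, self.tokens[user_id] + elapsed * self.rate / 60.0
        )
        if self.tokens[user_id] >= 1.0:
            self.tokens[user_id] -= 1.0
            return True
        return False  # reject (or queue) instead of forwarding to the API
```

The check runs before prompt construction, so a rejected request costs you nothing: no tokens, no API-level rate-limit pressure.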
Category 6: Performance (2 items)
These items improve response quality from the user's perspective without changing what you send to the API. They are the last items on this checklist because they require security, cost, reliability, and observability to be in place first.
24. Streaming enabled for interactive, user-facing responses. If users are waiting for a response that takes more than 2 seconds to generate, use streaming (`stream=True` in Python, `stream: true` in Node). Stream the response token-by-token to the client. This transforms the user experience from "waiting for a wall of text" to "watching it type" — perceived latency drops dramatically even when total time is the same.
25. Context window managed explicitly; old messages pruned. For multi-turn conversations, context accumulates. A 50-turn conversation can easily exceed 100K tokens, and you pay for every input token on every request. Implement a context management strategy: sliding window (keep last N turns), summarization (compress old turns into a summary), or importance-based pruning. Left unmanaged, context cost grows linearly with conversation length and can dominate your bill.
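Of item 25's three strategies, the sliding window is the simplest to ship first. A sketch, assuming the messages list alternates user/assistant roles as the Messages API requires; summarization and importance-based pruning are the natural next steps:

```python
def prune_messages(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep only the last `max_turns` user/assistant exchanges.

    The system prompt lives outside this list and is never pruned.
    """
    keep = max_turns * 2  # one turn = one user message + one assistant reply
    pruned = messages[-keep:]
    # The API requires the conversation to start with a user message; drop a
    # leading assistant message left over from slicing mid-turn.
    if pruned and pruned[0]["role"] == "assistant":
        pruned = pruned[1:]
    return pruned
```

With this in place, per-request input cost plateaus at roughly `max_turns` worth of tokens instead of growing with every turn of the conversation.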
Summary: The Minimum Viable Production Checklist
If you are under time pressure and need the essentials, these seven items cover the highest-probability failures:
- API key in environment variables
- Budget alert at 80% of expected spend
- Retry logic with exponential backoff on 429/529
- Token usage logged per request
- Max tokens capped on every API call
- Per-user rate limiting
- Streaming enabled for interactive UI
The other 18 items are important, but the above seven prevent the most common production fires in the first 30 days.
Going deeper on cost optimization? The P5 Cost Optimization Masterclass includes a complete prompt caching implementation guide, the Batch API pattern library, model routing spreadsheets, and the full case study behind the 85%-cost-reduction story. If you're spending — or planning to spend — over $100/month on Claude API, it pays for itself quickly. Get the masterclass →
Frequently Asked Questions
What is the most common mistake teams make when going to production with Claude API?
The most common mistake is defaulting every API call to Opus without a documented reason. Teams build with Opus during development because quality is highest, then ship to production without revisiting model selection. The result is 3–5x higher costs than necessary. The fix is simple: audit every `model=` parameter in your codebase, write down why each one is justified, and replace any Opus call that can't be defended with Sonnet or Haiku. See the model selection guide for a task-by-task breakdown.
How do I handle Claude API downtime in a production application?
Design for degradation, not perfection. The core pattern has three layers: (1) retries with exponential backoff for transient errors (usually resolves in 10–60 seconds), (2) a circuit breaker that stops sending requests during sustained outages and serves a cached or static response, and (3) a fallback model — if Sonnet is unavailable, try Haiku. Claude's historical uptime is high, but "high" is not "perfect," and user-facing applications should handle a 5-minute outage without surfacing an error page. The production architecture guide has circuit breaker code you can adapt.
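Layer (2) of that pattern is small enough to show inline. A minimal circuit breaker sketch using the thresholds from item 13 (5 consecutive failures, 60-second cooldown) as defaults; the injectable clock is there purely so the behavior can be tested without waiting:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; stay open for `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 60.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Half-open: let one probe request through; one more failure re-opens.
            self.opened_at = None
            self.failures = self.threshold - 1
            return True
        return False  # serve the cached or static fallback instead

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

The call site wraps each API request: check `allow_request()`, call the API, then report `record_success()` or `record_failure()`; when the circuit is open, return the degraded response without touching the API at all.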
How do I monitor Claude API costs without a dedicated observability stack?
Start with two things: the Anthropic Console's built-in usage dashboard (updated daily), and structured logging of `usage.input_tokens` and `usage.output_tokens` in your application. Even a CSV of daily token counts by feature gives you enough signal to spot anomalies. Once you are past $50/month, integrate with a proper APM tool — Datadog, Sentry, or even a simple Grafana dashboard — and set alerts on daily spend. The cost case study shows how one team went from zero observability to full cost attribution in a weekend.
Is the Batch API worth the additional implementation complexity?
Yes, for any workload that is not interactive. The Batch API is 50% off all model prices with a 24-hour completion window. The implementation is roughly 30 lines of code: create a batch, poll for completion, retrieve results. For a team spending $300/month on nightly processing jobs, that is $150/month back for a one-time engineering investment. The complexity is low and the savings are immediate. The only case where it is not worth it is if your business logic genuinely requires a synchronous response.
What should I do if my API key is accidentally committed to a public repository?
Act immediately: (1) revoke the key in the Anthropic Console — do not wait, do this first; (2) create a replacement key; (3) update all environments that used the old key; (4) check your Anthropic billing dashboard for unexpected usage in the past 24–48 hours; (5) use `git filter-repo` or GitHub's secret scanning tools to remove the key from your git history. Assume the key was compromised the moment it was public. Bots scan GitHub continuously for API keys across all providers.