Claude Batch API: 50% Discount for Async Workloads (2026 Guide)
The Claude Message Batches API gives you a flat 50% discount on input and output tokens for any request that can wait up to 24 hours for a response. No model or prompt changes are required: you run the same claude-3-5-haiku or claude-3-5-sonnet, just submitting work in bulk rather than waiting for each response inline.
This is one of the three levers in the Claude API cost reduction stack. Prompt caching handles repeated context. Model tiering handles the request routing. Batch API handles the async workloads you're probably already running synchronously.
What qualifies for Batch API
The core constraint: you submit requests and poll for results later. There is no streaming response, and no way to block synchronously for an answer within a few seconds.
Workloads that fit naturally:
- Nightly report generation — summarize usage logs, generate weekly digests, create reports across user accounts
- Bulk document processing — extract structure from uploaded PDFs, classify support tickets, tag content
- Batch embeddings or classification — categorize a product catalog, run sentiment analysis on customer feedback
- Evaluation runs — test prompt changes against a benchmark dataset
- Data transformation pipelines — reformat, clean, or enrich data at scale
Workloads that don't fit:
- Anything where the user is waiting for a response (chat, autocomplete, search)
- Real-time classification in a request/response flow
- Any pipeline where step N depends on step N-1's result in under a minute
Pricing: the actual numbers
For claude-3-5-sonnet as of April 2026:
| Mode | Input | Output |
|---|---|---|
| Standard | $3.00 / 1M tokens | $15.00 / 1M tokens |
| Batch API | $1.50 / 1M tokens | $7.50 / 1M tokens |
The 50% discount applies equally to input and output tokens. Prompt caching discounts stack on top: if you also use a cached prefix, cached input tokens are billed at the reduced cache-read rate and the uncached remainder at the normal input rate, and the 50% batch discount then applies to both.
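To make the stacking concrete, here's a back-of-the-envelope sketch. The $3.00 input rate comes from the table above; the $0.30 cache-read rate is an assumption based on the usual 10%-of-input cache-read price, so verify both against Anthropic's current pricing page:

```python
# Cost of 1M input tokens under each billing mode (dollars).
INPUT_RATE = 3.00        # $/1M input tokens, standard (from the table above)
CACHE_READ_RATE = 0.30   # $/1M cached input tokens (assumed 10% of input rate)
BATCH_DISCOUNT = 0.5     # Batch API halves the bill

def input_cost(tokens_m, cached_fraction, batched):
    """Dollar cost for tokens_m million input tokens."""
    cached = tokens_m * cached_fraction
    fresh = tokens_m - cached
    cost = fresh * INPUT_RATE + cached * CACHE_READ_RATE
    return cost * (BATCH_DISCOUNT if batched else 1.0)

print(input_cost(1.0, 0.0, batched=False))  # standard price
print(input_cost(1.0, 0.0, batched=True))   # batch discount only
print(input_cost(1.0, 0.8, batched=True))   # batch plus 80% cache hits
```

With an 80% cache hit rate and batching, a million input tokens drops from $3.00 to $0.42, roughly a 7x reduction.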
Implementation
The Batch API is a separate endpoint. You submit a list of requests as a single batch, get a batch ID back, and poll until all requests complete.
Submit a batch
```python
import anthropic

client = anthropic.Anthropic()

# Prepare your requests
requests = [
    {
        "custom_id": f"report-{user_id}",
        "params": {
            "model": "claude-3-5-haiku-20241022",
            "max_tokens": 1024,
            "messages": [
                {
                    "role": "user",
                    "content": f"Summarize this week's activity for user {user_id}: {activity_data}",
                }
            ],
        },
    }
    for user_id, activity_data in user_activity_map.items()
]

# Submit the batch
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}")
print(f"Request count: {batch.request_counts.processing}")
```
Poll for completion
```python
import time

def wait_for_batch(client, batch_id, poll_interval=60):
    """Poll until the batch completes. Returns the final batch object."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        counts = batch.request_counts
        done = counts.succeeded + counts.errored
        print(f"Progress: {done}/{counts.processing + done}")
        time.sleep(poll_interval)

batch = wait_for_batch(client, batch.id)
```
Retrieve results
```python
# Stream results — don't load everything into memory at once
for result in client.messages.batches.results(batch.id):
    custom_id = result.custom_id
    if result.result.type == "succeeded":
        message = result.result.message
        content = message.content[0].text
        # Process content...
    elif result.result.type == "errored":
        error = result.result.error
        print(f"Request {custom_id} failed: {error}")
```
TypeScript/Node.js version
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Submit
const batch = await client.messages.batches.create({
  requests: userIds.map((userId) => ({
    custom_id: `report-${userId}`,
    params: {
      model: "claude-3-5-haiku-20241022",
      max_tokens: 1024,
      messages: [
        {
          role: "user",
          content: `Generate weekly report for user ${userId}`,
        },
      ],
    },
  })),
});

// Poll
let status = await client.messages.batches.retrieve(batch.id);
while (status.processing_status !== "ended") {
  await new Promise((r) => setTimeout(r, 30_000));
  status = await client.messages.batches.retrieve(batch.id);
}

// Results
for await (const result of await client.messages.batches.results(batch.id)) {
  if (result.result.type === "succeeded") {
    const text = result.result.message.content[0];
    if (text.type === "text") {
      console.log(result.custom_id, text.text);
    }
  }
}
```
Key constraints to know before you ship
24-hour window. Batches expire after 24 hours. In practice most complete in 1–4 hours, but design your polling to handle the full window.
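A deadline-aware variant of the polling loop might look like this (a sketch; the 24-hour `max_wait` default mirrors the batch expiry window, and what to do on timeout is up to your pipeline):

```python
import time

def wait_with_deadline(client, batch_id, poll_interval=60, max_wait=24 * 3600):
    """Poll for batch completion with a hard deadline.

    Defaults the deadline to the full 24-hour batch window so a stalled
    batch can't hang the job forever.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        time.sleep(poll_interval)
    raise TimeoutError(f"Batch {batch_id} did not finish within the window")
```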
100,000 requests or 256MB per batch. Large batches need splitting. If you're processing 500K documents, that's 5+ separate batch submissions.
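A splitting helper can be sketched like this (assumptions: `all_requests` is an in-memory list of request dicts, and only the 100,000-request cap is enforced; a production version should also track serialized size against the 256MB cap):

```python
def chunked(items, size=100_000):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def submit_in_chunks(client, all_requests):
    """Split an oversized workload into multiple batch submissions."""
    batch_ids = []
    for chunk in chunked(all_requests):
        batch = client.messages.batches.create(requests=chunk)
        batch_ids.append(batch.id)
    return batch_ids
```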
No streaming. You get complete responses only, not deltas. If you need to stream for UX reasons, Batch API is not the right tool.
Error handling per-request. Some requests in a batch can fail while others succeed. Always check result.result.type and handle "errored" results explicitly. The most common failure modes are invalid_request_error (your prompt is malformed) and overloaded_error (retry that request).
custom_id uniqueness. The custom_id is how you match results back to inputs. It must be unique within a batch. A meaningful ID (like user-123-2026-04-26) makes debugging much easier than an opaque UUID does.
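Putting the last two points together, a minimal retry pass might collect errored custom_ids and resubmit just those requests (a sketch; it assumes you kept the original request params in a dict keyed by custom_id):

```python
def resubmit_errored(client, batch_id, requests_by_id):
    """Collect errored requests from a finished batch and resubmit them.

    requests_by_id maps each custom_id to its original request dict, so
    retried results still map back to the same inputs.
    """
    retry = []
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "errored":
            retry.append(requests_by_id[result.custom_id])
    if not retry:
        return None  # nothing failed; no retry batch needed
    return client.messages.batches.create(requests=retry)
```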
Combining with prompt caching
If you have a long system prompt shared across all batch requests (a document template, detailed instructions, few-shot examples), you can cache it even within batch requests:
```python
requests = [
    {
        "custom_id": f"doc-{doc_id}",
        "params": {
            "model": "claude-3-5-sonnet-20241022",
            "max_tokens": 2048,
            "system": [
                {
                    "type": "text",
                    "text": long_system_prompt,
                    "cache_control": {"type": "ephemeral"},
                }
            ],
            "messages": [{"role": "user", "content": document_content}],
        },
    }
    for doc_id, document_content in documents.items()
]
```
The cache write happens on the first request that hits this prefix. Subsequent requests in the same batch get the cache hit discount. Since batches process in parallel, cache hit rate depends on batch size and timing — larger batches get higher hit rates.
The economics on a real workload
Scenario: Nightly report job, 10,000 user summaries, claude-3-5-haiku, 500 input tokens + 300 output tokens per request.
| Metric | Standard | Batch API |
|---|---|---|
| Input tokens | 5M | 5M |
| Output tokens | 3M | 3M |
| Input cost | $4.00 | $2.00 |
| Output cost | $12.00 | $6.00 |
| Total/night | $16.00 | $8.00 |
| Annual savings | — | $2,920 |
For a workload that's already running nightly, switching the endpoint saves roughly $2,900/year with zero changes to your model, prompts, or logic (using claude-3-5-haiku list prices of $0.80/$4.00 per million input/output tokens).
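The same arithmetic generalizes to any nightly job. A small calculator, with the per-million-token rates passed in explicitly since list prices change (the haiku rates in the example are an assumption to verify against Anthropic's current pricing page):

```python
def nightly_batch_savings(requests_per_night, in_tokens, out_tokens,
                          input_rate, output_rate, days=365):
    """Compare standard vs. Batch API cost for a recurring nightly job.

    Rates are dollars per million tokens at the standard (non-batch)
    price; the Batch API bills both input and output at 50% of that.
    Returns (standard_per_night, batch_per_night, annual_savings).
    """
    in_m = requests_per_night * in_tokens / 1_000_000
    out_m = requests_per_night * out_tokens / 1_000_000
    standard = in_m * input_rate + out_m * output_rate
    batch = standard * 0.5
    return standard, batch, (standard - batch) * days

# The scenario above: 10K summaries/night, 500 in + 300 out tokens each,
# at assumed haiku rates of $0.80 in / $4.00 out per million tokens.
std, bat, annual = nightly_batch_savings(10_000, 500, 300,
                                         input_rate=0.80, output_rate=4.00)
```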
When NOT to use Batch API
- Real-time user-facing features (obvious)
- Debugging — batch results are harder to inspect than synchronous responses
- Small one-off jobs (< 100 requests) where the polling overhead isn't worth it
- Any workload with strict SLAs under 10 minutes
For everything else that runs on a schedule or in the background: switch to Batch API and bank the 50%.
The full cost optimization playbook — including exact implementation order, real production benchmarks, and the 6-tab Excel calculator — is covered in the Claude API Cost Optimization Masterclass. For the break-even math on prompt caching (a complementary technique), see prompt caching break-even analysis.
Frequently Asked Questions
How long does the Claude Batch API take to process requests?
Most batches complete in 1–4 hours, though the maximum window is 24 hours. Anthropic does not guarantee a specific processing time — batch jobs are deprioritized behind real-time traffic. Design your system to poll until completion and handle the full 24-hour window as an edge case.
What is the maximum batch size I can submit?
Each batch is capped at 100,000 requests or 256 MB, whichever comes first. For larger workloads, split your data into multiple batch submissions. You can have multiple batches in flight at once, subject to your organization's Batch API rate limits.
Can I cancel a batch after submitting it?
Yes. Call client.messages.batches.cancel(batch_id) to cancel a pending or in-progress batch. Requests that have already completed will still return results; only pending requests are cancelled. You will not be charged for requests that were cancelled before processing.
Do Batch API requests count against my real-time rate limits?
No. Batch requests are processed asynchronously on a separate infrastructure path and are governed by separate Batch API rate limits, so they do not consume your real-time RPM or TPM quota. This makes the Batch API particularly useful for high-volume jobs: you can run large batches without affecting your real-time API traffic.
Does prompt caching work inside batch requests?
Yes. You can include cache_control blocks in the system prompt of batch requests just as you would in real-time requests. The cache write happens on the first request that encounters a given prefix, and subsequent requests in the same batch benefit from cache hits at the reduced cache-read price, stacking on top of the 50% batch discount.
Take It Further
Claude API Cost Optimization Masterclass — Cut your Claude API bill by 60–90% without sacrificing quality. 12 production deployments analyzed. The concrete order-of-operations: prompt caching, model tiering, Batch API, token compression.
120-page PDF + 6-sheet Excel cost calculator. Real results: $2,100 → $187/month on customer support agent.
→ Get Cost Optimization Masterclass — $59
30-day money-back guarantee. Instant download.