
Claude Batch API: 50% Discount for Async Workloads (2026 Guide)

The Claude Message Batches API gives a flat 50% discount on any workload that doesn't need an immediate response. Here's when to use it and how to implement it.

The Claude Message Batches API gives you a flat 50% discount on input and output tokens for any request that can wait up to 24 hours for a response. Nothing about your model or prompts changes: you're running the same claude-3-5-haiku or claude-3-5-sonnet, just submitting work in bulk rather than waiting for each response inline.

This is one of the three levers in the Claude API cost reduction stack. Prompt caching handles repeated context. Model tiering handles the request routing. Batch API handles the async workloads you're probably already running synchronously.

What qualifies for Batch API

The core constraint: you submit requests and poll for results. You don't get a streaming response, and you can't wait synchronously for an answer in under a few seconds.

Workloads that fit naturally:

- Nightly or weekly report generation (like the 10,000-user summary job below)
- Bulk document summarization, classification, or enrichment over large corpora
- Any scheduled or background job that's currently calling the API synchronously

Workloads that don't fit:

- User-facing chat or anything where someone is waiting on the response
- Features that need streaming output
- Requests with latency requirements measured in seconds or minutes

Pricing: the actual numbers

For claude-3-5-sonnet as of April 2026:

Mode        Input               Output
Standard    $3.00 / 1M tokens   $15.00 / 1M tokens
Batch API   $1.50 / 1M tokens   $7.50 / 1M tokens

The 50% discount applies equally to input and output tokens. Prompt caching discounts stack on top: cached prefix tokens bill at the cache-read rate, and the batch discount then applies to cache reads, uncached input, and output alike, so the two discounts compound.
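
As a rough illustration of how the two discounts compound, here is a back-of-the-envelope sketch using the claude-3-5-sonnet rates from the table above and the standard 0.1x cache-read multiplier (the token counts are made up):

INPUT_PER_MTOK = 3.00        # claude-3-5-sonnet standard input rate
OUTPUT_PER_MTOK = 15.00      # standard output rate
BATCH_DISCOUNT = 0.5         # 50% off everything submitted via the Batch API
CACHE_READ_MULTIPLIER = 0.1  # cache reads bill at 10% of the input rate

def batch_request_cost(cached_input, fresh_input, output):
    """Dollar cost of one batch request with a cached prefix (counts in raw tokens)."""
    cached = cached_input / 1e6 * INPUT_PER_MTOK * CACHE_READ_MULTIPLIER
    fresh = fresh_input / 1e6 * INPUT_PER_MTOK
    out = output / 1e6 * OUTPUT_PER_MTOK
    return (cached + fresh + out) * BATCH_DISCOUNT

# 4,000-token cached prefix + 500 fresh input tokens + 300 output tokens
print(f"${batch_request_cost(4_000, 500, 300):.5f} per request")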

Implementation

The Batch API is a separate endpoint. You submit a list of requests as a single batch, get a batch ID back, and poll until all requests complete.

Submit a batch

import anthropic

client = anthropic.Anthropic()

# Prepare your requests. user_activity_map is a {user_id: activity_data}
# dict assembled upstream of this snippet.
requests = [
    {
        "custom_id": f"report-{user_id}",
        "params": {
            "model": "claude-3-5-haiku-20241022",
            "max_tokens": 1024,
            "messages": [
                {
                    "role": "user",
                    "content": f"Summarize this week's activity for user {user_id}: {activity_data}"
                }
            ]
        }
    }
    for user_id, activity_data in user_activity_map.items()
]

# Submit the batch
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}")
print(f"Request count: {batch.request_counts.processing}")

Poll for completion

import time

def wait_for_batch(client, batch_id, poll_interval=60):
    """Poll until batch completes. Returns the batch object."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        
        if batch.processing_status == "ended":
            return batch
        
        counts = batch.request_counts
        print(f"Progress: {counts.succeeded + counts.errored}/{counts.processing + counts.succeeded + counts.errored}")
        time.sleep(poll_interval)

batch = wait_for_batch(client, batch.id)

Retrieve results

# Stream results — don't load everything into memory at once
for result in client.messages.batches.results(batch.id):
    custom_id = result.custom_id
    
    if result.result.type == "succeeded":
        message = result.result.message
        content = message.content[0].text
        # Process content...
    elif result.result.type == "errored":
        error = result.result.error
        print(f"Request {custom_id} failed: {error}")

TypeScript/Node.js version

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Submit
const batch = await client.messages.batches.create({
  requests: userIds.map((userId) => ({
    custom_id: `report-${userId}`,
    params: {
      model: "claude-3-5-haiku-20241022",
      max_tokens: 1024,
      messages: [
        {
          role: "user",
          content: `Generate weekly report for user ${userId}`,
        },
      ],
    },
  })),
});

// Poll
let status = await client.messages.batches.retrieve(batch.id);
while (status.processing_status !== "ended") {
  await new Promise((r) => setTimeout(r, 30_000));
  status = await client.messages.batches.retrieve(batch.id);
}

// Results
for await (const result of await client.messages.batches.results(batch.id)) {
  if (result.result.type === "succeeded") {
    const text = result.result.message.content[0];
    if (text.type === "text") {
      console.log(result.custom_id, text.text);
    }
  }
}

Key constraints to know before you ship

24-hour window. Batches expire after 24 hours. In practice most complete in 1–4 hours, but design your polling to handle the full window.

100,000 requests or 256 MB per batch. Large batches need splitting. If you're processing 500K documents, that's at least five separate batch submissions (a chunking sketch follows at the end of this list).

No streaming. You get complete responses only, not deltas. If you need to stream for UX reasons, Batch API is not the right tool.

Error handling per-request. Some requests in a batch can fail while others succeed. Always check result.result.type and handle "errored" results explicitly. The most common failure modes are invalid_request_error (your prompt is malformed) and overloaded_error (retry that request).

custom_id uniqueness. The custom_id is how you match results back to inputs. It must be unique within a batch. A meaningful ID (like user-123-2026-04-26) makes debugging much easier than an opaque UUID.
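
To stay under the per-batch caps, a large workload simply becomes several batch submissions. A minimal sketch, assuming an all_documents list and a build_request() helper that returns the same request dicts shown earlier (both are illustrative names, not part of the SDK):

def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

batch_ids = []
# 50,000 per batch stays well under the 100,000-request cap; lower it
# further if large payloads push a chunk past the 256 MB limit.
for chunk in chunked(all_documents, size=50_000):
    requests = [build_request(doc) for doc in chunk]
    batch = client.messages.batches.create(requests=requests)
    batch_ids.append(batch.id)

print(f"Submitted {len(batch_ids)} batches")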

Combining with prompt caching

If you have a long system prompt shared across all batch requests (a document template, detailed instructions, few-shot examples), you can cache it even within batch requests:

requests = [
    {
        "custom_id": f"doc-{doc_id}",
        "params": {
            "model": "claude-3-5-sonnet-20241022",
            "max_tokens": 2048,
            "system": [
                {
                    "type": "text",
                    "text": long_system_prompt,
                    "cache_control": {"type": "ephemeral"}
                }
            ],
            "messages": [{"role": "user", "content": document_content}]
        }
    }
    for doc_id, document_content in documents.items()
]

The cache write happens on the first request that hits this prefix, and requests processed after that write get the cache-read price. Because requests in a batch run concurrently, cache hits are best-effort: some requests can start before the cache entry exists, so expect a high but not perfect hit rate. In practice, the larger the batch, the more requests land after the write and the better the overall hit rate.
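
To confirm the cache is actually being hit, you can tally the cache fields that come back on each successful result's usage object. A quick sketch, reusing the batch and client from the examples above:

# Sum cache reads vs. writes across a completed batch to see the real hit rate.
cache_reads = cache_writes = 0
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        usage = result.result.message.usage
        cache_reads += usage.cache_read_input_tokens or 0
        cache_writes += usage.cache_creation_input_tokens or 0

print(f"Cache-read tokens: {cache_reads:,}  cache-write tokens: {cache_writes:,}")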

The economics on a real workload

Scenario: Nightly report job, 10,000 user summaries, claude-3-5-haiku, 500 input tokens + 300 output tokens per request.

Metric           Standard   Batch API
Input tokens     5M         5M
Output tokens    3M         3M
Input cost       $4.00      $2.00
Output cost      $12.00     $6.00
Total/night      $16.00     $8.00
Annual savings   $2,920

For a workload that's already running nightly, switching the endpoint saves nearly $3,000/year with zero changes to your model, prompts, or logic.
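
If you want to sanity-check those numbers or rerun them with your own volumes, the arithmetic is a few lines. A sketch using the claude-3-5-haiku list prices assumed in the table ($0.80 / 1M input, $4.00 / 1M output):

REQUESTS = 10_000
INPUT_TOKENS = 500 * REQUESTS   # 5M
OUTPUT_TOKENS = 300 * REQUESTS  # 3M

def nightly_cost(input_rate, output_rate, discount=1.0):
    """Cost in dollars for one night's run at the given per-1M-token rates."""
    return (INPUT_TOKENS / 1e6 * input_rate + OUTPUT_TOKENS / 1e6 * output_rate) * discount

standard = nightly_cost(0.80, 4.00)
batch = nightly_cost(0.80, 4.00, discount=0.5)
print(f"Standard: ${standard:.2f}/night  Batch: ${batch:.2f}/night")
print(f"Annual savings: ${(standard - batch) * 365:,.0f}")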

When NOT to use Batch API

- You need the response while a user is waiting (chat, interactive features)
- You need streaming output for UX reasons
- You need a guaranteed turnaround faster than the 24-hour batch window

For everything else that runs on a schedule or in the background: switch to Batch API and bank the 50%.


The full cost optimization playbook — including exact implementation order, real production benchmarks, and the 6-tab Excel calculator — is covered in the Claude API Cost Optimization Masterclass. For the break-even math on prompt caching (a complementary technique), see prompt caching break-even analysis.

All code examples tested against the current Anthropic Python SDK (v0.40+) and TypeScript SDK (v0.36+).

Frequently Asked Questions

How long does the Claude Batch API take to process requests?

Most batches complete in 1–4 hours, though the maximum window is 24 hours. Anthropic does not guarantee a specific processing time — batch jobs are deprioritized behind real-time traffic. Design your system to poll until completion and handle the full 24-hour window as an edge case.

What is the maximum batch size I can submit?

Each batch is capped at 100,000 requests or 256 MB, whichever comes first. For larger workloads, split your data into multiple batch submissions. You can have multiple batches in flight at once, subject to your organization's Batch API rate limits.

Can I cancel a batch after submitting it?

Yes. Call client.messages.batches.cancel(batch_id) to cancel a pending or in-progress batch. Requests that have already completed will still return results; only pending requests are cancelled. You will not be charged for requests that were cancelled before processing.
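
A minimal sketch of the cancel-and-drain flow, reusing the client and the time import from the Python examples above (batch_id is the ID saved at submission):

# Cancel, wait for the batch to settle, then keep whatever finished.
client.messages.batches.cancel(batch_id)

batch = client.messages.batches.retrieve(batch_id)
while batch.processing_status != "ended":
    time.sleep(10)
    batch = client.messages.batches.retrieve(batch_id)

for result in client.messages.batches.results(batch_id):
    if result.result.type == "succeeded":
        pass  # keep the completed work
    elif result.result.type == "canceled":
        pass  # resubmit later if still needed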

Do Batch API requests count against my real-time rate limits?

No. Batch requests are processed asynchronously on a separate infrastructure path and do not consume your RPM or TPM rate limits. This makes the Batch API particularly useful for high-volume jobs — you can run large batches without affecting your real-time API quota.

Does prompt caching work inside batch requests?

Yes. You can include cache_control blocks in the system prompt of batch requests just as you would in real-time requests. The cache write happens on the first request that encounters a given prefix, and subsequent requests in the same batch benefit from cache hits at the reduced cache-read price, stacking on top of the 50% batch discount.


Take It Further

Claude API Cost Optimization Masterclass — Cut your Claude API bill by 60–90% without sacrificing quality. 12 production deployments analyzed. The concrete order-of-operations: prompt caching, model tiering, Batch API, token compression.

120-page PDF + 6-sheet Excel cost calculator. Real results: $2,100 → $187/month on customer support agent.

→ Get Cost Optimization Masterclass — $59

30-day money-back guarantee. Instant download.

AI Disclosure: Drafted with Claude Code. API pricing and behavior verified against platform.claude.com documentation as of 2026-04-26.
