Claude API: Streaming vs Batch — Which Saves More (2026)
Batch API is 50% cheaper than real-time inference. Streaming adds perceived speed at zero cost premium. The decision rule is simple: if the user is watching, stream. If no one is waiting, batch. Teams that get this backwards, sending offline jobs through real-time endpoints, pay up to 2× what the same work costs on the Batch API.
The cost difference, in one table
| Mode | Input / 1M | Output / 1M | When billed |
|---|---|---|---|
| Real-time (default) | $3.00 | $15.00 | Per request, immediately |
| Streaming | $3.00 | $15.00 | Same as real-time |
| Batch API | $1.50 | $7.50 | After batch completes |
Streaming costs exactly the same as non-streaming real-time inference: you pay the same per token whether you receive them word-by-word or all at once. The only difference is latency UX.
Batch API is exactly 50% off across all models and all token types. Cache read discounts stack on top.
A $1,000/month real-time workload costs $500/month on Batch API, if the task doesn't require an immediate response.
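The pricing table can be turned into a small calculator. The sketch below is illustrative only: the rates are copied from the table above, while `monthly_cost` and the example workload (100,000 requests of 2,000 input and 400 output tokens) are hypothetical and not taken from any real deployment.

```python
# Per-million-token rates in USD, copied from the table above.
RATES = {
    "realtime": {"input": 3.00, "output": 15.00},
    "streaming": {"input": 3.00, "output": 15.00},
    "batch": {"input": 1.50, "output": 7.50},
}


def monthly_cost(mode: str, requests: int, input_tokens: int, output_tokens: int) -> float:
    """Return the monthly cost in USD for `requests` calls of the given shape."""
    rate = RATES[mode]
    input_cost = requests * input_tokens * rate["input"] / 1_000_000
    output_cost = requests * output_tokens * rate["output"] / 1_000_000
    return round(input_cost + output_cost, 2)


# Hypothetical workload: 100,000 requests/month, 2,000 input + 400 output tokens each.
for mode in RATES:
    print(f"{mode:>9}: ${monthly_cost(mode, 100_000, 2_000, 400):,.2f}")
```

For this made-up workload, real-time and streaming both come to $1,200.00 and batch to $600.00, which is the same halving described above.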
When streaming makes sense
Streaming delivers tokens to the client as they are generated. The model is still running the same computation — you just receive results incrementally instead of waiting for completion.
Use streaming when:
- A human is reading the output in real time (chat, live completions, code editor autocomplete)
- Time-to-first-token (TTFT) matters to user experience — even a 2-second delay feels slow in interactive contexts
- You're building a UI that should feel responsive (streaming at 30–80 tokens/sec feels fast; waiting 15 seconds for 500 tokens feels broken)
- The conversation is multi-turn and the user may reply mid-response
What streaming does NOT do:
- Reduce total latency for the full response (the model generates at the same speed)
- Lower cost (same price as non-streaming)
- Improve output quality
The win is purely perceptual: the wait before the user sees any output drops from ~2–15s (full-response wait) to ~0.3–0.8s (first token arrives quickly).
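A rough latency model makes the distinction concrete. This is a simplified sketch, assuming generation starts after a fixed time-to-first-token and then proceeds at a constant rate; the function names and the example numbers (0.5 s, 500 tokens, 40 tokens/sec) are hypothetical.

```python
def seconds_until_first_output(ttft_s: float, output_tokens: int,
                               tokens_per_s: float, streaming: bool) -> float:
    """Seconds before the user sees anything on screen."""
    if streaming:
        # The first token is displayed as soon as it is generated.
        return ttft_s
    # Nothing is displayed until the whole response has been generated.
    return ttft_s + output_tokens / tokens_per_s


def seconds_until_complete(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Seconds until the full response exists; identical in both modes."""
    return ttft_s + output_tokens / tokens_per_s


# Hypothetical numbers: 0.5 s to first token, 500 output tokens at 40 tokens/sec.
print(seconds_until_first_output(0.5, 500, 40, streaming=True))   # 0.5
print(seconds_until_first_output(0.5, 500, 40, streaming=False))  # 13.0
print(seconds_until_complete(0.5, 500, 40))                       # 13.0
```

Under these assumptions the full response takes 13 seconds either way; streaming only changes when the user starts seeing it.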
When Batch API makes sense
Batch API queues requests and processes them asynchronously, returning results within 24 hours (typically 1–4 hours in practice for standard workloads).
Use Batch API when:
- No user is waiting for the result right now
- Processing large volumes overnight or on a schedule (nightly classification, daily report generation, weekly data enrichment)
- Generating content assets in advance (product descriptions, alt text, email drafts queued for review)
- Running evals against a labeled dataset — 100+ samples run cheaply as a batch
- Any pipeline step that feeds the next step hours later, not seconds later
What Batch API does NOT do:
- Return results in real time (24h SLA, typically 1–4h)
- Support streaming output (results arrive as a completed file, not incremental tokens)
- Support interactive multi-turn conversation (each batch item produces a single response; earlier turns can be included in its messages, but a follow-up turn needs a new request)
Illustrative cost comparison
Three illustrative workloads based on published Anthropic pricing, May 2026:
1 — Email draft generation
Task: Generate first-draft replies to 25,000 customer support tickets/month. Avg 1,500 input, 250 output tokens. Costs below use the Sonnet rates from the table above, applied to 37.5M input and 6.25M output tokens per month.
| Mode | Cost/month | Latency |
|---|---|---|
| Streaming (real-time) | $206 | 2–3s TTFT |
| Batch API | $103 | Results available next morning |
Winner: Hybrid. Simple tickets (FAQ-answerable) → streaming for the agent to present immediately. Complex tickets (needing research) → batch overnight. Net: between $103 and $206/month, depending on the share of tickets routed to batch, with no quality tradeoff.
2 — Product description generation
Task: Generate alt text and short descriptions for 10,000 new product listings weekly (about 40,000/month). 800 input, 150 output tokens. Costs below use the Sonnet rates from the table above, applied to 32M input and 6M output tokens per month.
| Mode | Cost/month | Notes |
|---|---|---|
| Real-time | $186 | No user is waiting |
| Batch API | $93 | Run nightly, results in catalog by morning |
Winner: Batch. Descriptions don't need to be ready in seconds; they need to be ready before the product goes live. Batch saves 50% with no quality or UX tradeoff.
3 — Eval runs
Task: Score 500 model outputs against a rubric for a weekly regression eval.
| Mode | Cost/run | Time |
|---|---|---|
| Real-time | $0.90 | ~12 minutes |
| Batch API | $0.45 | ~45 minutes |
Winner: Batch. Evals run overnight in CI — 45 minutes is fine. Saves 50% on every regression run.
The Batch API in practice
Python: submit a batch
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder inputs; replace with your own list of texts
texts = ["First document to summarize.", "Second document to summarize."]

# Build batch requests, one per input, each with a unique custom_id
requests = [
    {
        "custom_id": f"item-{i}",
        "params": {
            "model": "claude-sonnet-4-5",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": f"Summarize: {text}"}],
        },
    }
    for i, text in enumerate(texts)
]

# Submit
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")
Poll and retrieve results
import time

# Reuses the `client` created in the submit example above.

def wait_for_batch(batch_id: str, poll_interval: int = 60) -> list:
    # Poll until processing has ended
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        print(f"Status: {batch.processing_status}; waiting {poll_interval}s")
        time.sleep(poll_interval)

    # Stream results and record one entry per custom_id
    results = []
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            text = result.result.message.content[0].text
            results.append({"id": result.custom_id, "output": text})
        elif result.result.type == "errored":
            results.append({"id": result.custom_id, "error": result.result.error})
        else:
            # "canceled" and "expired" results carry no error object
            results.append({"id": result.custom_id, "error": result.result.type})
    return results
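Because every result carries the custom_id it was submitted with, outputs can be matched back to their inputs without depending on the order in which results arrive. The sketch below assumes the `item-{i}` ID scheme and the result dictionaries built in the examples above; `join_by_custom_id` is a hypothetical helper, not part of the SDK.

```python
def join_by_custom_id(inputs: list, results: list) -> list:
    """Pair each input text with its batch result, using the item-{i} IDs."""
    by_id = {r["id"]: r for r in results}
    joined = []
    for i, text in enumerate(inputs):
        r = by_id.get(f"item-{i}")
        joined.append({
            "input": text,
            "output": r.get("output") if r else None,
            # A result that never came back is flagged as "missing".
            "error": r.get("error") if r else "missing",
        })
    return joined


# Example with results arriving out of order and one item absent.
inputs = ["doc A", "doc B", "doc C"]
results = [
    {"id": "item-1", "output": "summary of B"},
    {"id": "item-0", "output": "summary of A"},
]
for row in join_by_custom_id(inputs, results):
    print(row)
```

Keeping the join keyed on custom_id also makes it easy to collect failed or missing items and resubmit only those in a follow-up batch.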
TypeScript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function submitBatch(texts: string[]): Promise<string> {
  // One request per input, each with a unique custom_id
  const requests = texts.map((text, i) => ({
    custom_id: `item-${i}`,
    params: {
      model: "claude-sonnet-4-5" as const,
      max_tokens: 256,
      messages: [{ role: "user" as const, content: `Summarize: ${text}` }],
    },
  }));
  const batch = await client.messages.batches.create({ requests });
  return batch.id;
}

async function retrieveBatch(batchId: string) {
  // Poll until done
  while (true) {
    const batch = await client.messages.batches.retrieve(batchId);
    if (batch.processing_status === "ended") break;
    await new Promise((r) => setTimeout(r, 60_000));
  }
  const results: { id: string; output: string }[] = [];
  for await (const result of await client.messages.batches.results(batchId)) {
    if (result.result.type === "succeeded") {
      const block = result.result.message.content[0];
      // Keep the text itself, matching the Python example
      if (block?.type === "text") {
        results.push({ id: result.custom_id, output: block.text });
      }
    }
  }
  return results;
}
The decision rule
Ask one question: Is a user waiting for this response right now?
User watching?
├── YES → Stream (real-time, streaming enabled)
│   └── Why: TTFT matters; cost is the same as non-streaming
└── NO → How soon are results needed?
    ├── Within minutes → Real-time (non-streaming)
    ├── Can wait hours → Batch (50% savings)
    └── Can wait hours and volume > 100 requests → Definitely batch
Streaming vs non-streaming (same cost): Because the price is identical, choose between them on implementation grounds. If you're not displaying tokens progressively, streaming brings no benefit, and non-streaming is simpler to implement.
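The rule is small enough to encode as a routing function. This is one possible sketch rather than a prescribed implementation: `choose_mode`, its parameters, and the default of treating the 24-hour batch window as the safe turnaround are all assumptions made for illustration.

```python
from typing import Optional

# Assumed planning figure: the 24-hour batch window mentioned in this article.
BATCH_WINDOW_S = 24 * 60 * 60


def choose_mode(user_waiting: bool, deadline_s: Optional[float] = None,
                batch_window_s: float = BATCH_WINDOW_S) -> str:
    """Return "stream", "realtime", or "batch" for one request.

    deadline_s is how long the caller can wait for the result, in seconds;
    None means there is no deadline.
    """
    if user_waiting:
        return "stream"
    if deadline_s is not None and deadline_s < batch_window_s:
        # Nobody is watching, but the result is needed sooner than
        # the batch window allows us to count on.
        return "realtime"
    return "batch"


print(choose_mode(user_waiting=True))                     # stream
print(choose_mode(user_waiting=False, deadline_s=300))    # realtime
print(choose_mode(user_waiting=False))                    # batch
```

The default is deliberately conservative. A team comfortable planning around the typical turnaround rather than the full window could pass a smaller `batch_window_s`.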
Stacking discounts: Batch + Cache
Prompt caching and Batch API stack multiplicatively:
| Optimization | Savings |
|---|---|
| Batch API alone | 50% off input + output |
| Prompt caching alone | 90% off cached input tokens |
| Both together | ~70–80% off total cost |
Example: 50,000 requests/month with a 5,000-token system prompt (cached), 1,000-token unique input, 300-token output on Sonnet.
Without optimizations: (5000 + 1000) × $3/1M × 50000 + 300 × $15/1M × 50000 = $900 + $225 = $1,125/month
With cache + batch:
- System prompt cached: 5,000 × $0.30/1M × 50,000 = $75
- Unique input, batch: 1,000 × $1.50/1M × 50,000 = $75
- Output, batch: 300 × $7.50/1M × 50,000 = $112.50
- Total: $262.50/month — 77% less than baseline
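The arithmetic above can be reproduced in a few lines. The sketch uses only the token counts and per-million rates from this example; like the example, it assumes every request hits the cache and leaves out the cost of writing the cache in the first place. The `usd` helper is a hypothetical name.

```python
REQUESTS = 50_000
SYSTEM_TOKENS, UNIQUE_TOKENS, OUTPUT_TOKENS = 5_000, 1_000, 300


def usd(tokens_per_request: int, rate_per_million: float) -> float:
    """Monthly cost in USD for one token category across all requests."""
    return tokens_per_request * REQUESTS * rate_per_million / 1_000_000


# Baseline: everything at standard real-time rates.
baseline = usd(SYSTEM_TOKENS + UNIQUE_TOKENS, 3.00) + usd(OUTPUT_TOKENS, 15.00)

# Optimized: cached system prompt, batch rates for unique input and output.
optimized = (
    usd(SYSTEM_TOKENS, 0.30)      # cache reads
    + usd(UNIQUE_TOKENS, 1.50)    # uncached input at the batch rate
    + usd(OUTPUT_TOKENS, 7.50)    # output at the batch rate
)

savings = 1 - optimized / baseline
print(f"baseline:  ${baseline:,.2f}")
print(f"optimized: ${optimized:,.2f}")
print(f"savings:   {savings:.0%}")
```

Changing the three token constants shows how the savings shift with prompt shape: the larger the cached system prompt relative to output, the closer the total gets to the caching discount.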
Frequently Asked Questions
Can I use streaming with the Batch API?
No. Batch API returns results as a completed file after processing ends — not incrementally. If you need streaming, use real-time inference.
What's the actual time-to-results for Batch API?
Batches have a 24-hour processing window, and requests not finished by then expire. In practice most batches with standard volumes (under 100K requests) complete in 1–4 hours. For scheduling purposes, "available by morning if submitted overnight" is a reliable rule.
Does Batch API support all models?
Yes — Haiku, Sonnet, and Opus are all available in the Batch API. The 50% discount applies uniformly across all models.
What's the maximum batch size?
100,000 requests per batch, or 256 MB total request size, whichever is hit first. For larger jobs, split into multiple batches and track by custom_id.
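Splitting a large job can be done with a simple greedy pass. In the sketch below the limits are passed in as parameters rather than hard-coded, and request size is estimated from the JSON-serialized length, which is an assumption about how the size limit is measured. `split_into_batches` is a hypothetical helper.

```python
import json


def split_into_batches(requests: list, max_requests: int, max_bytes: int) -> list:
    """Greedily group requests so each batch stays under both limits."""
    batches, current, current_bytes = [], [], 0
    for req in requests:
        size = len(json.dumps(req).encode("utf-8"))
        # Start a new batch when adding this request would break a limit.
        if current and (len(current) >= max_requests or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        # A single request larger than max_bytes still gets its own batch.
        current.append(req)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Example: 25 small requests with a limit of 10 requests per batch.
reqs = [{"custom_id": f"item-{i}"} for i in range(25)]
print([len(b) for b in split_into_batches(reqs, max_requests=10, max_bytes=10**9)])
```

Since custom_id values stay unique across the whole job, results from the separate batches can be merged afterwards without collisions.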
Can I cancel a batch mid-run?
Yes: client.messages.batches.cancel(batch_id). Requests already processed are billed; unprocessed requests are not. The batch status moves to canceling then ended.
Does streaming reduce my API bill?
No. Streaming vs non-streaming real-time inference costs the same. The only way to reduce cost via request mode is the Batch API (50% off) or prompt caching (up to 90% off input tokens).
See also
- Claude API Cost Optimization Guide — full cost reduction playbook: caching, batching, model tiering, token compression
- Prompt Caching Cost Benchmark: $487→$52 in Real Numbers (2026) — measured production savings from caching
- Claude API Streaming Guide — implementation patterns for SSE streaming with Python and TypeScript