
Claude API: Streaming vs Batch — Which Saves More (2026)

Streaming vs Batch API for Claude: latency tradeoffs, cost math, and the decision rule for routing workloads. With estimated throughput and $ per 1M.

Batch API is 50% cheaper than real-time inference. Streaming adds perceived speed at zero cost premium. The decision rule is simple: if the user is watching, stream. If no one is waiting, batch. Teams that get this backwards, running batch-suitable jobs through real-time endpoints, pay up to twice what the same work would cost on the Batch API.


The cost difference, in one table

Mode | Input / 1M tokens | Output / 1M tokens | When billed
Real-time (default) | $3.00 | $15.00 | Per request, immediately
Streaming | $3.00 | $15.00 | Same as real-time
Batch API | $1.50 | $7.50 | After batch completes

Streaming costs the same as non-streaming real-time inference: you pay the same per token whether you receive tokens word by word or all at once. The only difference is perceived latency.

Batch API is exactly 50% off across all models and all token types. Cache read discounts stack on top.

A $1,000/month real-time workload costs $500/month on Batch API, if the task doesn't require an immediate response.
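The arithmetic behind these figures can be sketched in a few lines. This is a minimal illustration, assuming the $3.00 / $15.00 per 1M rates from the table and a flat 50% batch discount; the function name `request_cost_usd` and its parameters are hypothetical and not part of any SDK.

```python
# Rates per 1M tokens, copied from the pricing table (assumed current).
INPUT_PER_M = 3.00
OUTPUT_PER_M = 15.00
BATCH_DISCOUNT = 0.50  # Batch API is billed at half the real-time rate


def request_cost_usd(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Return the USD cost of a token volume, real-time or batch."""
    cost = (
        input_tokens / 1_000_000 * INPUT_PER_M
        + output_tokens / 1_000_000 * OUTPUT_PER_M
    )
    return cost * (1 - BATCH_DISCOUNT) if batch else cost


# 1M input tokens plus 1M output tokens, in each mode.
print(request_cost_usd(1_000_000, 1_000_000))              # 18.0
print(request_cost_usd(1_000_000, 1_000_000, batch=True))  # 9.0
```

Streaming is not a parameter here because it does not change the per-token price.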


When streaming makes sense

Streaming delivers tokens to the client as they are generated. The model is still running the same computation — you just receive results incrementally instead of waiting for completion.

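Use streaming when a user is watching the response as it is generated.

What streaming does NOT do: lower your bill or shorten total generation time. Per-token pricing and the underlying computation are unchanged.

The win is purely perceptual: the wait before the user sees any output drops from ~2–15s (waiting for the full response) to ~0.3–0.8s (time to first token, TTFT).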
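One way to check this on your own workload is to time the first chunk separately from the full response. The sketch below is an illustration only: `first_chunk_latency` and `claude_text_stream` are hypothetical helper names, the model ID is the one used elsewhere in this guide, and the streaming call assumes the Anthropic Python SDK's `messages.stream` helper with an API key set in the environment.

```python
import time
from typing import Iterable, Iterator, Tuple


def first_chunk_latency(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Consume a text stream; return (seconds to first chunk, total seconds, full text)."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start  # first visible output
        parts.append(chunk)
    total = time.perf_counter() - start
    # An empty stream has no first chunk, so fall back to the total time.
    return (first if first is not None else total), total, "".join(parts)


def claude_text_stream(prompt: str, model: str = "claude-sonnet-4-5") -> Iterator[str]:
    """Yield text deltas from a streamed Messages API call (requires network access)."""
    import anthropic  # imported lazily so the timing helper works without the SDK

    client = anthropic.Anthropic()
    with client.messages.stream(
        model=model,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            yield text


# Usage (makes a billable API call):
# ttft, total, text = first_chunk_latency(claude_text_stream("Explain TTFT in one sentence."))
# print(f"first chunk after {ttft:.2f}s, complete after {total:.2f}s")
```

The total time should be roughly the same as a non-streaming call. Only the first number changes what the user experiences.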


When Batch API makes sense

Batch API queues requests and processes them asynchronously, returning results within 24 hours (typically 1–4 hours in practice for standard workloads).

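Use Batch API when no one is waiting on the response and results can arrive hours later.

What Batch API does NOT do: return results immediately or stream them. Results become available only after the batch finishes processing.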


Illustrative cost comparison

Three illustrative workloads based on published Anthropic pricing, May 2026:

1 — Email draft generation

Task: Generate first-draft replies to 25,000 customer support tickets/month. Avg 1,500 input, 250 output tokens.

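Mode | Cost/month | Latency
Streaming (real-time) | $206 | 2–3s TTFT
Batch API | $103 | Results available next morning

Costs use the $3.00 / $15.00 per 1M rates from the pricing table: 37.5M input tokens ($112.50) plus 6.25M output tokens ($93.75) per month in real time, and half that on batch.

Winner: Hybrid. Simple tickets (FAQ-answerable) → streaming for the agent to present immediately. Complex tickets (needing research) → batch overnight. Net cost lands between $103 and $206/month, depending on the share routed to batch.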
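The cost of a hybrid split can be estimated by blending the two rates. The sketch below is a rough model, assuming the $3.00 / $15.00 per 1M rates from the pricing table and a 50% batch discount; other models or rates would give different totals. The function name `monthly_cost_usd` and the `batch_share` parameter are hypothetical.

```python
def monthly_cost_usd(
    requests: int,
    input_tokens: int,
    output_tokens: int,
    batch_share: float = 0.0,
    input_per_m: float = 3.00,    # assumed rate per 1M input tokens
    output_per_m: float = 15.00,  # assumed rate per 1M output tokens
    batch_discount: float = 0.50,
) -> float:
    """Monthly USD cost when `batch_share` of requests go through the Batch API."""
    if not 0.0 <= batch_share <= 1.0:
        raise ValueError("batch_share must be between 0 and 1")
    per_request = (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000
    realtime_total = requests * per_request
    # Real-time requests pay full price; batched requests pay the discounted price.
    blended = (1 - batch_share) + batch_share * (1 - batch_discount)
    return realtime_total * blended


# The support-ticket workload: 25,000 tickets, 1,500 input and 250 output tokens each.
for share in (0.0, 0.5, 1.0):
    cost = monthly_cost_usd(25_000, 1_500, 250, batch_share=share)
    print(f"{share:.0%} batched: ${cost:,.0f}/month")
```

The batch share is the only lever in this model. Every ticket moved from streaming to batch saves half of that ticket's cost.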

2 — Product description generation

Task: Generate alt text and short descriptions for 10,000 new product listings weekly. 800 input, 150 output tokens.

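Mode | Cost/month | Notes
Real-time | ~$201 | No user is waiting
Batch API | ~$101 | Run nightly, results in catalog by morning

Costs use the $3.00 / $15.00 per 1M rates from the pricing table: $46.50 per weekly run of 10,000 listings in real time, at about 4.33 weeks per month.

Winner: Batch. Descriptions don't need to be ready in seconds; they need to be ready before the product goes live. Batch saves 50% with no quality or UX tradeoff.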

3 — Eval runs

Task: Score 500 model outputs against a rubric for a weekly regression eval.

Mode | Cost/run | Time
Real-time | $0.90 | ~12 minutes
Batch API | $0.45 | ~45 minutes

Winner: Batch. Evals run overnight in CI — 45 minutes is fine. Saves 50% on every regression run.


The Batch API in practice

Python: submit a batch

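import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# texts = your list of inputs (placeholder values shown here)
texts = ["First document to summarize.", "Second document to summarize."]

# Build one request per input; custom_id is how results are matched back later
requests = [
    {
        "custom_id": f"item-{i}",
        "params": {
            "model": "claude-sonnet-4-5",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": f"Summarize: {text}"}]
        }
    }
    for i, text in enumerate(texts)
]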

# Submit
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")

Poll and retrieve results

import time

def wait_for_batch(batch_id: str, poll_interval: int = 60) -> list:
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        
        if batch.processing_status == "ended":
            break
        
        print(f"Status: {batch.processing_status} — waiting {poll_interval}s")
        time.sleep(poll_interval)
    
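    # Iterate over results; order is not guaranteed, so match on custom_id
    results = []
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            text = result.result.message.content[0].text
            results.append({"id": result.custom_id, "output": text})
        else:
            # "errored" results carry an error payload; "canceled" and "expired" do not
            error = getattr(result.result, "error", None)
            results.append({"id": result.custom_id, "status": result.result.type, "error": error})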
    
    return results

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function submitBatch(texts: string[]): Promise<string> {
  const requests = texts.map((text, i) => ({
    custom_id: `item-${i}`,
    params: {
      model: "claude-sonnet-4-5" as const,
      max_tokens: 256,
      messages: [{ role: "user" as const, content: `Summarize: ${text}` }]
    }
  }));

  const batch = await client.messages.batches.create({ requests });
  return batch.id;
}

async function retrieveBatch(batchId: string) {
  // Poll until done
  while (true) {
    const batch = await client.messages.batches.retrieve(batchId);
    if (batch.processing_status === "ended") break;
    await new Promise(r => setTimeout(r, 60_000));
  }
  
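  // Collect text from succeeded requests; record the status of everything else
  const results: { id: string; output?: string; status?: string }[] = [];
  for await (const result of await client.messages.batches.results(batchId)) {
    if (result.result.type === "succeeded") {
      const block = result.result.message.content[0];
      // Only text blocks have a .text field
      results.push({ id: result.custom_id, output: block?.type === "text" ? block.text : "" });
    } else {
      results.push({ id: result.custom_id, status: result.result.type });
    }
  }
  return results;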
}

The decision rule

Ask one question: Is a user waiting for this response right now?

User watching?
├── YES → Stream (real-time, streaming enabled)
│   └── Why: TTFT matters; cost is the same as non-streaming
└── NO → How soon are results needed?
    ├── Within minutes → Real-time (non-streaming)
    └── Can wait hours → Batch (50% savings)
        └── Volume > 100 requests? → Definitely batch

Streaming vs non-streaming (same cost): The only reason to choose non-streaming over streaming is implementation simplicity. If you're not displaying tokens progressively, there's no downside to streaming — but no benefit either.
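The rule can also be written as a small routing function. This is one possible reading of the tree, not a definitive implementation: the names `Mode` and `choose_mode` are hypothetical, and the 24-hour default is a conservative assumption that a job goes to batch only if it can tolerate the full batch processing window. The volume check is left out because the deadline decides the route on its own.

```python
from enum import Enum


class Mode(str, Enum):
    STREAM = "stream"      # real-time with streaming enabled
    REALTIME = "realtime"  # real-time, non-streaming
    BATCH = "batch"        # Batch API


def choose_mode(
    user_waiting: bool,
    max_wait_hours: float,
    batch_window_hours: float = 24.0,  # assumption: plan for the worst-case batch turnaround
) -> Mode:
    """Pick a request mode from who is waiting and how long results can wait."""
    if user_waiting:
        return Mode.STREAM
    if max_wait_hours >= batch_window_hours:
        return Mode.BATCH
    return Mode.REALTIME


print(choose_mode(user_waiting=True, max_wait_hours=0).value)     # stream
print(choose_mode(user_waiting=False, max_wait_hours=0.1).value)  # realtime
print(choose_mode(user_waiting=False, max_wait_hours=48).value)   # batch
```

Teams comfortable with typical rather than worst-case turnaround could pass a smaller `batch_window_hours`, accepting the risk that a slow batch misses the deadline.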


Stacking discounts: Batch + Cache

Prompt caching and Batch API stack multiplicatively:

Optimization | Savings
Batch API alone | 50% off input + output
Prompt caching alone | 90% off cached input tokens
Both together | ~70–80% off total cost

Example: 50,000 requests/month with a 5,000-token system prompt (cached), 1,000-token unique input, 300-token output on Sonnet.

Without optimizations: (5000 + 1000) × $3/1M × 50000 + 300 × $15/1M × 50000 = $900 + $225 = $1,125/month

With cache + batch:
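A best-case version of that calculation can be sketched as follows. It assumes every request reads the 5,000-token system prompt from cache at 90% off, that the batch discount multiplies on top as stated above, and that cache-write charges and cache misses are negligible. Real totals will be somewhat higher, and all constant names are illustrative.

```python
REQUESTS = 50_000
CACHED_PROMPT_TOKENS = 5_000   # system prompt, assumed to hit cache on every request
UNIQUE_INPUT_TOKENS = 1_000
OUTPUT_TOKENS = 300

INPUT_PER_M = 3.00             # Sonnet input rate per 1M tokens
OUTPUT_PER_M = 15.00           # Sonnet output rate per 1M tokens
CACHE_READ_MULTIPLIER = 0.10   # cached input billed at 10% (90% off)
BATCH_MULTIPLIER = 0.50        # Batch API billed at 50%


def usd(tokens: int, rate_per_m: float) -> float:
    """Convert a token count and a per-1M rate into dollars."""
    return tokens / 1_000_000 * rate_per_m


baseline = (
    usd(REQUESTS * (CACHED_PROMPT_TOKENS + UNIQUE_INPUT_TOKENS), INPUT_PER_M)
    + usd(REQUESTS * OUTPUT_TOKENS, OUTPUT_PER_M)
)

cached_input = usd(REQUESTS * CACHED_PROMPT_TOKENS, INPUT_PER_M * CACHE_READ_MULTIPLIER)
unique_input = usd(REQUESTS * UNIQUE_INPUT_TOKENS, INPUT_PER_M)
output = usd(REQUESTS * OUTPUT_TOKENS, OUTPUT_PER_M)
optimized = (cached_input + unique_input + output) * BATCH_MULTIPLIER

savings = 1 - optimized / baseline
print(f"baseline ${baseline:.2f}, optimized ${optimized:.2f}, savings {savings:.0%}")
# baseline $1125.00, optimized $225.00, savings 80%
```

Under these assumptions the combined saving is at the top of the ~70–80% range. A lower cache hit rate or cache-write charges would pull it toward the bottom.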



Frequently Asked Questions

Can I use streaming with the Batch API?

No. Batch API returns results as a completed file after processing ends — not incrementally. If you need streaming, use real-time inference.

What's the actual time-to-results for Batch API?

Batches have a 24-hour processing window, but in practice most batches with standard volumes (under 100K requests) complete in 1–4 hours. For scheduling purposes, "available by morning if submitted overnight" is a reasonable rule.

Does Batch API support all models?

Yes — Haiku, Sonnet, and Opus are all available in the Batch API. The 50% discount applies uniformly across all models.

What's the maximum batch size?

100,000 requests per batch, or 256 MB total request size, whichever is hit first. For larger jobs, split into multiple batches and track by custom_id.
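Splitting a large job can be sketched as a generator that starts a new batch whenever either limit would be exceeded. This is an illustration under stated assumptions: `split_into_batches` is a hypothetical helper, the limits are passed in rather than hard-coded so they can follow the current documentation, and request size is approximated by the length of its JSON encoding.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def split_into_batches(
    requests: Iterable[Dict[str, Any]],
    max_requests: int,
    max_bytes: int,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of requests that stay within the count and size limits."""
    current: List[Dict[str, Any]] = []
    current_bytes = 0
    for request in requests:
        size = len(json.dumps(request).encode("utf-8"))  # approximate wire size
        over_count = len(current) >= max_requests
        over_size = current_bytes + size > max_bytes
        if current and (over_count or over_size):
            yield current
            current, current_bytes = [], 0
        # A single request larger than max_bytes still gets a batch of its own.
        current.append(request)
        current_bytes += size
    if current:
        yield current


# Example with deliberately small limits so the split is visible.
demo = [{"custom_id": f"item-{i}", "params": {"max_tokens": 256}} for i in range(5)]
for n, chunk in enumerate(split_into_batches(demo, max_requests=2, max_bytes=1_000_000)):
    print(n, [r["custom_id"] for r in chunk])
```

Each yielded list can be passed to `client.messages.batches.create(requests=...)`. Because `custom_id` values stay unique across chunks, results from different batches can be merged afterwards.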

Can I cancel a batch mid-run?

Yes: client.messages.batches.cancel(batch_id). Requests already processed are billed; unprocessed requests are not. The batch status moves to canceling then ended.

Does streaming reduce my API bill?

No. Streaming vs non-streaming real-time inference costs the same. The only way to reduce cost via request mode is the Batch API (50% off) or prompt caching (up to 90% off input tokens).



AI Disclosure: Written with Claude Code. Pricing from Anthropic's May 2026 published rates.
