Claude API: Streaming vs Batch — Which Saves More (2026)

Batch API is 50% cheaper than real-time inference. Streaming adds perceived speed at zero cost premium. The decision rule is simple: if a user is watching, stream; if no one is waiting, batch. Teams that get this backwards, running deferrable jobs through the real-time endpoint and making users wait on jobs that should be interactive, pay 2× what they should for the deferrable work, and more once missed prompt-caching savings are counted.

The cost difference, in one table

Mode	Input / 1M	Output / 1M	When billed
Real-time (default)	$3.00	$15.00	Per request, immediately
Streaming	$3.00	$15.00	Same as real-time
Batch API	$1.50	$7.50	After batch completes

Streaming costs the same as non-streaming real-time inference: you pay the same per token whether you receive tokens word-by-word or all at once. The only difference is perceived latency.

Batch API is exactly 50% off across all models and all token types. Cache read discounts stack on top.

A $1,000/month real-time workload costs $500/month on Batch API, if the task doesn't require an immediate response.
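The table can be turned into a small calculator. This is a minimal sketch that hard-codes the Sonnet-class rates from the table above; the `Workload` type, the function name, and the example workload are illustrative assumptions, not part of any SDK.

```python
from dataclasses import dataclass

# Per-million-token rates copied from the table above.
REALTIME_INPUT_PER_M = 3.00
REALTIME_OUTPUT_PER_M = 15.00
BATCH_DISCOUNT = 0.50  # Batch API is billed at 50% of the real-time rate


@dataclass
class Workload:  # hypothetical helper type
    requests_per_month: int
    input_tokens: int   # average per request
    output_tokens: int  # average per request


def monthly_cost(w: Workload, batch: bool = False) -> float:
    """Return the monthly cost in dollars.

    Streaming is not a parameter because it is billed like non-streaming.
    """
    input_cost = w.requests_per_month * w.input_tokens / 1_000_000 * REALTIME_INPUT_PER_M
    output_cost = w.requests_per_month * w.output_tokens / 1_000_000 * REALTIME_OUTPUT_PER_M
    total = input_cost + output_cost
    return total * BATCH_DISCOUNT if batch else total


# Made-up workload: 100,000 requests, 2,000 input and 400 output tokens each.
example = Workload(requests_per_month=100_000, input_tokens=2_000, output_tokens=400)
print(f"real-time: ${monthly_cost(example):,.2f}")
print(f"batch:     ${monthly_cost(example, batch=True):,.2f}")
```

For this made-up workload the real-time bill is $1,200 and the batch bill is $600, which is the same halving described above.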

When streaming makes sense

Streaming delivers tokens to the client as they are generated. The model is still running the same computation — you just receive results incrementally instead of waiting for completion.

Use streaming when:

A human is reading the output in real time (chat, live completions, code editor autocomplete)
Time-to-first-token (TTFT) matters to user experience — even a 2-second delay feels slow in interactive contexts
You're building a UI that should feel responsive (streaming at 30–80 tokens/sec feels fast; waiting 15 seconds for 500 tokens feels broken)
Multi-turn conversation where the user replies mid-response

What streaming does NOT do:

Reduce total latency for the full response (the model generates at the same speed)
Lower cost (same price as non-streaming)
Improve output quality

The win is purely perceptual: TTFT drops from ~2–15s (full-response wait) to ~0.3–0.8s (first token arrives quickly).
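To see the perceptual difference on your own prompts, you can time the first chunk separately from the full response. The sketch below uses the Python SDK's `client.messages.stream(...)` helper and its `text_stream` iterator; the function names `stream_claude` and `measure_stream`, the model alias, and `max_tokens=512` are assumptions chosen for illustration.

```python
import time
from typing import Iterable, Iterator, Tuple


def stream_claude(prompt: str, model: str = "claude-sonnet-4-5") -> Iterator[str]:
    """Yield text chunks from a streamed Messages API response.

    Requires the `anthropic` package and ANTHROPIC_API_KEY in the environment.
    """
    import anthropic  # imported lazily so the timing helper works without it

    client = anthropic.Anthropic()
    with client.messages.stream(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            yield text


def measure_stream(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Return (seconds to first chunk, total seconds, full text)."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return (first if first is not None else total), total, "".join(parts)
```

A call such as `ttft, total, text = measure_stream(stream_claude("Explain TTFT in one sentence."))` returns both timings. The total should be close to what a non-streaming call takes; only the first value is what the user perceives.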

When Batch API makes sense

Batch API queues requests and processes them asynchronously, returning results within 24 hours (typically 1–4 hours in practice for standard workloads).

Use Batch API when:

No user is waiting for the result right now
Processing large volumes overnight or on a schedule (nightly classification, daily report generation, weekly data enrichment)
Generating content assets in advance (product descriptions, alt text, email drafts queued for review)
Running evals against a labeled dataset — 100+ samples run cheaply as a batch
Any pipeline step that feeds the next step hours later, not seconds later

What Batch API does NOT do:

Return results in real time (24h SLA, typically 1–4h)
Support streaming output (results arrive as a completed file, not incremental tokens)
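Support interactive back-and-forth (each batch item is one independent request that produces a single response; earlier turns can be included in its messages list, but nobody can reply mid-batch)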

Illustrative cost comparison

Three illustrative workloads based on published Anthropic pricing, May 2026:

1 — Email draft generation

Task: Generate first-draft replies to 25,000 customer support tickets/month. Avg 1,500 input, 250 output tokens.

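Mode	Cost/month	Latency
Streaming (real-time)	$206	2–3s TTFT
Batch API	$103	Results available next morning

At the Sonnet rates above, 37.5M input tokens ($112.50) plus 6.25M output tokens ($93.75) come to $206.25/month in real time, and half that in batch.

Winner: Hybrid. Simple tickets (FAQ-answerable) → streaming for the agent to present immediately. Complex tickets (needing research) → batch overnight. Net: between $103 and $206/month at full quality, depending on the split; for example, routing 70% of tickets to batch lands near $134.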

2 — Product description generation

Task: Generate alt text and short descriptions for 10,000 new product listings weekly. 800 input, 150 output tokens.

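Mode	Cost/month	Notes
Real-time	~$200	No user is waiting
Batch API	~$100	Run nightly, results in catalog by morning

Per week that is 8M input tokens ($24.00) plus 1.5M output tokens ($22.50), or $46.50 in real time at the Sonnet rates above; the monthly figures assume 52/12 weeks per month.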

Winner: Batch. Descriptions don't need to be ready in seconds — they need to be ready before the product goes live. Batch saves 50% with no quality or UX tradeoff.

3 — Eval runs

Task: Score 500 model outputs against a rubric for a weekly regression eval.

Mode	Cost/run	Time
Real-time	$0.90	~12 minutes
Batch API	$0.45	~45 minutes

Winner: Batch. Evals run overnight in CI — 45 minutes is fine. Saves 50% on every regression run.
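An eval run maps naturally onto batch items because each sample's `custom_id` lets you join the verdict back to its label afterwards. The following is a rough sketch, assuming each sample is a dict with `id`, `output`, and `label` keys and that the grader is asked for a PASS or FAIL verdict; the prompt wording, `max_tokens`, and both function names are hypothetical.

```python
from typing import Dict, List


def build_eval_requests(samples: List[Dict[str, str]], rubric: str,
                        model: str = "claude-sonnet-4-5") -> List[dict]:
    """Turn labeled samples into Batch API request dicts, one per sample."""
    return [
        {
            "custom_id": sample["id"],  # used later to join results to labels
            "params": {
                "model": model,
                "max_tokens": 16,
                "messages": [{
                    "role": "user",
                    "content": (
                        f"{rubric}\n\nOutput to score:\n{sample['output']}"
                        "\n\nAnswer PASS or FAIL."
                    ),
                }],
            },
        }
        for sample in samples
    ]


def agreement_rate(samples: List[Dict[str, str]], verdicts: Dict[str, str]) -> float:
    """Fraction of samples whose batch verdict matches the human label."""
    matches = sum(
        1 for s in samples
        if verdicts.get(s["id"], "").strip().upper() == s["label"]
    )
    return matches / len(samples)
```

Here `verdicts` is a mapping from `custom_id` to the model's text output, built from the batch results once processing has ended.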

The Batch API in practice

Python: submit a batch

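import anthropic

# Reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()

# Placeholder inputs; replace with your own list of documents
texts = ["First document to summarize", "Second document to summarize"]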

# Build batch requests
requests = [
    {
        "custom_id": f"item-{i}",
        "params": {
            "model": "claude-sonnet-4-5",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": f"Summarize: {text}"}]
        }
    }
    for i, text in enumerate(texts)  # texts = your list of inputs
]

# Submit
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")

Poll and retrieve results

import time

def wait_for_batch(batch_id: str, poll_interval: int = 60) -> list:
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        
        if batch.processing_status == "ended":
            break
        
        print(f"Status: {batch.processing_status} — waiting {poll_interval}s")
        time.sleep(poll_interval)
    
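    # Iterate over results; order is not guaranteed, so match on custom_id
    results = []
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            # Assumes the first content block is text (true for plain prompts without tools)
            text = result.result.message.content[0].text
            results.append({"id": result.custom_id, "output": text})
        elif result.result.type == "errored":
            results.append({"id": result.custom_id, "error": result.result.error})
        else:
            # "canceled" and "expired" results carry no error payload
            results.append({"id": result.custom_id, "error": result.result.type})

    return results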

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function submitBatch(texts: string[]): Promise<string> {
  const requests = texts.map((text, i) => ({
    custom_id: `item-${i}`,
    params: {
      model: "claude-sonnet-4-5" as const,
      max_tokens: 256,
      messages: [{ role: "user" as const, content: `Summarize: ${text}` }]
    }
  }));

  const batch = await client.messages.batches.create({ requests });
  return batch.id;
}

async function retrieveBatch(batchId: string) {
  // Poll until done
  while (true) {
    const batch = await client.messages.batches.retrieve(batchId);
    if (batch.processing_status === "ended") break;
    await new Promise(r => setTimeout(r, 60_000));
  }
  
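  const results: { id: string; output?: string; error?: string }[] = [];
  for await (const result of await client.messages.batches.results(batchId)) {
    if (result.result.type === "succeeded") {
      // Extract the text of the first content block, matching the Python version
      const block = result.result.message.content[0];
      results.push({ id: result.custom_id, output: block?.type === "text" ? block.text : "" });
    } else {
      // "errored", "canceled", or "expired": record the outcome instead of dropping it
      results.push({ id: result.custom_id, error: result.result.type });
    }
  }
  return results;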
}

The decision rule

Ask one question: Is a user waiting for this response right now?

User watching?
├── YES → Stream (real-time, streaming enabled)
│   └── Why: TTFT matters; cost is the same as non-streaming
└── NO → How soon are results needed?
    ├── Within minutes → Real-time (non-streaming)
    └── Can wait hours → Batch (50% savings)
        └── Volume > 100 requests? → Definitely batch

Streaming vs non-streaming (same cost): The only reason to choose non-streaming over streaming is implementation simplicity. If you're not displaying tokens progressively, there's no downside to streaming — but no benefit either.
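The rule is simple enough to encode as a routing function in front of your API calls. This is one possible sketch: the `Mode` enum, the function name, and the default cutoff are assumptions. The cutoff is set conservatively to the full 24-hour batch window; lower it if the typical turnaround you observe is good enough for your pipeline.

```python
from enum import Enum


class Mode(Enum):
    STREAM = "real-time, streaming"
    REALTIME = "real-time, non-streaming"
    BATCH = "batch"


def choose_mode(user_watching: bool, deadline_seconds: float,
                batch_min_deadline: float = 24 * 60 * 60) -> Mode:
    """Pick a request mode.

    deadline_seconds is how long the caller can wait for the result.
    batch_min_deadline is an assumed cutoff, not an Anthropic-defined value.
    """
    if user_watching:
        return Mode.STREAM
    if deadline_seconds >= batch_min_deadline:
        return Mode.BATCH
    return Mode.REALTIME


print(choose_mode(user_watching=False, deadline_seconds=12 * 60 * 60,
                  batch_min_deadline=4 * 60 * 60))
```

Keeping the cutoff as a parameter makes the trade-off explicit: a lower value routes more work to the discounted path but accepts the risk that a slow batch misses the deadline.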

Stacking discounts: Batch + Cache

Prompt caching and Batch API stack multiplicatively:

Optimization	Savings
Batch API alone	50% off input + output
Prompt caching alone	90% off cached input tokens
Both together	~70–80% off total cost

Example: 50,000 requests/month with a 5,000-token system prompt (cached), 1,000-token unique input, 300-token output on Sonnet.

Without optimizations: (5000 + 1000) × $3/1M × 50000 + 300 × $15/1M × 50000 = $900 + $225 = $1,125/month

With cache + batch:

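System prompt, cached read at the batch rate: 5,000 × $0.15/1M × 50,000 = $37.50
Unique input, batch: 1,000 × $1.50/1M × 50,000 = $75
Output, batch: 300 × $7.50/1M × 50,000 = $112.50
Total: $225/month, 80% less than baseline

This assumes every request hits the cache and ignores the cache-write premium. Cache hits inside a batch are best-effort, so real savings fall somewhat short of this figure.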
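The arithmetic for this example can be reproduced with a small function. It is a sketch under stated assumptions: every request is a cache hit, cache-write premiums are ignored, and the `stack_on_cached` flag is a hypothetical switch added here to show both ways of pricing the cached prefix. With the flag off, the prefix is billed at the real-time cache-read rate of $0.30/1M and the total is $262.50; with it on, the batch discount also applies to cached reads ($0.15/1M) and the total is $225.00.

```python
BASE_INPUT = 3.00         # $ per 1M input tokens (real-time)
BASE_OUTPUT = 15.00       # $ per 1M output tokens (real-time)
CACHE_READ_FACTOR = 0.10  # cached input billed at 10% of the input rate
BATCH_FACTOR = 0.50       # batch billed at 50% of the real-time rate


def stacked_cost(requests, cached_in, unique_in, out,
                 use_cache=False, use_batch=False, stack_on_cached=True):
    """Monthly cost in dollars for per-request token counts."""
    batch = BATCH_FACTOR if use_batch else 1.0
    cache = CACHE_READ_FACTOR if use_cache else 1.0
    # Without caching the prefix is ordinary input, so the batch factor always applies.
    cached_batch = batch if (stack_on_cached or not use_cache) else 1.0
    cached_cost = requests * cached_in / 1e6 * BASE_INPUT * cache * cached_batch
    unique_cost = requests * unique_in / 1e6 * BASE_INPUT * batch
    output_cost = requests * out / 1e6 * BASE_OUTPUT * batch
    return cached_cost + unique_cost + output_cost


args = (50_000, 5_000, 1_000, 300)
baseline = stacked_cost(*args)
for label, kwargs in [
    ("batch only", dict(use_batch=True)),
    ("cache only", dict(use_cache=True)),
    ("both, cached reads at real-time rate", dict(use_cache=True, use_batch=True, stack_on_cached=False)),
    ("both, fully stacked", dict(use_cache=True, use_batch=True)),
]:
    cost = stacked_cost(*args, **kwargs)
    print(f"{label}: ${cost:,.2f} ({1 - cost / baseline:.0%} saved)")
```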

→ Full cost optimization playbook

Frequently Asked Questions

Can I use streaming with the Batch API?

No. Batch API returns results as a completed file after processing ends — not incrementally. If you need streaming, use real-time inference.

What's the actual time-to-results for Batch API?

Anthropic's SLA is 24 hours, but in practice most batches with standard volumes (under 100K requests) complete in 1–4 hours. For scheduling purposes, "available by morning if submitted overnight" is a reliable rule.

Does Batch API support all models?

Yes — Haiku, Sonnet, and Opus are all available in the Batch API. The 50% discount applies uniformly across all models.

What's the maximum batch size?

100,000 requests per batch, or 256 MB total request size, whichever is hit first. For larger jobs, split into multiple batches and track by custom_id.
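Splitting a large job can be done with a generator that closes a chunk as soon as either limit would be exceeded. This is a minimal sketch: the function name is hypothetical, the limits are passed in rather than hard-coded, and size is estimated from each request's JSON encoding, which only approximates how the API measures request size.

```python
import json
from typing import Iterator, List


def split_into_batches(requests: List[dict], max_requests: int,
                       max_bytes: int) -> Iterator[List[dict]]:
    """Yield consecutive chunks that stay under both limits."""
    chunk: List[dict] = []
    chunk_bytes = 0
    for request in requests:
        size = len(json.dumps(request).encode("utf-8"))
        # Close the current chunk if adding this request would break a limit.
        if chunk and (len(chunk) >= max_requests or chunk_bytes + size > max_bytes):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(request)
        chunk_bytes += size
    if chunk:
        yield chunk
```

Each chunk can then be submitted with `client.messages.batches.create(requests=chunk)`. Keep the returned batch IDs alongside the `custom_id` range they cover so results can be reassembled, and leave some headroom below the byte limit because the estimate is approximate.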

Can I cancel a batch mid-run?

Yes: client.messages.batches.cancel(batch_id). Requests already processed are billed; unprocessed requests are not. The batch status moves to canceling then ended.
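Because cancellation is asynchronous, a caller usually wants to wait for the `ended` status before reading partial results. A small sketch of that pattern follows; the function name, the default poll interval, and the injectable `sleep` parameter (included so the loop can be tested without waiting) are assumptions.

```python
import time


def cancel_and_wait(client, batch_id: str, poll_interval: float = 10.0,
                    sleep=time.sleep):
    """Request cancellation, then poll until the batch reports "ended".

    `client` is expected to be an anthropic.Anthropic instance.
    """
    batch = client.messages.batches.cancel(batch_id)
    while batch.processing_status != "ended":
        sleep(poll_interval)
        batch = client.messages.batches.retrieve(batch_id)
    return batch
```

Once the returned batch has ended, its `request_counts` field shows how many requests succeeded before the cancellation took effect and how many were canceled, which tells you what was billed.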

Does streaming reduce my API bill?

No. Streaming vs non-streaming real-time inference costs the same. The only way to reduce cost via request mode is the Batch API (50% off) or prompt caching (up to 90% off input tokens).

Take It Further

Claude API Cost Optimization Masterclass ($59) — The complete cost reduction playbook: streaming architecture, Batch API routing, prompt caching, model tiering, and token compression. 12 optimization scenarios analyzed. Includes the Excel calculator for batch vs streaming break-even.

→ Get Cost Optimization Masterclass — $59

30-day money-back guarantee. Instant download.
