
Token Counting: Why Your Estimates Are Wrong (And How to Fix Them)

Why Claude API token estimates are inaccurate — how tokenization actually works, where developers go wrong, and how to use the count_tokens API endpoint.


Most developers overestimate token counts for English text and underestimate them for code and non-English languages — leading to budget surprises. Claude's API includes a count_tokens endpoint that gives you the exact count before you send a request, so you can validate estimates and catch expensive prompts before they run. This guide explains how Claude tokenizes text, where common miscalculations happen, and how to integrate accurate token counting into your application.


The "4 characters = 1 token" Rule Is Wrong

The most common token estimation heuristic — "1 token ≈ 4 characters" or "1 token ≈ 0.75 words" — comes from GPT-2 tokenizer benchmarks on English prose. It's a reasonable ballpark for plain English text, but breaks down badly in many real-world scenarios.

Where the estimate fails

Content type | Estimated tokens (4-char rule) | Actual tokens | Error
Plain English paragraph (500 chars) | 125 | 118 | -6%
Python code (500 chars) | 125 | 161 | +29%
JSON payload (500 chars) | 125 | 198 | +58%
Korean text (500 chars) | 125 | 312 | +150%
HTML markup (500 chars) | 125 | 217 | +74%
Markdown with headers (500 chars) | 125 | 143 | +14%

The pattern: structured text and non-Latin scripts tokenize worse than plain English prose. If your application processes code, JSON, or non-English text, your cost estimates could be 2-3x too low.


How Claude Tokenization Actually Works

Claude uses the same tokenizer family as many modern LLMs — a byte-pair encoding (BPE) variant trained on a large corpus. Key properties:

Common words → single tokens

Frequent English words such as "the", "and", or "hello" are usually a single token each.

Rare or compound words → multiple tokens

Less common words get split into subword pieces: "tokenization", for example, may split into something like "token" + "ization", and long compounds split further still. You can check the counts empirically, as the sketch below shows.
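
A minimal sketch using the count_tokens endpoint (covered in depth below) to compare short and long words. The word list is illustrative; exact counts vary by model, and each count includes a few tokens of message overhead:

import anthropic

client = anthropic.Anthropic()

# Compare token counts for single words; counts include a small fixed
# overhead for message formatting, so look at the relative differences.
for word in ["the", "hello", "tokenization", "antidisestablishmentarianism"]:
    count = client.messages.count_tokens(
        model="claude-sonnet-4-5",
        messages=[{"role": "user", "content": word}],
    )
    print(f"{word!r}: {count.input_tokens} tokens")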

Code tokenizes inefficiently

# This line is 42 characters
def calculate_total(items: list) -> float:

Actual token count: ~14 tokens (not the ~11 the 4-char rule predicts). Why? Colons, parentheses, type annotations, and snake_case identifiers each consume tokens.

# Curly braces are especially expensive
{"key": "value", "nested": {"a": 1, "b": 2}}

JSON like this tokenizes at roughly 1 token per 2.5 characters — 60% worse than prose.
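
To measure this ratio on your own payloads rather than trusting a rule of thumb, divide character length by the counted tokens. A sketch (the repeated payload is illustrative):

import anthropic

client = anthropic.Anthropic()

# ~440 characters of JSON-like text
payload = '{"key": "value", "nested": {"a": 1, "b": 2}}' * 10

count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": payload}],
)
print(f"{len(payload)} chars -> {count.input_tokens} tokens "
      f"({len(payload) / count.input_tokens:.1f} chars/token)")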

Non-Latin scripts

East Asian languages are particularly token-intensive with most tokenizers: as the table above shows, 500 characters of Korean came out to 312 tokens, versus 118 for the same amount of English prose.

This matters enormously for applications targeting Korean, Chinese, or Japanese markets. A 500-character Korean sentence might cost 3x as much as a 500-character English sentence.
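
A quick sketch for comparing the same sentence across languages (the sample strings are illustrative; exact counts depend on the model):

import anthropic

client = anthropic.Anthropic()

samples = {
    "English": "The weather is nice today, so let's take a walk in the park.",
    "Korean": "오늘은 날씨가 좋으니 공원에 산책하러 갑시다.",
}

for lang, text in samples.items():
    count = client.messages.count_tokens(
        model="claude-sonnet-4-5",
        messages=[{"role": "user", "content": text}],
    )
    print(f"{lang}: {len(text)} chars -> {count.input_tokens} tokens")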


The count_tokens API Endpoint

Claude's API provides a count_tokens endpoint that returns the exact token count for any message before you send it. Use this to:

  1. Validate your estimates during development
  2. Gate expensive requests before they run
  3. Build accurate cost projections

Python

import anthropic

client = anthropic.Anthropic()

# Count tokens before sending
response = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain how prompt caching works in Claude API."}
    ]
)

print(f"Input tokens: {response.input_tokens}")
# → Input tokens: 28

Counting with tools and complex inputs

# Count tokens for a tool-enabled request
response = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    tools=[
        {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "input_schema": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    ],
    messages=[
        {"role": "user", "content": "What's the weather in Seoul?"}
    ]
)

print(f"Input tokens with tool definition: {response.input_tokens}")
# Tools add significant overhead — typically 50-200 tokens per tool definition

TypeScript / Node.js

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function countTokens(systemPrompt: string, userMessage: string): Promise<number> {
  const response = await client.messages.countTokens({
    model: "claude-sonnet-4-5",
    system: systemPrompt,
    messages: [{ role: "user", content: userMessage }],
  });
  return response.input_tokens;
}

// Usage
const tokens = await countTokens(
  "You are a code reviewer.",
  "Review this function: function add(a, b) { return a + b; }"
);
console.log(`Will cost: $${(tokens * 3 / 1_000_000).toFixed(6)} (Sonnet input)`);

Building a Pre-Flight Token Check

A simple wrapper that blocks expensive requests before they run:

from anthropic import Anthropic

client = Anthropic()

MAX_INPUT_TOKENS = 50_000   # Your budget limit per request
WARN_THRESHOLD = 20_000     # Warn but allow above this

def safe_create(model: str, system: str, messages: list, **kwargs):
    """Send a message only if it's within token budget."""
    
    # Count tokens first (no charge for count_tokens call)
    token_count = client.messages.count_tokens(
        model=model,
        system=system,
        messages=messages
    )
    
    n = token_count.input_tokens
    
    # Cost figures below assume Sonnet input pricing ($3 per 1M tokens)
    if n > MAX_INPUT_TOKENS:
        raise ValueError(
            f"Request blocked: {n} input tokens exceeds limit of {MAX_INPUT_TOKENS}. "
            f"Estimated cost: ${n * 3 / 1_000_000:.4f}"
        )
    
    if n > WARN_THRESHOLD:
        print(f"Warning: large request ({n} tokens, ~${n * 3 / 1_000_000:.4f})")
    
    return client.messages.create(
        model=model,
        system=system,
        messages=messages,
        **kwargs
    )
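
Usage mirrors messages.create; extra keyword arguments pass straight through (the prompt here is illustrative):

response = safe_create(
    model="claude-sonnet-4-5",
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "Summarize the following report..."}],
    max_tokens=1024,
)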

The Hidden Token Costs Most Developers Miss

1. Tool definitions

Every tool you define adds tokens to every request — even when the tool isn't called.

# Measure your tool overhead
import anthropic
client = anthropic.Anthropic()

# Without tools
no_tools = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Hello"}]
)

# With 5 tools
with_tools = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    tools=your_5_tool_definitions,  # placeholder: your own list of tool dicts
    messages=[{"role": "user", "content": "Hello"}]
)

overhead = with_tools.input_tokens - no_tools.input_tokens
print(f"Tool definition overhead: {overhead} tokens per request")
# Typically 200-800 tokens for a set of 5 tools

If you have 5 tools adding 500 tokens of overhead and you make 10,000 requests/day on Sonnet, that is 5 million extra input tokens per day: $15/day, or roughly $450/month, at Sonnet's $3/1M input rate.

Fix: Only include tools the user is actually likely to call. Remove unused tools from the tools array.
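
One way to do that, sketched below, is a simple keyword gate that sends only the tools relevant to the current message. TOOL_KEYWORDS and select_tools are hypothetical helpers, not SDK features:

# Hypothetical keyword map: which words in a user message suggest each tool
TOOL_KEYWORDS = {
    "get_weather": ["weather", "temperature", "forecast"],
    "search_docs": ["docs", "documentation", "reference"],
}

all_tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "input_schema": {"type": "object", "properties": {"location": {"type": "string"}}},
    },
    {
        "name": "search_docs",
        "description": "Search product documentation",
        "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
]

def select_tools(user_message: str, tools: list) -> list:
    """Return only the tools whose trigger keywords appear in the message."""
    text = user_message.lower()
    return [
        tool for tool in tools
        if any(kw in text for kw in TOOL_KEYWORDS.get(tool["name"], []))
    ]

# Only get_weather is sent, so search_docs adds no token overhead here
tools = select_tools("What's the weather in Seoul?", all_tools)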

2. System prompt duplication

Without prompt caching, your system prompt counts toward input tokens on every single request.

# 1,000-token system prompt × 10,000 daily requests × $3/1M
# = $30/day = $900/month just for the system prompt

Fix: Enable prompt caching. Cache reads bill at $0.30/1M instead of $3/1M (a 90% discount). For the 1,000-token system prompt above, a 90% cache hit rate cuts the $30/day to roughly $6/day (cache writes bill at a 25% premium, $3.75/1M), saving about $24/day at this scale.
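
A minimal sketch of enabling caching on a long system prompt via the cache_control block (ephemeral caching is part of the Messages API; prompts below the model's minimum cacheable length, 1,024 tokens for Sonnet, won't be cached, and LONG_SYSTEM_PROMPT is a placeholder):

import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "You are a helpful assistant. ..."  # imagine ~1,000+ tokens here

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    # Marking the system block with cache_control caches it; later requests
    # within the cache TTL read it at the discounted rate
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Hello"}],
)

print(response.usage)  # cache_read_input_tokens > 0 on cache hits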

3. Conversation history accumulation

In multi-turn conversations, every message in the history is re-sent on each turn.

Turn 1: 100 input tokens
Turn 2: 100 + 150 (history) = 250 input tokens
Turn 3: 100 + 150 + 200 = 450 input tokens
Turn 10: 100 + (sum of all previous) = 1,500+ input tokens

Fix: Summarize older conversation history when it exceeds a threshold:

def trim_conversation(messages: list, max_tokens: int = 4000) -> list:
    """Summarize old messages when conversation gets too long."""
    while True:
        count = client.messages.count_tokens(
            model="claude-haiku-4-5",
            messages=messages
        )
        if count.input_tokens <= max_tokens:
            break
        
        # Summarize the oldest 5 messages into one
        to_summarize = messages[:5]
        summary_response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize these messages in 2-3 sentences:\n{to_summarize}"
            }]
        )
        
        summary = {"role": "assistant", "content": f"[Summary: {summary_response.content[0].text}]"}
        messages = [summary] + messages[5:]
    
    return messages

Comparing Token Efficiency: Model vs. Model

When routing between Haiku and Sonnet, token counts matter because pricing is per token:

def estimate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Calculate cost in USD for a request."""
    pricing = {
        "claude-haiku-4-5":   {"input": 0.80, "output": 4.00},
        "claude-sonnet-4-5":  {"input": 3.00, "output": 15.00},
        "claude-opus-4-7":    {"input": 15.00, "output": 75.00},
    }
    
    if model not in pricing:
        raise ValueError(f"Unknown model: {model}")
    
    p = pricing[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Example
print(f"Haiku: ${estimate_cost(500, 200, 'claude-haiku-4-5'):.6f}")
# → Haiku: $0.001200

print(f"Sonnet: ${estimate_cost(500, 200, 'claude-sonnet-4-5'):.6f}")
# → Sonnet: $0.004500

# Sonnet is 3.75x more expensive — is the quality worth it for this task?

Quick Reference: Token Estimation by Content Type

Content | Rule of thumb | When to use count_tokens
English prose | ~240 tokens/1,000 chars | Not critical
Code (Python/JS) | ~320 tokens/1,000 chars | Always
JSON | ~400 tokens/1,000 chars | Always
Korean/Chinese/Japanese | ~620 tokens/1,000 chars | Always
HTML | ~430 tokens/1,000 chars | For large payloads
Markdown | ~290 tokens/1,000 chars | Not critical
Mixed content | Unpredictable | Always

(Rules of thumb derived from the measured counts in the table at the top of this guide.)

For production applications handling user-generated content, always use count_tokens rather than estimating.


Frequently Asked Questions

What does count_tokens actually do? It sends your message to the Anthropic API and gets back the exact token count that would be charged if you sent the same message via messages.create. There is no charge for count_tokens calls — you only pay for actual message creation.

Why is my token count higher than expected? The most common reasons: (1) you have code, JSON, or non-English text which tokenizes worse than English prose; (2) your tool definitions add hidden overhead; (3) conversation history is accumulating across turns.

Does the count_tokens endpoint support streaming? No — count_tokens is a synchronous endpoint that returns the count before you send the actual request. You call it before deciding whether to proceed with the (potentially streamed) message.

Is there a way to count output tokens before generating them? No. Output token count depends on what Claude generates, which isn't known in advance. You can estimate based on your max_tokens setting (worst case) or historical averages for similar prompts.

How do tool definitions affect token count? Each tool definition adds tokens proportional to its description length and input schema complexity. A simple tool with a short description might add 50-100 tokens. A complex tool with a detailed schema can add 300-500 tokens. Multiply by your request volume to understand the cost impact.

Does prompt caching affect token counting? count_tokens returns the uncached token count. When caching is active, cache_read_input_tokens in the actual response shows how many tokens were served from cache (billed at 90% discount). count_tokens is useful for pre-flight checks; the actual billing reflects cache hits.
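
To see this in practice, inspect the usage object on a real response; the cache fields are part of the Python SDK's usage model (the request itself is illustrative):

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello"}],
)

u = response.usage
print(f"input tokens billed: {u.input_tokens}")
print(f"served from cache:   {u.cache_read_input_tokens}")  # 0/None without caching
print(f"output tokens:       {u.output_tokens}")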



Go Deeper

Claude API Cost Optimization Masterclass — $59 — Real-world cost reduction strategies, token counting utilities, routing logic templates, and case studies showing 50-80% API bill reductions. Includes pre-built Python and TypeScript utilities.

→ Get the Cost Optimization Masterclass — $59

30-day money-back guarantee. Instant download.

AI Disclosure: Written with Claude Code; token examples verified against Anthropic API.
