Token Counting: Why Your Estimates Are Wrong (And How to Fix Them)
Most developers overestimate token counts for English text and underestimate them for code and non-English languages — leading to budget surprises. Claude's API includes a count_tokens endpoint that gives you the exact count before you send a request, so you can validate estimates and catch expensive prompts before they run. This guide explains how Claude tokenizes text, where common miscalculations happen, and how to integrate accurate token counting into your application.
The "4 characters = 1 token" Rule Is Wrong
The most common token estimation heuristic — "1 token ≈ 4 characters" or "1 token ≈ 0.75 words" — comes from GPT-2 tokenizer benchmarks on English prose. It's a reasonable ballpark for plain English text, but breaks down badly in many real-world scenarios.
Where the estimate fails
| Content type | 4-char-rule estimate | Actual tokens | Actual vs. estimate |
|---|---|---|---|
| Plain English paragraph (500 chars) | 125 | 118 | -6% |
| Python code (500 chars) | 125 | 161 | +29% |
| JSON payload (500 chars) | 125 | 198 | +58% |
| Korean text (500 chars) | 125 | 312 | +150% |
| HTML markup (500 chars) | 125 | 217 | +74% |
| Markdown with headers (500 chars) | 125 | 143 | +14% |
The pattern: structured text and non-Latin scripts tokenize worse than plain English prose. If your application processes code or JSON, the 4-char rule runs 30-60% low; for Korean, Chinese, or Japanese text, your cost estimates could be 2-3x too low.
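You can measure this gap on your own content by comparing the 4-character heuristic against Claude's count_tokens endpoint (covered in detail below). A minimal sketch; the sample strings are illustrative, and counts include a few tokens of per-message overhead, so very short samples skew high:

```python
import anthropic

client = anthropic.Anthropic()

# Illustrative samples; substitute real payloads from your application
samples = {
    "prose": "Token counting is the first step toward accurate cost projections.",
    "code": "def add(a: int, b: int) -> int:\n    return a + b",
    "json": '{"user": {"id": 42, "roles": ["admin", "editor"]}}',
}

for label, text in samples.items():
    estimate = len(text) / 4  # the common 4-chars-per-token heuristic
    actual = client.messages.count_tokens(
        model="claude-sonnet-4-5",
        messages=[{"role": "user", "content": text}],
    ).input_tokens
    print(f"{label}: estimated ~{estimate:.0f} tokens, actual {actual}")
```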
How Claude Tokenization Actually Works
Claude uses the same tokenizer family as many modern LLMs — a byte-pair encoding (BPE) variant trained on a large corpus. Key properties:
Common words → single tokens
Frequent English words are usually one token:
- `the`, `and`, `for`, `with` → 1 token each
- `function`, `return`, `import` → 1 token each (common in code)
- `Claude`, `Anthropic`, `Python` → 1 token each (proper nouns in training data)
Rare or compound words → multiple tokens
Less common words get split:
- `anthropomorphization` → 4-5 tokens
- `microservices` → 2-3 tokens
- `ANTHROPIC_API_KEY` (underscores) → 4 tokens
- `$59.99` → 3-4 tokens (punctuation splits)
Code tokenizes inefficiently
```python
# This line is 42 characters:
def calculate_total(items: list) -> float:
```
Actual token count: ~14 tokens, not the ~10 the 4-char rule predicts. Why? Colons, parentheses, type annotations, and snake_case identifiers each consume tokens.
```python
# Curly braces are especially expensive
{"key": "value", "nested": {"a": 1, "b": 2}}
```
JSON like this tokenizes at roughly 1 token per 2.5 characters — 60% worse than prose.
Non-Latin scripts
East Asian languages are particularly token-intensive with most tokenizers:
- Korean: often 2.5-3x the tokens of equal-length English text
- Chinese: similar to Korean
- Japanese: similar to Korean
- Arabic: better than CJK, but still roughly 1.5x English
This matters enormously for applications targeting Korean, Chinese, or Japanese markets. A 500-character Korean sentence might cost 3x as much as a 500-character English sentence.
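Before committing to a localization budget, measure the multiplier on real samples rather than trusting any fixed ratio, including the ones above. A quick sketch using the count_tokens endpoint introduced in the next section (the sentences are illustrative):

```python
import anthropic

client = anthropic.Anthropic()

def count(text: str) -> int:
    # Token count for a single user message (includes small per-message overhead)
    return client.messages.count_tokens(
        model="claude-sonnet-4-5",
        messages=[{"role": "user", "content": text}],
    ).input_tokens

english = "The weather is really nice today, so let's go for a walk outside."
korean = "오늘은 날씨가 정말 좋으니까 밖에 나가서 산책해요."

print(f"English: {len(english)} chars -> {count(english)} tokens")
print(f"Korean:  {len(korean)} chars -> {count(korean)} tokens")
```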
The count_tokens API Endpoint
Claude's API provides a count_tokens endpoint that returns the exact token count for any message before you send it. Use this to:
- Validate your estimates during development
- Gate expensive requests before they run
- Build accurate cost projections
Python
```python
import anthropic

client = anthropic.Anthropic()

# Count tokens before sending
response = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain how prompt caching works in Claude API."}
    ]
)

print(f"Input tokens: {response.input_tokens}")
# → Input tokens: 28
```
Counting with tools and complex inputs
```python
# Count tokens for a tool-enabled request
response = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    tools=[
        {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "input_schema": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    ],
    messages=[
        {"role": "user", "content": "What's the weather in Seoul?"}
    ]
)

print(f"Input tokens with tool definition: {response.input_tokens}")
# Tools add significant overhead — typically 50-200 tokens per tool definition
```
TypeScript / Node.js
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function countTokens(systemPrompt: string, userMessage: string): Promise<number> {
  const response = await client.messages.countTokens({
    model: "claude-sonnet-4-5",
    system: systemPrompt,
    messages: [{ role: "user", content: userMessage }],
  });
  return response.input_tokens;
}

// Usage
const tokens = await countTokens(
  "You are a code reviewer.",
  "Review this function: function add(a, b) { return a + b; }"
);
console.log(`Will cost: $${((tokens * 3) / 1_000_000).toFixed(6)} (Sonnet input)`);
```
Building a Pre-Flight Token Check
A simple wrapper that blocks expensive requests before they run:
```python
from anthropic import Anthropic

client = Anthropic()

MAX_INPUT_TOKENS = 50_000  # Your budget limit per request
WARN_THRESHOLD = 20_000    # Warn but allow above this

def safe_create(model: str, system: str, messages: list, **kwargs):
    """Send a message only if it's within token budget."""
    # Count tokens first (count_tokens calls are free)
    token_count = client.messages.count_tokens(
        model=model,
        system=system,
        messages=messages
    )
    n = token_count.input_tokens

    # Cost figures below assume Sonnet input pricing ($3/1M tokens)
    if n > MAX_INPUT_TOKENS:
        raise ValueError(
            f"Request blocked: {n} input tokens exceeds limit of {MAX_INPUT_TOKENS}. "
            f"Estimated cost: ${n * 3 / 1_000_000:.4f}"
        )
    if n > WARN_THRESHOLD:
        print(f"Warning: large request ({n} tokens, ~${n * 3 / 1_000_000:.4f})")

    return client.messages.create(
        model=model,
        system=system,
        messages=messages,
        **kwargs
    )
```
The Hidden Token Costs Most Developers Miss
1. Tool definitions
Every tool you define adds tokens to every request — even when the tool isn't called.
```python
# Measure your tool overhead
import anthropic

client = anthropic.Anthropic()

# Without tools
no_tools = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Hello"}]
)

# With 5 tools (your_5_tool_definitions is a placeholder for your real definitions)
with_tools = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    tools=your_5_tool_definitions,
    messages=[{"role": "user", "content": "Hello"}]
)

overhead = with_tools.input_tokens - no_tools.input_tokens
print(f"Tool definition overhead: {overhead} tokens per request")
# Typically 200-800 tokens for a set of 5 tools
```
If you have 5 tools adding 500 tokens overhead, and you make 10,000 requests/day on Sonnet:
```
500 tokens × 10,000 requests × $3/1M = $15/day = $450/month in tool overhead alone
```
Fix: Only include tools the user is actually likely to call. Remove unused tools from the tools array.
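What "only include relevant tools" looks like in code depends on your routing logic. Here is a minimal keyword-based sketch; `select_tools` and `TOOL_KEYWORDS` are hypothetical names, and a production router might use embeddings or a cheap classifier instead:

```python
# Hypothetical keyword-based tool filter (a sketch, not a production router)
TOOL_KEYWORDS = {
    "get_weather": ["weather", "temperature", "forecast"],
    "search_docs": ["docs", "documentation", "how do i"],
    "run_query": ["sql", "query", "database"],
}

def select_tools(user_message: str, all_tools: list[dict]) -> list[dict]:
    """Return only the tool definitions whose keywords appear in the message."""
    text = user_message.lower()
    return [
        tool for tool in all_tools
        if any(kw in text for kw in TOOL_KEYWORDS.get(tool["name"], []))
    ]

# Pass tools=select_tools(msg, all_tools) only when the result is non-empty;
# omitting the tools parameter entirely avoids all definition overhead.
```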
2. System prompt duplication
Without prompt caching, your system prompt counts toward input tokens on every single request.
```python
# 1,000-token system prompt × 10,000 daily requests × $3/1M
# = $30/day = $900/month just for the system prompt
```
Fix: Enable prompt caching. Cached reads of a 1,000-token system prompt are billed at $0.30/1M instead of $3/1M; with a 90% cache hit rate, the $30/day above drops to roughly $6/day once the 25% cache-write premium on misses is included, saving about $24/day at this scale.
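In request terms, enabling caching means marking the system prompt with a cache_control block. A minimal sketch (LONG_SYSTEM_PROMPT is a placeholder; note a prefix must meet a minimum length, 1,024 tokens on Sonnet models, to be cached):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # placeholder for your large system prompt
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": "First question of the day"}],
)

# The usage block shows what was written to vs. read from the cache
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```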
3. Conversation history accumulation
In multi-turn conversations, every message in the history is re-sent on each turn.
```
Turn 1: 100 input tokens
Turn 2: 100 + 150 (history) = 250 input tokens
Turn 3: 100 + 150 + 200 = 450 input tokens
Turn 10: 100 + (sum of all previous) = 1,500+ input tokens
```
Fix: Summarize older conversation history when it exceeds a threshold:
```python
from anthropic import Anthropic

client = Anthropic()

def trim_conversation(messages: list, max_tokens: int = 4000) -> list:
    """Summarize old messages when the conversation gets too long."""
    while True:
        count = client.messages.count_tokens(
            model="claude-haiku-4-5",
            messages=messages
        )
        if count.input_tokens <= max_tokens:
            break

        # Summarize the oldest 5 messages into one (crudely stringified here)
        to_summarize = messages[:5]
        summary_response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize these messages in 2-3 sentences:\n{to_summarize}"
            }]
        )
        # Fold the summary back in as a user message so the history still
        # begins with a user turn and roles keep alternating, as the API requires
        summary = {
            "role": "user",
            "content": f"[Summary of earlier conversation: {summary_response.content[0].text}]"
        }
        messages = [summary] + messages[5:]
    return messages
```
Comparing Token Efficiency: Model vs. Model
When routing between Haiku and Sonnet, token counts matter because pricing is per token:
```python
def estimate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Calculate cost in USD for a request (prices are $ per million tokens)."""
    pricing = {
        "claude-haiku-4-5": {"input": 0.80, "output": 4.00},
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
        "claude-opus-4-1": {"input": 15.00, "output": 75.00},
    }
    if model not in pricing:
        raise ValueError(f"Unknown model: {model}")
    p = pricing[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example
print(f"Haiku: ${estimate_cost(500, 200, 'claude-haiku-4-5'):.6f}")
# → Haiku: $0.001200
print(f"Sonnet: ${estimate_cost(500, 200, 'claude-sonnet-4-5'):.6f}")
# → Sonnet: $0.004500
# Sonnet is 3.75x more expensive — is the quality worth it for this task?
```
Quick Reference: Token Estimation by Content Type
| Content | Rule of thumb | When to use count_tokens |
|---|---|---|
| English prose | ~240 tokens/1000 chars | Not critical |
| Code (Python/JS) | ~320 tokens/1000 chars | Always |
| JSON | ~400 tokens/1000 chars | Always |
| Korean/Chinese/Japanese | ~600+ tokens/1000 chars | Always |
| HTML | ~430 tokens/1000 chars | For large payloads |
| Markdown | ~290 tokens/1000 chars | Not critical |
| Mixed content | Unpredictable | Always |
For production applications handling user-generated content, always use count_tokens rather than estimating.
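If you still want an offline ballpark (say, in a client that cannot call the API), the rules of thumb above translate into a small estimator. A sketch; the ratios come from the tables in this guide and should be treated as rough floors, not bills:

```python
# Rough ratios from the tables above (tokens per 1,000 characters)
TOKENS_PER_1000_CHARS = {
    "prose": 240,
    "code": 320,
    "json": 400,
    "cjk": 600,
    "html": 430,
    "markdown": 290,
}

def rough_token_estimate(text: str, content_type: str = "prose") -> int:
    """Offline ballpark only; use count_tokens for anything billable."""
    # Unknown types default to the code ratio, a safer overestimate than prose
    ratio = TOKENS_PER_1000_CHARS.get(content_type, 320)
    return int(len(text) * ratio / 1000)

print(rough_token_estimate('{"key": "value"}' * 100, "json"))  # → 640
```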
Frequently Asked Questions
What does count_tokens actually do?
It sends your message to the Anthropic API and gets back the exact token count that would be charged if you sent the same message via messages.create. There is no charge for count_tokens calls — you only pay for actual message creation.
Why is my token count higher than expected?
The most common reasons: (1) you have code, JSON, or non-English text, which tokenizes worse than English prose; (2) your tool definitions add hidden overhead; (3) conversation history is accumulating across turns.
Does the count_tokens endpoint support streaming?
No — count_tokens is a synchronous endpoint that returns the count before you send the actual request. You call it before deciding whether to proceed with the (potentially streamed) message.
Is there a way to count output tokens before generating them?
No. Output token count depends on what Claude generates, which isn't known in advance. You can estimate based on your max_tokens setting (worst case) or historical averages for similar prompts.
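For budgeting, you can still bound the output side: max_tokens gives a hard worst case, and a running average of past output lengths gives an expected case. A sketch using Sonnet pricing (the average output length is a number you would track yourself):

```python
SONNET_INPUT = 3.00    # $ per million input tokens
SONNET_OUTPUT = 15.00  # $ per million output tokens

def worst_case_cost(input_tokens: int, max_tokens: int) -> float:
    """Upper bound: assume every output token in the max_tokens budget is used."""
    return (input_tokens * SONNET_INPUT + max_tokens * SONNET_OUTPUT) / 1_000_000

def expected_cost(input_tokens: int, avg_output_tokens: float) -> float:
    """Estimate from a historical average output length for similar prompts."""
    return (input_tokens * SONNET_INPUT + avg_output_tokens * SONNET_OUTPUT) / 1_000_000

print(f"Worst case: ${worst_case_cost(500, 1024):.6f}")  # → $0.016860
print(f"Expected:   ${expected_cost(500, 300):.6f}")     # → $0.006000
```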
How do tool definitions affect token count?
Each tool definition adds tokens proportional to its description length and input schema complexity. A simple tool with a short description might add 50-100 tokens. A complex tool with a detailed schema can add 300-500 tokens. Multiply by your request volume to understand the cost impact.
Does prompt caching affect token counting?
count_tokens returns the uncached token count. When caching is active, cache_read_input_tokens in the actual response shows how many tokens were served from cache (billed at 90% discount). count_tokens is useful for pre-flight checks; the actual billing reflects cache hits.
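A quick way to see this end to end: count first, send the request, then compare against the usage block. A sketch, assuming client, SYSTEM_PROMPT, and messages are defined as in the earlier examples; the cache fields are only non-zero when caching is enabled via cache_control:

```python
# Pre-flight: count_tokens always reports the uncached total
preflight = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system=SYSTEM_PROMPT,
    messages=messages,
)

response = client.messages.create(
    model="claude-sonnet-4-5",
    system=SYSTEM_PROMPT,
    messages=messages,
    max_tokens=1024,
)

usage = response.usage
print(f"Pre-flight count: {preflight.input_tokens}")
print(f"Billed uncached:  {usage.input_tokens}")
print(f"Read from cache:  {usage.cache_read_input_tokens or 0}")
```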
Related Guides
- Claude API Cost Optimization Guide — Full cost reduction playbook
- Claude Prompt Caching: The 90% Discount Most Devs Miss — Caching deep dive
- Claude API Python SDK Quickstart — Getting started with the SDK
Go Deeper
Claude API Cost Optimization Masterclass — $59 — Real-world cost reduction strategies, token counting utilities, routing logic templates, and case studies showing 50-80% API bill reductions. Includes pre-built Python and TypeScript utilities.
→ Get the Cost Optimization Masterclass — $59
30-day money-back guarantee. Instant download.