Prompt caching is arguably the highest-ROI feature in the Anthropic API, and many teams using the API still haven’t turned it on. That’s wild, given that, correctly applied, it cuts the price of cached input tokens by 90% and can take well over half off the total bill on the kinds of workloads most real products run.
This primer covers exactly what it is, the February 2026 update that removed the last blocker for production multi-tenant use, and the precise code changes to get there. Budget ten minutes to read, five minutes to apply.
What prompt caching does
Every time you call the Anthropic API, you send some tokens. Some of those tokens change every call (the user’s latest message). Most of them don’t (the system prompt, tool definitions, long reference documents, conversation history). Without caching, you pay full input-token price on all of them every time.
With caching, Anthropic keeps the unchanging prefix in a warm cache. On the second call with the same prefix, you pay only 10% of input-token price for the cached portion — a 90% discount. You still pay full price for the new tokens at the end.
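That’s the whole idea. The math is dramatic: a prompt that’s 90% static (typical for an agent with a long system prompt + tool definitions) gets a ~81% reduction in input-token cost (90% of the tokens at a 90% discount) on every call after the first that hits the cache. Output tokens are billed as usual.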
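The arithmetic behind that figure can be sketched in a few lines. This is a minimal illustration, not an official pricing formula: the function name `input_cost_reduction` is made up for this example, and the 10% read price is the one quoted above.

```python
def input_cost_reduction(static_fraction: float, read_multiplier: float = 0.10) -> float:
    """Fraction of input-token cost saved on a cache hit.

    static_fraction: share of the prompt served from cache (0.0 to 1.0).
    read_multiplier: cached-read price relative to normal input (10% here).
    """
    if not 0.0 <= static_fraction <= 1.0:
        raise ValueError("static_fraction must be between 0 and 1")
    return static_fraction * (1.0 - read_multiplier)


# How the saving moves with the static share of the prompt.
for share in (0.5, 0.9, 0.99):
    print(f"{share:.0%} static -> {input_cost_reduction(share):.0%} lower input cost")
```

The saving applies to input tokens only, so the effect on the total bill depends on how much output each call generates.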
Why now
Prompt caching has existed since 2024. What changed in 2026 — and why it’s worth writing about now — is the workspace-level cache isolation update Anthropic shipped on February 5, 2026.
Before that update, the cache was scoped to your API key. In a multi-tenant product where different customers share your backend infrastructure, that meant two theoretical concerns:
- Customer A’s cached system prompt content could, under specific timing windows, be read by customer B’s request if it happened to include a byte-identical prefix.
- Even without cross-customer reads, billing attribution got murky — who “paid” for a cache miss and who benefited from the subsequent hit?
In practice the first concern was very unlikely to cause real leakage, but it was enough of a theoretical risk that careful engineering teams disabled caching in multi-tenant deployments and ate the cost.
The February 2026 update added workspace IDs: a per-request namespace that isolates cache entries. Now customer A’s prefix and customer B’s prefix, even if byte-identical, occupy separate cache slots with zero cross-read risk. The billing attribution is explicit: each workspace has its own cache stats.
That change removed the last reason not to turn caching on in production. If you were deferring caching for multi-tenant-safety reasons, that reason is gone.
How the cache actually works
Four things worth knowing:
1. You mark what to cache, not everything
Caching is opt-in per block. You insert cache_control markers in your request to tell Anthropic “the content up to this point should be cached”:
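import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Cache everything from the start of the request through this block.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": user_query}],
)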
The cache_control marker tells Anthropic to cache everything from the start of the request through that block. The user message at the end is not cached; it’s new content, billed at the normal input price.
2. The cache key is the full prefix
Caching is based on exact prefix match. If a single byte in the cached region differs between two requests, it’s a cache miss. This matters for:
- Dates and timestamps. Don’t include datetime.now() in your cached system prompt. You’ll get 0% cache hit rate.
- Per-user personalization. If you personalize the system prompt with the user’s name, each user gets their own cache line. Cache hits happen only on that user’s repeat calls. Still worth it, but the math is different than you might think.
- Tool definitions. If you dynamically generate tool schemas and the generation isn’t deterministic (e.g. non-sorted object keys), the cache will miss. Always serialize deterministically.
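One way to guard against these misses is to build the prefix deterministically and log a fingerprint of it. The sketch below assumes nothing beyond the standard library; `canonical_json`, `prefix_fingerprint`, and the `get_weather` tool are hypothetical names for illustration, and the fingerprint is a local debugging aid, not the key the API itself uses.

```python
import hashlib
import json


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed separators so equal data gives equal text."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def prefix_fingerprint(system_prompt: str, tools: list) -> str:
    """Short hash of the cacheable prefix; log it per request to spot drift."""
    payload = canonical_json({"system": system_prompt, "tools": tools})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Same schema, keys inserted in a different order (hypothetical tool).
tool_a = {"name": "get_weather", "input_schema": {"type": "object", "properties": {}}}
tool_b = {"input_schema": {"properties": {}, "type": "object"}, "name": "get_weather"}

assert json.dumps(tool_a) != json.dumps(tool_b)          # naive dumps differ
assert canonical_json(tool_a) == canonical_json(tool_b)  # canonical form matches
print(prefix_fingerprint("You are a helpful assistant.", [tool_a]))
```

If the fingerprint changes between two requests that should share a prefix, something in the system prompt or tool list is drifting.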
3. Cache has a TTL
Cached prefixes expire. Two TTLs are available:
- Standard (5 minutes): the default. Cache writes cost a bit more than normal input (1.25x at the time of writing), so it is cheap to create and cheap to miss.
- Extended (1 hour): opt-in via cache_control: {"type": "ephemeral", "ttl": "1h"}. Costs 2x input price to write to the cache, but reads are still 10% of normal price. Worth it when your prefix is long and your traffic is sparser than one hit every five minutes.
If your API is getting hit constantly (every few seconds), standard TTL is fine — the cache stays warm naturally. If traffic is intermittent (a user comes back every 20 minutes), extended TTL amortizes better.
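That rule of thumb can be turned into a rough chooser. This is a sketch under simplifying assumptions: requests arrive evenly spaced, each cache hit refreshes the TTL, and the 1.25x write price for the five-minute tier is an assumption to check against current pricing (the 2x write and 10% read figures are the ones quoted above). `pick_ttl` is a hypothetical helper, not part of any SDK.

```python
# Price multipliers relative to normal input price.
READ = 0.10
OPTIONS = {
    "5m": {"ttl_seconds": 300, "write": 1.25},   # write price is an assumption
    "1h": {"ttl_seconds": 3600, "write": 2.00},
}


def steady_state_multiplier(gap_seconds: float, option: str) -> float:
    """Per-request price multiplier on the prefix once traffic is steady.

    If the gap is shorter than the TTL the prefix stays warm and every call
    is a read; otherwise every call re-writes the cache.
    """
    cfg = OPTIONS[option]
    return READ if gap_seconds < cfg["ttl_seconds"] else cfg["write"]


def pick_ttl(gap_seconds: float) -> str:
    """Return "5m", "1h", or "none" for a given gap between requests."""
    best, best_cost = "none", 1.0  # no caching: full input price
    for name in ("5m", "1h"):      # checked in order, so ties go to the cheaper write
        cost = steady_state_multiplier(gap_seconds, name)
        if cost < best_cost:
            best, best_cost = name, cost
    return best


for gap in (5, 20 * 60, 2 * 3600):
    print(f"gap {gap:>5}s -> {pick_ttl(gap)}")
```

Real traffic is bursty rather than evenly spaced, so treat the result as a starting point and confirm it against the hit rates you actually observe.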
4. Up to four cache breakpoints per request
You can place up to four cache_control markers in a single request. This matters for agents with layered context:
system=[
    {"type": "text", "text": BASE_SYSTEM_PROMPT,
     "cache_control": {"type": "ephemeral"}},  # breakpoint 1
    {"type": "text", "text": TOOLS_DESCRIPTION,
     "cache_control": {"type": "ephemeral"}},  # breakpoint 2
    {"type": "text", "text": f"Project: {project_context}",
     "cache_control": {"type": "ephemeral"}},  # breakpoint 3
]
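Anthropic looks up the longest cached prefix that ends at one of your breakpoints and bills accordingly: everything up to that breakpoint is read at 10% of input price, anything after it that still falls before your last breakpoint is written to the cache at the write price, and whatever follows the last breakpoint is billed as normal input. Because matching is by prefix, a change in an early block invalidates every breakpoint after it, so order your blocks from most stable to least stable.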
The practical value: you can cache stable stuff (base prompt, tool defs) globally, and per-project stuff per project. Multiple customers doing different project work still all hit the global base prompt cache; only the per-project portion varies.
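A small helper can keep the layering honest. The sketch below is illustrative only: `build_system_blocks` and `shared_layers` are hypothetical names, and the layer contents are placeholders. It builds the block list from an ordered set of layers, enforces the four-breakpoint limit, and shows how many leading layers two requests have in common.

```python
MAX_BREAKPOINTS = 4  # limit stated above


def build_system_blocks(layers: list[str]) -> list[dict]:
    """Turn ordered text layers (most stable first) into system content blocks,
    each ending in a cache breakpoint."""
    if len(layers) > MAX_BREAKPOINTS:
        raise ValueError(f"at most {MAX_BREAKPOINTS} cache breakpoints per request")
    return [
        {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}
        for text in layers
    ]


def shared_layers(a: list[str], b: list[str]) -> int:
    """Count leading layers that are identical, i.e. the layers a second
    request could read from cache after the first one wrote them."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


alpha = ["base prompt", "tool descriptions", "Project: alpha"]
beta = ["base prompt", "tool descriptions", "Project: beta"]
print(len(build_system_blocks(alpha)), "blocks;", shared_layers(alpha, beta), "layers shared")
```

Only a leading run of identical layers counts, which is why the per-project text belongs last.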
The workspace-isolation update in detail
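As of February 5, 2026, the request shape described in this update supports an optional workspace_id parameter (confirm the exact name and placement against the current API reference before relying on it):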
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,  # required on every Messages request
    workspace_id="customer-42",  # new in Feb 2026
    system=[
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": user_query}],
)
Three behaviors worth knowing:
- Cache entries are scoped to (api_key, workspace_id). Same API key with different workspace IDs gets different cache slots.
- Omitting workspace_id keeps the old behavior: a shared cache across your API key. This preserves backward compatibility with code written before February.
- Usage stats are broken out per workspace in your API dashboard. Each customer’s cache hit rate is separately visible.
For any multi-tenant product, the right move is: set workspace_id to whatever identifies a customer in your system, and forget about it. You get per-customer isolation for free, and the cache performs as well as — usually better than — an unscoped shared cache, because workspace cache slots don’t thrash each other.
What to actually cache
In descending order of payoff:
1. The system prompt. Usually the biggest static block. Cache it always.
2. Tool definitions. Schema JSON for your tools is often 2–10k tokens in a serious agent. Cache it.
3. Reference documents the model consults per-call. RAG results, API docs you paste in, coding style guides. If they’re the same across many calls, cache them.
4. Conversation history. In a multi-turn conversation, the whole prior turn list is stable. Mark a cache breakpoint at the end of the assistant’s last message. The next user message is the only new content.
5. Few-shot examples. Large blocks of example input/output pairs. Classic cache target.
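For the conversation-history case, the breakpoint goes on the last content block of the final message in the history. A minimal sketch, assuming messages are plain dicts in the Messages API shape and every message has non-empty content; `with_history_breakpoint` is a hypothetical helper name:

```python
import copy


def with_history_breakpoint(messages: list[dict]) -> list[dict]:
    """Return a copy of messages with a cache breakpoint on the last block
    of the final message, so the next turn can reuse everything up to it."""
    if not messages:
        return []
    marked = copy.deepcopy(messages)  # leave the caller's history untouched
    last = marked[-1]
    # String content has nowhere to hang cache_control, so convert it to a text block.
    if isinstance(last["content"], str):
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked


history = [
    {"role": "user", "content": "What is prompt caching?"},
    {"role": "assistant", "content": "It reuses a previously processed prefix."},
]
next_request = with_history_breakpoint(history) + [
    {"role": "user", "content": "How long does it last?"}
]
print(next_request[1]["content"])
```

Each breakpoint counts toward the four-per-request limit, so mark only the end of the history rather than every turn.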
What not to cache
- Anything varying per-request. Timestamps, request IDs, random values — don’t put these in cached regions.
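- Very short prefixes. Caching has a minimum cacheable length that depends on the model: 1,024 tokens on many models, higher on some. Below the minimum, the marker is ignored and the request is processed without caching.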
- Frequently-updated content. If your “stable” content actually changes every few minutes, you’ll thrash the cache and pay the write premium with nothing to show for it.
Verifying your cache is working
After enabling caching, verify it’s actually hitting. The API response includes usage stats:
{
  "usage": {
    "input_tokens": 42,
    "cache_creation_input_tokens": 8124,
    "cache_read_input_tokens": 0,
    "output_tokens": 240
  }
}
On the first request, cache_creation_input_tokens will be non-zero — you wrote to the cache. On subsequent requests with the same prefix, cache_read_input_tokens will be non-zero and cache_creation_input_tokens will drop to 0. That’s the confirmation.
A quick verification script:
# request: dict of keyword arguments with the same cached prefix each time
for i in range(3):
    r = client.messages.create(**request)
    print(f"request {i}: read={r.usage.cache_read_input_tokens} "
          f"create={r.usage.cache_creation_input_tokens}")
Expected output:
request 0: read=0 create=8124
request 1: read=8124 create=0
request 2: read=8124 create=0
If you see read=0 on requests after the first, something is different in your prefix between calls. Debug by diffing your actual serialized request payloads.
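A payload diff does not need special tooling. The sketch below serializes two request dicts in their original key order (on purpose, so key-order drift shows up too) and reports where they first diverge. `first_divergence` and the sample payloads are hypothetical, for illustration only.

```python
import json


def first_divergence(payload_a: dict, payload_b: dict, context: int = 30):
    """Locate the first character where two serialized requests differ.

    Returns None when the serialized forms are identical, else a tuple of
    (index, snippet_from_a, snippet_from_b) around the first difference.
    """
    a = json.dumps(payload_a, ensure_ascii=False)
    b = json.dumps(payload_b, ensure_ascii=False)
    if a == b:
        return None
    limit = min(len(a), len(b))
    # First mismatching index, or the shorter length if one is a prefix of the other.
    idx = next((i for i in range(limit) if a[i] != b[i]), limit)
    start = max(0, idx - context)
    return idx, a[start:idx + context], b[start:idx + context]


# A date baked into the system prompt is a typical culprit.
yesterday = {"system": "Today is 2026-03-01. You are a support agent."}
today = {"system": "Today is 2026-03-02. You are a support agent."}
print(first_divergence(yesterday, today))
```

Anything that diverges before your last cache_control marker explains the miss; differences after it are expected.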
Cost math, concrete
Let’s price a realistic agent workload:
- System prompt + tool definitions + reference docs: 12,000 input tokens, static.
- Per-call user message: 500 input tokens, new each time.
- Output: 800 output tokens, new each time.
- Traffic: 10,000 calls/day.
- Model: claude-sonnet-4-6 (illustrative pricing; check current rates).
Without caching, per-call cost:
- Input: 12,500 tokens × $3/M = $0.0375
- Output: 800 tokens × $15/M = $0.012
- Per call: $0.0495
- Per day: $495
With caching (90% discount on cached input):
- Cache read: 12,000 × $0.30/M = $0.0036
- New input: 500 × $3/M = $0.0015
- Output: 800 × $15/M = $0.012
- Per call: $0.0171
- Per day: $171
That’s a $324/day savings (about 65% of the bill), or ~$118k/year, from a one-line configuration change. Cache writes are billed at a premium and some calls will miss, so actual savings will be slightly lower, but not meaningfully.
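To rerun this with your own numbers, the calculation fits in one function. This is a sketch under the same simplification as the figures above: every call is treated as a cache hit and write premiums are ignored. `daily_cost` is a hypothetical helper, and the default prices are the illustrative ones from this example, not current rates.

```python
def daily_cost(
    static_tokens: int,
    new_tokens: int,
    output_tokens: int,
    calls_per_day: int,
    input_per_mtok: float = 3.00,    # illustrative, as in the example above
    output_per_mtok: float = 15.00,  # illustrative
    read_multiplier: float = 0.10,   # cached reads at 10% of input price
    cached: bool = True,
) -> float:
    """Daily cost in dollars, assuming every call is a cache hit when cached=True."""
    static_rate = input_per_mtok * (read_multiplier if cached else 1.0)
    per_call = (
        static_tokens * static_rate
        + new_tokens * input_per_mtok
        + output_tokens * output_per_mtok
    ) / 1_000_000
    return per_call * calls_per_day


baseline = daily_cost(12_000, 500, 800, 10_000, cached=False)
with_cache = daily_cost(12_000, 500, 800, 10_000, cached=True)
print(f"without caching: ${baseline:,.2f}/day")
print(f"with caching:    ${with_cache:,.2f}/day")
print(f"saved:           {1 - with_cache / baseline:.0%}")
```

Output tokens dominate the cached bill here, which is why the overall saving is lower than the 90% discount on the cached input alone.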
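The only way this math doesn’t work in your favor is if your traffic is so sparse that every request is a cache miss. If you’re under ~20 requests/hour on the same prefix, the five-minute cache will often expire between calls, so consider extended TTL; once the gaps between requests regularly exceed the one-hour TTL as well, every call pays the write premium and caching might genuinely not help.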
The five-minute enablement path
If you’re already using the Anthropic SDK:
- Find your system prompt call. The one where system=... is a long string.
- Wrap it in the content-block format with a cache_control marker:
  system=[
      {"type": "text", "text": YOUR_SYSTEM_PROMPT,
       "cache_control": {"type": "ephemeral"}},
  ]
- Deploy. Watch the cache_read_input_tokens field climb in your logs over the next few requests.
- If you’re multi-tenant, add workspace_id=customer_id to the request. Takes 30 seconds.
- Do the same for tool definitions if you have a large tool schema. Add a second cache_control marker after tools.
That’s the whole thing. Five minutes of work for the cost reduction above.
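The system-prompt and tool-definition steps can be wrapped in one helper. A minimal sketch, assuming tools are plain dicts in the Messages API shape; `cached_request_kwargs` and the `search_docs` tool are hypothetical names for illustration:

```python
import copy
from typing import Optional

EPHEMERAL = {"type": "ephemeral"}


def cached_request_kwargs(system_prompt: str, tools: Optional[list] = None) -> dict:
    """Build the system= and tools= arguments with cache breakpoints attached."""
    kwargs = {
        "system": [
            {"type": "text", "text": system_prompt, "cache_control": dict(EPHEMERAL)}
        ]
    }
    if tools:
        marked = copy.deepcopy(tools)  # do not mutate the caller's definitions
        # One marker on the final tool covers the whole tool list before it.
        marked[-1]["cache_control"] = dict(EPHEMERAL)
        kwargs["tools"] = marked
    return kwargs


example_tools = [
    {"name": "search_docs", "description": "Search internal docs.",
     "input_schema": {"type": "object", "properties": {}}},
]
print(cached_request_kwargs("You are a support agent.", example_tools).keys())
```

The result can be splatted into an existing call, for example `client.messages.create(model=..., max_tokens=..., messages=..., **cached_request_kwargs(PROMPT, TOOLS))`, leaving the rest of the request unchanged.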
Pitfalls
Prefix drift from formatting. JSON serializers default to non-stable key ordering in some languages. Use a serializer with sort_keys=True (Python) or equivalent. Diff your actual outgoing payloads if you’re seeing cache misses you don’t understand.
Cache vs. context-window confusion. Caching does not reduce the tokens Anthropic counts against your context window. A 190k-token cached prefix still takes 190k of the 200k window. Caching is a billing feature, not a context-window feature.
Output-caching expectations. Only input tokens are cached. Output tokens are always generated fresh. Caching cannot make a response return faster in the sense of skipping generation; it makes the input processing faster and cheaper. The latency gain is real, but it shows up in time to first token and grows with the length of the cached prefix, not in generation speed.
Workspace-ID sprawl. Don’t create workspace IDs gratuitously — too many low-traffic workspace IDs means lots of cold caches. Group customers into workspaces when it makes sense (e.g. by plan tier), not one-per-user for a million users.
Where to go next
- Anthropic’s prompt caching docs — the authoritative reference, including current pricing.
- The Opus 4.7 primer — caching pairs naturally with the memory tool; the memory-tool system prompt is a prime caching target.
- Run the cost math above with your actual workload numbers. The answer is almost always “enable caching yesterday.”
If I had to pick one API feature every team building with Claude should turn on before everything else, it’s this one. The lift is trivial, the savings are real, and as of February 2026, the last excuse to defer it is gone.