Prompt Caching: Cut Your LLM Bill 80% Without Changing Models
A working guide to prompt caching on Anthropic and OpenAI. What it actually does, the exact API changes, the savings you should expect, and the three mistakes that quietly kill your cache hit rate.
June 2, 2026 · 12 min read
Why this matters
Walk into any company shipping AI features today and look at their input token bill. Then look at how much of that input is the same system prompt being sent again and again. For most teams, the answer is “almost all of it.”
A real example: an agent product I helped audit was paying for 12,000 tokens of system prompt, tool definitions, and few-shot examples on every single inference call. They were doing 400,000 calls per day. That is 4.8 billion input tokens per day, of which roughly 4.79 billion was duplicate context they were paying for over and over.
Their monthly input bill: $96,000. Their input bill with prompt caching turned on: $14,000. Same product, same model, same prompts. One config change. $82,000 per month.
Prompt caching is the cheapest meaningful optimization in modern LLM work, and most teams have not turned it on yet. Here is the whole playbook.
How it actually works
Most posts about prompt caching get this wrong. They describe it like a key-value cache: “the model remembers your last prompt.” That is not what is happening.
What actually happens: when you send a prompt, the model’s first job is to run the entire prompt through its attention layers and produce a stack of intermediate tensors called the KV cache. This prefill step is what dominates time-to-first-token. The longer your prompt, the more prefill work, the slower the first token.
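Prompt caching tells the provider: store the KV cache for the prefix of this prompt, and reuse it on subsequent calls where the same prefix appears. When a cache hit lands, the provider skips the prefill work for the cached portion. You pay for the cached input at a steep discount (Anthropic bills cache reads at 0.1x the standard input rate; OpenAI charges 0.5x or less, depending on the model), and TTFT drops dramatically. On Anthropic, the call that writes the cache is billed at a premium over the standard input rate, so the discount starts paying off from the second call onward.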
The key word is prefix. Caching only works if the leading bytes of your prompt are identical to those of a previously sent prompt. One byte of difference at the front and the cache misses.
This is why the order of fields in your prompt matters. We will come back to this.
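To make the prefix rule concrete, here is a minimal sketch that measures how much of a new prompt could reuse an earlier one. It splits on spaces instead of using a real tokenizer, and the prompts are hypothetical, so treat it as an illustration of the matching rule rather than of how any provider implements it.

```typescript
// Length of the shared leading run of two token sequences.
function sharedPrefixLength(a: readonly string[], b: readonly string[]): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

// Hypothetical whitespace-split "tokens"; real tokenizers differ.
const stablePart = "You are a support agent . Follow the policy .".split(" ");
const previousPrompt = [...stablePart, "User:", "reset", "my", "password"];

// Dynamic content appended after the stable part: the prefix still matches.
const appendedPrompt = [...stablePart, "User:", "cancel", "my", "order"];

// Dynamic content placed in front: the very first token differs.
const prependedPrompt = ["Date:", "2026-06-02", ...stablePart, "User:", "cancel", "my", "order"];

const reusableWhenAppended = sharedPrefixLength(previousPrompt, appendedPrompt);
const reusableWhenPrepended = sharedPrefixLength(previousPrompt, prependedPrompt);
```

With the date in front, nothing is reusable even though almost every token of the prompt is unchanged.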
What you should expect to save
Numbers from production deployments, averaged across teams I have worked with and matched against Anthropic’s published data:
- Input cost reduction: 45-80%, depending on what fraction of your average prompt is stable.
- Time to first token: 13-31% faster on cache hits on average, and far more when the cached prefix is long.
- Cache hit rate: 60-95% in well-structured agent workloads, much lower if you fight the format.
The dramatic TTFT win is the underreported one. A 20,000-token system prompt that took 3-4 seconds to prefill from scratch streams its first token in under 500ms on a cache hit. For interactive agent loops where you want responsiveness, this changes the perceived “feel” of the product more than the cost savings do.
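If you want a rough estimate for your own workload before touching any code, the arithmetic behind the cost range can be sketched as below. The field names are mine, the 0.1x read multiplier is the Anthropic figure quoted earlier, and the 1.25x write multiplier is an assumption you should check against your provider's current pricing.

```typescript
interface CacheAssumptions {
  stableFraction: number;  // share of input tokens in the cacheable prefix (0..1)
  prefixHitRate: number;   // share of calls whose prefix is served from cache (0..1)
  readMultiplier: number;  // price of a cached read relative to standard input
  writeMultiplier: number; // price of a cache write relative to standard input
}

// Fraction of the input bill saved, relative to sending everything uncached.
function inputSavings(a: CacheAssumptions): number {
  // Stable tokens are billed as a read on a hit and as a write on a miss.
  const stableCost =
    a.prefixHitRate * a.readMultiplier + (1 - a.prefixHitRate) * a.writeMultiplier;
  // Dynamic tokens always pay the standard rate (multiplier 1).
  const relativeCost = (1 - a.stableFraction) + a.stableFraction * stableCost;
  return 1 - relativeCost;
}

const exampleSavings = inputSavings({
  stableFraction: 0.9,
  prefixHitRate: 0.95,
  readMultiplier: 0.1,   // Anthropic cache-read rate quoted in this guide
  writeMultiplier: 1.25, // assumption: premium on calls that write the cache
});
```

Under those assumed inputs the estimate comes out at roughly 76%, which sits inside the range above. Lower the stable fraction or the hit rate and the savings fall off quickly.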
The exact Anthropic setup
Three lines of API change. Here is the before:
const response = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  system: SYSTEM_PROMPT,
  tools: TOOL_DEFINITIONS,
  messages: [{ role: "user", content: userInput }],
});
Here is the after, with caching:
const response = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  tools: TOOL_DEFINITIONS,
  messages: [{ role: "user", content: userInput }],
});
The change: wrap the system prompt in an array of content blocks, and add cache_control: { type: "ephemeral" } to the last block of the stable prefix.
The default TTL is 5 minutes, and it is refreshed every time the cached prefix is read. On a high-traffic endpoint, that means every call after the first is a cache hit for as long as calls keep arriving less than 5 minutes apart. For lower-traffic endpoints, you can opt into a 1-hour TTL for an additional charge.
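To see why traffic volume decides which TTL you need, here is a small simulation. It is a sketch that assumes one shared cache entry whose lifetime restarts on every read or write, and the call schedules are made up.

```typescript
// Count cache hits for a series of call times (in seconds), assuming the
// entry's lifetime restarts on every hit and every rewrite.
function countCacheHits(callTimesSec: readonly number[], ttlSec: number): number {
  let hits = 0;
  let expiresAt = -Infinity; // nothing cached before the first call
  for (const t of callTimesSec) {
    if (t < expiresAt) hits++; // prefix is still cached
    expiresAt = t + ttlSec;    // a hit refreshes the entry, a miss rewrites it
  }
  return hits;
}

// Hypothetical schedules: ten calls, one per minute vs. one every ten minutes.
const busyEndpoint = Array.from({ length: 10 }, (_, i) => i * 60);
const quietEndpoint = Array.from({ length: 10 }, (_, i) => i * 600);

const busyHits = countCacheHits(busyEndpoint, 300);        // 5-minute TTL
const quietHits = countCacheHits(quietEndpoint, 300);      // 5-minute TTL
const quietHitsLong = countCacheHits(quietEndpoint, 3600); // 1-hour TTL
```

The busy endpoint hits on every call after the first. The quiet one never hits with the 5-minute TTL, and every miss is a cache write billed at a premium, so it is the case where the longer TTL is worth pricing out.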
You can also cache the tool definitions:
tools: [
  ...TOOL_DEFINITIONS.slice(0, -1),
  { ...TOOL_DEFINITIONS.at(-1), cache_control: { type: "ephemeral" } },
],
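The cache_control flag marks the end of a cache block. Everything from the start of the prompt up to that flag gets cached. Anthropic assembles the prefix in the order tools, system, messages, so a flag on the system block already covers the tool definitions; a separate flag on the last tool helps when the tools stay fixed but the system prompt changes.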
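If you mark breakpoints in more than one place, a small helper keeps the "flag the last block" rule in one spot. This is a sketch: the Block type and markLastBlock are hypothetical names, and the tool objects are trimmed to a name only, where real definitions also carry a description and an input schema.

```typescript
type CacheControl = { type: "ephemeral" };

// Loose shape for a tool definition or content block.
interface Block {
  [key: string]: unknown;
  cache_control?: CacheControl;
}

const EPHEMERAL: CacheControl = { type: "ephemeral" };

// Return copies of the blocks with cache_control set on the final one only.
// The input array and its objects are left untouched.
function markLastBlock(blocks: readonly Block[]): Block[] {
  return blocks.map((block, i): Block =>
    i === blocks.length - 1 ? { ...block, cache_control: EPHEMERAL } : { ...block },
  );
}

// Hypothetical, trimmed-down tool list.
const exampleTools: Block[] = [{ name: "fetch" }, { name: "search" }];
const markedTools = markLastBlock(exampleTools);
```

Returning copies instead of mutating matters here: the shared tool registry stays free of cache flags, so the same objects can be sent to an endpoint that does not use caching.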
The exact OpenAI setup
OpenAI’s prompt caching is automatic, with no API change required, but the structure of your prompt determines whether you hit the cache. The rule: prompts of 1,024 tokens or more are eligible, and when the leading tokens match a recent prompt within the cache TTL, the matching prefix is billed at the cached input rate (matched in 128-token increments).
The implication: put your stable content at the front of the prompt, in the same order every time. This is easy to violate by accident if you concatenate dynamic strings into your system message without thinking about order.
You can confirm a cache hit by inspecting the response usage object:
const response = await client.chat.completions.create({
  model: "gpt-5",
  messages: [
    { role: "system", content: STABLE_SYSTEM_PROMPT },
    { role: "user", content: userInput },
  ],
});
console.log(response.usage.prompt_tokens_details.cached_tokens);
// e.g. 4096: the number of tokens you got at the cached rate
If cached_tokens is 0 on calls where you expected a hit, something is breaking the prefix match. Usually it is dynamic content sneaking into the system message.
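A raw token count is hard to eyeball across calls, so it can help to turn it into a share of the prompt. The sketch below works on a plain object shaped like the usage fields shown above; UsageLike and cachedShare are hypothetical names, and the optional chaining is there because the details object may be absent on some responses.

```typescript
// Subset of the usage object; only the fields this helper reads.
interface UsageLike {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number } | null;
}

// Share of prompt tokens billed at the cached rate, from 0 to 1.
function cachedShare(usage: UsageLike): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return usage.prompt_tokens > 0 ? cached / usage.prompt_tokens : 0;
}

// Hypothetical usage payloads.
const warmCall = cachedShare({ prompt_tokens: 5000, prompt_tokens_details: { cached_tokens: 4096 } });
const coldCall = cachedShare({ prompt_tokens: 5000, prompt_tokens_details: { cached_tokens: 0 } });
```

Logging this per endpoint makes a broken prefix obvious: the share drops to zero and stays there, instead of recovering after the first call.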
The three mistakes that kill your hit rate
Mistake 1: Stuffing dynamic content into the system message.
Common pattern: “let me put the user’s timezone and the current date into the system prompt so the model knows it.” Wrong. Anything that changes between calls breaks the prefix match from that point on. Move dynamic content into the user message, or into a separate content block placed after the cached prefix.
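One way to apply that fix is sketched below, with the timezone and date carried in the user message. The prompt text, the bracketed tag format, and buildMessages are all hypothetical; the point is only that the system message never varies.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Hypothetical stand-in for a long, unchanging system prompt.
const STABLE_SYSTEM = "You are a scheduling assistant. Follow the booking policy.";

function buildMessages(userInput: string, timezone: string, isoDate: string): ChatMessage[] {
  return [
    // Byte-identical on every call, so it can serve as the cached prefix.
    { role: "system", content: STABLE_SYSTEM },
    // Per-call context travels with the user turn, after the prefix.
    { role: "user", content: `[timezone: ${timezone}] [date: ${isoDate}]\n${userInput}` },
  ];
}

const mondayCall = buildMessages("Book a room for 3pm", "Europe/Berlin", "2026-06-01");
const tuesdayCall = buildMessages("Move it to 4pm", "Europe/Berlin", "2026-06-02");
```

The model still sees the date and timezone on every call. They have moved to a position where a change costs you a few uncached tokens instead of the whole prefix.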
Mistake 2: Reordering tool definitions.
If you generate your tool definitions from a registry that does not guarantee order, the JSON serialization can differ between calls. One day tools[0] is “search”, the next day it is “fetch”. Cache hit rate drops to zero because the prefix is different.
Fix: sort tool definitions alphabetically by name, or serialize them with a stable key order, before sending.
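A minimal sketch of that fix follows. It sorts tools by name with a plain code-unit comparison (no locale dependence) and rebuilds every object with its keys in sorted order, so the default JSON serialization comes out the same regardless of how the registry produced them. The Json and ToolDef types and the sample tools are hypothetical.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type ToolDef = { name: string; [key: string]: Json };

// Recursively rebuild objects with keys in sorted order; arrays keep their order.
function sortKeys(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => sortKeys(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      const child = value[key];
      if (child !== undefined) out[key] = sortKeys(child);
    }
    return out;
  }
  return value;
}

// Tools sorted by name, each with a stable key order.
function canonicalTools(tools: readonly ToolDef[]): Json[] {
  const sorted = [...tools].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  return sorted.map((tool) => sortKeys(tool));
}

// The same two hypothetical tools, emitted in different orders by a registry.
const runA = canonicalTools([
  { name: "search", description: "Search docs" },
  { description: "Fetch a URL", name: "fetch" },
]);
const runB = canonicalTools([
  { name: "fetch", description: "Fetch a URL" },
  { description: "Search docs", name: "search" },
]);
```

Both runs serialize to the same bytes, which is what the prefix match needs. Arrays inside a tool schema are deliberately left in their original order, since reordering them could change meaning.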
Mistake 3: Treating few-shot examples as dynamic.
Some agent frameworks rotate which few-shot examples they include based on the user query, hoping to give the model the most “relevant” examples. This is the worst possible thing you can do for cache hit rate. Pick a fixed set of examples for your agent and include them in the same order on every call. The marginal quality win from rotating examples is almost always smaller than the marginal cost win from a 90%+ cache hit rate.
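All three mistakes show up the same way: the supposedly stable prefix is not actually stable. A cheap way to catch that, sketched here under my own naming, is to fingerprint the prefix on every call and count how many distinct values you see. The hash is a plain 32-bit FNV-1a, which is enough for spotting drift and not suitable for anything security-related.

```typescript
// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex digits.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash.toString(16).padStart(8, "0");
}

// Tracks how many different stable prefixes have been sent.
class PrefixDriftMonitor {
  private seen = new Map<string, number>();

  record(stablePrefix: string): void {
    const fp = fingerprint(stablePrefix);
    this.seen.set(fp, (this.seen.get(fp) ?? 0) + 1);
  }

  distinctPrefixes(): number {
    return this.seen.size;
  }
}

// Hypothetical prefixes: three identical calls, then one with a date baked in.
const monitor = new PrefixDriftMonitor();
const basePrefix = "SYSTEM: You are an agent.\nTOOLS: fetch, search\nEXAMPLES: A, B, C";
monitor.record(basePrefix);
monitor.record(basePrefix);
monitor.record(basePrefix);
const beforeDrift = monitor.distinctPrefixes();
monitor.record("SYSTEM: You are an agent. Today is 2026-06-02.\nTOOLS: fetch, search\nEXAMPLES: A, B, C");
const afterDrift = monitor.distinctPrefixes();
```

In a healthy deployment the count stays at one per prompt version. If it climbs with traffic, diff two prefixes with different fingerprints and you will usually find a timestamp, a reordered tool, or a rotated example.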
How to measure and prove the savings
You cannot improve what you do not measure. Wire up a daily metric for cache hit rate:
import { client } from "./anthropic";

let totalInput = 0;
let cachedInput = 0;

async function instrumented(messages: any[]) {
  const response = await client.messages.create({ ... });
  const usage = response.usage;
  // input_tokens counts only uncached input; cache reads and cache writes
  // are reported in separate fields, so all three go into the total.
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  totalInput += usage.input_tokens + read + written;
  cachedInput += read;
  return response;
}

// At end of day:
const hitRate = totalInput > 0 ? (cachedInput / totalInput) * 100 : 0;
console.log(`cache hit rate: ${hitRate.toFixed(1)}%`);
console.log(`tokens billed at 90% discount: ${cachedInput}`);
console.log(`approximate dollar saving: $${(cachedInput / 1_000_000 * 13.5).toFixed(2)}`);
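(The 13.5 figure is the per-million saving on Opus input at the 90% cache discount, assuming a standard input rate of $15 per million tokens: 0.9 × 15 = 13.5. It ignores the premium on cache writes. Adjust for your model and provider.)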
Report this number weekly. Once it stabilizes above 80%, you have done the work. Anything lower than 60% usually means one of the three mistakes above is leaking into your prompt.
What to do this week
- Pull your largest API endpoint by token volume.
- Print its full prompt. Identify the parts that are identical on every call (system message, tool definitions, few-shot examples) and the parts that change (user input, retrieved context, dynamic state).
- Reorganize: stable stuff goes to the front, in a deterministic order. Dynamic stuff goes after the cache boundary.
- Add cache_control to the last block of the stable prefix on Anthropic, or trust the automatic caching on OpenAI.
- Wire up the hit-rate metric above.
- Watch your input bill drop by 60-80% in the next billing cycle.
The teams that do this in the first month of running a serious AI product save tens of thousands of dollars across the year. The teams that wait six months pay every dollar of that money to the provider. The provider is happy either way; you should not be.
Turn it on this week.