LLM App Operations #3: Prompt Caching in Practice

Saturday, June 27, 2026

In Part 2 we trimmed inputs and outputs and split traffic across models. But looking back at the logs from Part 1, a large chunk of input tokens repeats identically on every request: the system prompt, the tool definitions, the shared instructions in RAG. Prompt caching stores that repeated portion on the API side and reuses it, and tokens read from the cache cost about one tenth of the base rate. When it is set up properly, more than half of your input cost disappears, which makes it the single most powerful technique at the operations stage.

The core principle: caching is prefix matching

The principle you need to work with caching is really one sentence. The cache hits only when the front of the request (the prefix) is byte-for-byte identical. A request is rendered in the order tool definitions → system prompt → messages, and if that serialized front differs from the previous request by even one character, everything from that point on is invalidated.

The design guideline follows directly from this principle. Put what does not change at the front, and what changes at the back.

A cache-friendly request structure

[tool definitions]   ← fixed (order fixed too)
[system prompt]      ← fixed
--- cache_control boundary ---
[conversation history]  ← varies per session
[current question]      ← varies every time
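
To make the prefix rule concrete, here is a minimal sketch that measures how many leading bytes two requests share. The `render` function is a hypothetical stand-in for the server-side serialization, which is not public; only the ordering (tools, then system, then messages) is taken from the description above.

```python
import json

def render(tools, system, messages):
    """Hypothetical stand-in for request rendering: tools, system, messages."""
    parts = [json.dumps(tools), json.dumps(system), json.dumps(messages)]
    return "\n".join(parts).encode("utf-8")

def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that are identical in a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

tools = [{"name": "search", "description": "Search the docs"}]
system = "You are a support assistant."
question = [{"role": "user", "content": "How do I reset my password?"}]

first = render(tools, system, question)
# Only the question changes: tools and system stay inside the shared prefix.
second = render(tools, system, [{"role": "user", "content": "Where is my invoice?"}])
# A timestamp at the front of the system prompt: sharing stops right after the tools.
third = render(tools, "Current time: 09:01. " + system, question)

tools_len = len(json.dumps(tools).encode("utf-8"))
system_len = len(json.dumps(system).encode("utf-8"))
question_changed = shared_prefix_len(first, second)
system_changed = shared_prefix_len(first, third)
print(question_changed, system_changed, tools_len)
```

In this sketch the request with a changed question still shares the tools and the whole system prompt, while the request with a timestamp in the system prompt shares little more than the tool definitions.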

cache_control: marking the boundary

You mark the cache boundary with cache_control. The prefix up to and including the marked block becomes the cache target.

prompt_caching.py

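import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# tools, SYSTEM_PROMPT, conversation, and question are defined elsewhere in the app.
response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,
    tools=tools,                          # tools render before system
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,            # fixed system prompt
        "cache_control": {"type": "ephemeral"},   # cache up to here
    }],
    messages=conversation + [{"role": "user", "content": question}],
)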

A single marker on the system block also caches the tool definitions in front of it. The lifetime (TTL) is 5 minutes by default and is extended on every hit. If your service gets requests within 5 minutes of each other, the first request writes the cache and the following requests keep reading it. For sparse traffic there is a 1-hour TTL option ("ttl": "1h"), but the write rate is higher, so you need more hits to break even.
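
As a small sketch of the two lifetimes, the helper below builds a system block with either the default marker or the 1-hour option mentioned above. The helper name `cached_text_block` and the prompt text are made up for illustration; only the `cache_control` shape comes from the text.

```python
def cached_text_block(text, ttl=None):
    """Build a system text block that closes a cacheable prefix.

    ttl=None keeps the default 5-minute lifetime; ttl="1h" asks for the 1-hour option.
    """
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        cache_control["ttl"] = ttl
    return {"type": "text", "text": text, "cache_control": cache_control}

SYSTEM_PROMPT = "You are a support assistant for the billing product."  # placeholder text

default_block = cached_text_block(SYSTEM_PROMPT)           # 5-minute TTL
hourly_block = cached_text_block(SYSTEM_PROMPT, ttl="1h")  # 1-hour TTL, higher write rate
print(hourly_block["cache_control"])
```

Either block would be passed as `system=[default_block]` or `system=[hourly_block]` in the call shown earlier.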

The economics: when is it a win

Caching is not free; it is a trade. The pricing structure is the decision criterion.

Item	Rate (vs. base input)
Cache write (5-minute TTL)	1.25x
Cache write (1-hour TTL)	2x
Cache read	~0.1x

With the 5-minute TTL, you are already past break-even on the second request: one write plus one read costs 1.25 + 0.1 = 1.35 times the prefix, against 2 for two uncached requests. In other words, if you call with the same prefix at least twice within 5 minutes, caching wins. Conversely, if the prefix differs per request (a personalized system prompt, for example), you pay the write cost every time and never read; if the prefix is too short, the cache is never created at all. Each model has a minimum cacheable prefix length (roughly one thousand to a few thousand tokens, depending on the model), so keep in mind that short prompts are silently not cached even when marked.
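
The break-even reasoning can be checked with a few lines of arithmetic. This is a sketch under two simplifying assumptions: costs are expressed in units of one uncached pass over the prefix, and every request after the first arrives before the cache expires. The rates are the ones in the table above.

```python
WRITE_5M = 1.25  # cache write, 5-minute TTL
WRITE_1H = 2.0   # cache write, 1-hour TTL
READ = 0.1       # cache read (approximate)

def cached_cost(n_requests, write_rate):
    """Relative prefix cost for n requests: one write, then n - 1 reads."""
    return write_rate + READ * (n_requests - 1)

def break_even_requests(write_rate):
    """Smallest request count at which caching is cheaper than not caching."""
    n = 1
    while cached_cost(n, write_rate) >= n:  # uncached cost of n requests is n
        n += 1
    return n

print(break_even_requests(WRITE_5M), break_even_requests(WRITE_1H))
```

Under these assumptions the 5-minute TTL pays off from the second request and the 1-hour TTL from the third, which is the sense in which the longer TTL needs more hits.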

You verify with the logs we already set up in Part 1. If cache_read_input_tokens in usage is nonzero, you are hitting.

From the Part 1 logs

{"feature": "answer_question", "input_tokens": 412,
 "cache_read": 8120, "cache_write": 0, ...}

Of the total 8,532 input tokens, 8,120 came from the cache, so this request is billed as roughly 412 + 0.1 × 8,120 = 1,224 tokens' worth of input, about 14% of what it cost before caching.
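
The same arithmetic can be applied to any log line. The sketch below assumes the Part 1 field names shown above (`input_tokens`, `cache_read`, `cache_write`), uses the rates from the pricing table, and treats `input_tokens` as the uncached remainder; the function name is made up for illustration.

```python
def input_cost_ratio(input_tokens, cache_read, cache_write,
                     write_rate=1.25, read_rate=0.1):
    """Input cost of one request relative to sending every token uncached."""
    total = input_tokens + cache_read + cache_write
    billed = input_tokens + read_rate * cache_read + write_rate * cache_write
    return billed / total

log_line = {"feature": "answer_question", "input_tokens": 412,
            "cache_read": 8120, "cache_write": 0}

ratio = input_cost_ratio(log_line["input_tokens"],
                         log_line["cache_read"],
                         log_line["cache_write"])
print(f"{ratio:.1%}")
```

A ratio above 1.0 means the request paid for a cache write without reading anything back, which is expected on the first request and a warning sign if it persists.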

Silent cache invalidation: when the hit rate is 0

If you turned on caching and cache_read stays at 0, the prefix is almost certainly differing slightly on every request somewhere. It misses silently with no error, so the fastest way to find it is to work through an audit list.

Dynamic values in the system prompt — insertions like “Current time: …” or “User name: …” are the usual culprits. Move dynamic information out of system and into the messages side (behind the boundary).
Non-deterministic serialization — if key order wobbles when you build tool definitions from a dict, the bytes change. Pin the ordering.
A shifting tool list — adding and removing tools per request invalidates from position 0. In AI Agent Development Part 2 we recommended per-task tool lists; from a caching standpoint, the condition is that the list must be fixed per task.
Model changes — caches are per model. If Part 2’s routing splits traffic across models, each model accumulates its own cache (this is normal; you just need hits within each route).
Session IDs or random values — if a UUID is embedded anywhere in the prefix, you will never hit.

In short, 80% of caching is not the cache_control marker but the discipline of keeping the prefix fixed.
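
For the serialization and tool-list items in the audit list, one possible approach is to pin the ordering once and log a short fingerprint of the fixed prefix, so that drift shows up as a changing value in the logs. Both helper names are hypothetical, and sorting tools by name is just one way to fix the order; the sketch also assumes the client serializes dicts in insertion order.

```python
import hashlib
import json

def pin_tool_order(tools):
    """Return tools sorted by name, with every dict's keys in sorted order."""
    # Round-tripping through sorted JSON rebuilds each dict with sorted keys,
    # because Python dicts preserve insertion order.
    pinned = json.loads(json.dumps(tools, sort_keys=True))
    return sorted(pinned, key=lambda tool: tool["name"])

def prefix_fingerprint(tools, system_prompt):
    """Short hash of the fixed prefix; changes with key order and tool order."""
    payload = json.dumps({"tools": tools, "system": system_prompt},
                         ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

# The same tool, built with different key orders.
tools_a = [{"name": "search", "description": "Search docs",
            "input_schema": {"type": "object", "properties": {}}}]
tools_b = [{"input_schema": {"properties": {}, "type": "object"},
            "description": "Search docs", "name": "search"}]

raw_match = prefix_fingerprint(tools_a, "sys") == prefix_fingerprint(tools_b, "sys")
pinned_match = (prefix_fingerprint(pin_tool_order(tools_a), "sys")
                == prefix_fingerprint(pin_tool_order(tools_b), "sys"))
print(raw_match, pinned_match)
```

Logging the fingerprint next to `cache_read` makes it easy to see whether a run of misses lines up with a prefix that keeps changing.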

In conversational apps: caching the history too

Apps with long conversations, like chatbots, can go one step further: put not just the system prompt but the past conversation history inside the cache boundary. Place a marker on the block just before the last user message, and on the next turn everything up through the previous turn is read from the cache. As turns accumulate, the input grows, but the earlier turns are billed at the cache-read rate and only the newest portion at the write or base rate. The effect is especially large in workloads where calls run back-to-back on the same conversation, like the loop in the agent series.
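
A minimal sketch of that marker placement follows. It assumes the stored history holds only text content (plain strings or lists of text blocks) and carries no markers of its own; the function name is made up for illustration.

```python
import copy

def with_history_cache(conversation, question):
    """Return messages with a cache marker on the last block of the history."""
    messages = copy.deepcopy(conversation)  # leave the stored history untouched
    if messages:
        last = messages[-1]
        # Content may be a plain string; a marker needs a block to attach to.
        if isinstance(last["content"], str):
            last["content"] = [{"type": "text", "text": last["content"]}]
        last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    messages.append({"role": "user", "content": question})
    return messages

conversation = [
    {"role": "user", "content": "What plans do you offer?"},
    {"role": "assistant", "content": "We offer Basic and Pro."},
]
messages = with_history_cache(conversation, "What does Pro cost?")
print(messages[1]["content"][-1]["cache_control"])
```

Because the marker is recomputed on every turn from an unmarked history, each request carries a single history marker in addition to the one on the system block.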

Where people commonly trip up

Marking without verifying — attaching cache_control is not the end. You only catch silent invalidation by confirming in the logs that the cache_read metric is actually nonzero.
Putting dynamic values in the system prompt — one line with a date invalidates the entire cache. Send everything that changes behind the boundary.
Expecting it to work on short prompts — below the minimum length, marking does nothing. Apply caching first to features with long fixed prefixes.
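
The second item can be sketched as follows: the system block stays byte-identical across users and timestamps, and the per-request values travel in the user message behind the boundary. The function name, prompt text, and message layout are assumptions for illustration.

```python
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a support assistant. Answer from the provided documents."  # fixed text

def build_request_parts(question, user_name, now=None):
    """Keep the system block fixed; put per-request values in the user message."""
    now = now or datetime.now(timezone.utc)
    system = [{
        "type": "text",
        "text": SYSTEM_PROMPT,                   # no date, no user name here
        "cache_control": {"type": "ephemeral"},
    }]
    context = f"Current time: {now.isoformat()}\nUser name: {user_name}"
    messages = [{"role": "user", "content": f"{context}\n\n{question}"}]
    return system, messages

system_a, messages_a = build_request_parts(
    "Where is my invoice?", "Ana", datetime(2026, 6, 27, 9, 0, tzinfo=timezone.utc))
system_b, messages_b = build_request_parts(
    "Where is my invoice?", "Ben", datetime(2026, 6, 27, 9, 5, tzinfo=timezone.utc))
print(system_a == system_b, messages_a == messages_b)
```

The two requests differ only after the cached prefix, so the second one can still read what the first one wrote.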

Wrapping up

In this post we made repeated input one tenth the price.

Caching is prefix matching. A structure that puts the unchanging at the front and the changing at the back is where everything starts.
With the 5-minute TTL, you win from the second request on. Verify hits with the cache_read metric.
A hit rate of 0 is usually silent invalidation (dynamic values, serialization wobble, shifting tools). Find it with the audit list.

So far we have dealt with the cost of real-time requests. But some work does not need an answer right away. In the next post, “LLM App Operations #4: Batching — Half Price for Non-Urgent Work,” we use the Batches API to cut the cost of non-real-time work in half.