LLM App Operations #2: Cost — Token Accounting and Model Routing

With the instrumentation from Part 1 you can see where the money goes; the next step is to cut it. The ways to reduce cost differ greatly in impact, so this post works through the big levers in order: shrink the input, shrink the output, and then the biggest one, routing each task to the model it actually needs.

Measure before you send — count_tokens

Start with the basic tool of token accounting. The API has a count_tokens endpoint that returns a request's input token count without generating a response.

count_tokens.py
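import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# SYSTEM (str) and messages (list of message dicts) are the ones the real request will send
count = client.messages.count_tokens(
    model="claude-opus-4-8",
    system=SYSTEM,
    messages=messages,
)
print(count.input_tokens)   # input token count for this request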

It has two uses. The first is an upper-bound gate: if a user uploads an abnormally large document, you find out before sending and can reject or split it. The second is a design-time estimate: when you edit the system prompt or change the number of RAG chunks, you can see how many tokens the input grew by before you deploy. One caveat: tokenization is model-specific, so count with the model you will actually use.
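As a rough illustration of the upper-bound gate, the sketch below wraps the count in a small check. The names check_input_budget, count_with_api, and MAX_INPUT_TOKENS, and the 50,000 limit, are assumptions made for illustration and are not part of the API. The counter is passed in as a function so the gate can be exercised without a network call.

```python
from typing import Callable

# Hypothetical budget; pick one from your own context-window and cost limits.
MAX_INPUT_TOKENS = 50_000


def check_input_budget(
    count_input_tokens: Callable[[list[dict]], int],
    messages: list[dict],
    limit: int = MAX_INPUT_TOKENS,
) -> tuple[bool, int]:
    """Return (allowed, token_count) before the real request is sent."""
    n = count_input_tokens(messages)
    return n <= limit, n


def count_with_api(client, model: str, system: str) -> Callable[[list[dict]], int]:
    """Adapter around the count_tokens call shown in count_tokens.py."""
    def _count(messages: list[dict]) -> int:
        result = client.messages.count_tokens(
            model=model, system=system, messages=messages
        )
        return result.input_tokens
    return _count
```

In an app, the caller would build the counter once with count_with_api(client, model, SYSTEM), then reject or split the upload whenever the first element of the returned tuple is False.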

Input diet — suspect the things that pile up

When you examine the features with large input-token counts in the Part 1 logs, the culprit is usually not the main content but the things that ride along with it.

  • Conversation accumulation — the chatbot resends the entire conversation every time. The summarization and truncation techniques from LLM App Development Part 9 are cost techniques too.
  • RAG chunk excess — if you bumped up top_k and forgot, every request is that much heavier. The reranking from Advanced RAG Part 4 is a device that handles quality and cost at the same time.
  • Bloated tool results — JSON that an agent’s tools return wholesale. The result caps from AI Agent Development Part 4 do their job here too.

The point of this section is that the mechanisms you already built in other series were also cost mechanisms. There is little to build from scratch; most of the work is using the Part 1 instrumentation to find which one has slipped out of place.
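For readers who do not have those mechanisms to hand, here is a minimal sketch of two of them: a size cap on tool results and a window over recent conversation turns. It is not the implementation from the other series. The function names, the character-based cap, and the default limits are all assumptions chosen for illustration.

```python
import json

# Hypothetical limits; tune them against the per-feature token counts in your logs.
MAX_TOOL_RESULT_CHARS = 4_000
MAX_HISTORY_MESSAGES = 20


def cap_tool_result(result: object, max_chars: int = MAX_TOOL_RESULT_CHARS) -> str:
    """Serialize a tool result and cut it off at max_chars, noting what was dropped."""
    text = result if isinstance(result, str) else json.dumps(result, ensure_ascii=False)
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return text[:max_chars] + f"\n[truncated {omitted} characters]"


def keep_recent_turns(messages: list[dict], max_messages: int = MAX_HISTORY_MESSAGES) -> list[dict]:
    """Keep only the most recent messages, starting the window on a user turn."""
    if max_messages <= 0:
        return []
    recent = messages[-max_messages:]
    # Drop leading assistant messages so the model never sees a reply without its question.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent
```

A character cap is only a proxy for tokens. It is cheap to apply on every tool call, and the count_tokens check can confirm the effect on the assembled request.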

Output diet — tokens at 5x the unit price

Output costs 5x the unit price of input, so cutting a given number of output tokens saves 5x as much as cutting the same number of input tokens. There are two handles.

  • Direct the length in the prompt. Explicit instructions like “within 3 sentences” or “output JSON only, skip the explanation” are the most effective. Especially in features that use structured output, the preamble and elaboration the model adds are pure cost.
  • Use max_tokens as a safety net. max_tokens is a cap, not a parameter that steers length. Tighten it too much and the truncation we saw in Agent Part 1 (stop_reason: max_tokens) becomes more frequent, and the retries it triggers add cost instead of saving it. Keep it generous while watching the truncation rate in your Part 1 logs.
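Watching the truncation rate can be as simple as counting how often each feature stops on max_tokens. The sketch below assumes each log record is a dict with "feature" and "stop_reason" keys. That record shape is a guess at what the Part 1 logs contain, so adapt the field names to your own.

```python
from collections import Counter
from typing import Iterable


def truncation_rate_by_feature(records: Iterable[dict]) -> dict[str, float]:
    """Share of responses per feature that ended with stop_reason == "max_tokens"."""
    totals: Counter = Counter()
    truncated: Counter = Counter()
    for record in records:
        feature = record["feature"]
        totals[feature] += 1
        if record["stop_reason"] == "max_tokens":
            truncated[feature] += 1
    return {feature: truncated[feature] / totals[feature] for feature in totals}
```

A feature whose rate climbs after a max_tokens change is a sign the cap is now cutting real answers short rather than acting as a safety net.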

Model routing — the biggest lever

This is the main subject of this post. As we saw in the Part 1 price table, unit prices differ by as much as 5x between models, yet the tasks inside your app are not all equally hard. Classification, routing decisions, and summarization are fine on a smaller model, while complex reasoning and agent work need a big one. Sending each task to the model that fits it is model routing, and in most services this is where the biggest cost savings come from.

model_routing.py
ROUTES = {
    # simple tasks — fast, cheap model
    "classify_intent":  {"model": "claude-haiku-4-5",  "max_tokens": 256},
    "rewrite_query":    {"model": "claude-haiku-4-5",  "max_tokens": 300},
    "summarize_doc":    {"model": "claude-sonnet-4-6", "max_tokens": 1024},
    # core tasks — quality first
    "answer_question":  {"model": "claude-opus-4-8",   "max_tokens": 4096},
    "agent_task":       {"model": "claude-opus-4-8",   "max_tokens": 16000},
}

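def call(task: str, **kwargs):
    """Look up the route for a task and forward the request to the shared wrapper."""
    route = ROUTES[task]   # raises KeyError for a task with no route defined
    # call_llm is not defined in this snippet; it is the app's own API wrapper.
    return call_llm(model=route["model"], max_tokens=route["max_tokens"],
                    metadata={"feature": task}, **kwargs)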
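To get a feel for the size of this lever, here is a back-of-the-envelope calculation. The prices are hypothetical units per million tokens, chosen only to reproduce the two ratios mentioned in this post (output at 5x input, and a 5x gap between models). They are not real list prices, and the request volume and token counts are made up as well.

```python
# Hypothetical prices in arbitrary units per million tokens (not real list prices).
PRICES = {
    "small": {"input": 1, "output": 5},
    "large": {"input": 5, "output": 25},
}


def monthly_cost(model: str, requests: int, input_tokens: int, output_tokens: int) -> float:
    """Cost of `requests` calls that each use the given input and output token counts."""
    price = PRICES[model]
    per_request = input_tokens * price["input"] + output_tokens * price["output"]
    return requests * per_request / 1_000_000


# One million intent classifications, 500 input tokens and 20 output tokens each.
on_large = monthly_cost("large", 1_000_000, 500, 20)   # 3000.0 units
on_small = monthly_cost("small", 1_000_000, 500, 20)   # 600.0 units
```

Under these assumptions, moving the classification task to the small model cuts its bill to a fifth without touching the prompt, which is why routing tends to outweigh the input and output trimming above.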

There are three principles of routing design.

  • Set the quality bar by measurement. “Is haiku enough for this task?” is a question for an eval set, not for gut feeling. The evaluation pipeline from Advanced RAG Part 6 can be used as-is for model comparison. If accuracy is the same, the cheaper model is the right answer.
  • Find where you were already routing. Just as the query rewriting in Advanced RAG Part 4 was on haiku, auxiliary tasks are already candidates. Look at per-feature cost in the Part 1 logs and start by moving the simple tasks that are going to an expensive model.
  • Send borderline tasks upward. When in doubt, leave the task on the expensive model and move it down once you have enough measurements. The costly failure mode of routing is quality degradation, not overspending, and lost quality is the more expensive of the two to recover.
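The first and third principles can be combined into one small selection routine: try the candidates from cheapest to most expensive and take the first one that clears the quality bar. This is a simplified sketch, not the evaluation pipeline from Advanced RAG Part 6. The exact-match accuracy metric, the predict_with callback, and the function names are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence

Example = tuple[str, str]  # (input text, expected output)


def accuracy(predict: Callable[[str], str], eval_set: Sequence[Example]) -> float:
    """Exact-match accuracy of `predict` over the eval set."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    correct = sum(1 for text, expected in eval_set if predict(text) == expected)
    return correct / len(eval_set)


def cheapest_passing_model(
    models_cheapest_first: Sequence[str],
    predict_with: Callable[[str, str], str],  # (model, input text) -> output
    eval_set: Sequence[Example],
    min_accuracy: float,
) -> Optional[str]:
    """Return the first (cheapest) model that meets the bar, or None if none does."""
    for model in models_cheapest_first:
        score = accuracy(lambda text: predict_with(model, text), eval_set)
        if score >= min_accuracy:
            return model
    return None
```

A None result means no cheaper candidate passed, so the task stays on the model it is currently using.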

effort — the dial within a single model

There is one more dial you can turn without changing the model: the effort parameter. Agent Part 3 introduced it from the quality angle, but from an operations standpoint it is also a dial for token usage. Lower effort and the model thinks and answers more concisely, using fewer tokens; raise it and the opposite happens.

effort_by_task.py
# simple extraction — save tokens with low effort
response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    output_config={"effort": "low"},
    messages=[...],
)

In line with routing, putting effort next to the model in the task definition gathers all the dials in one place. But as we saw in Agent Part 3, for tasks where deep reasoning reduces the number of steps, such as agent tasks, lowering effort can actually raise total cost. Here too, the deciding criterion is measurement.
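One way to keep the dials together is an optional "effort" key in the route table that is translated into the output_config argument from effort_by_task.py. This is a sketch under assumptions: the "extract_fields" task and the request_kwargs helper are hypothetical, and effort is only set for routes that ask for it, since not every model may accept the parameter.

```python
ROUTES = {
    "classify_intent": {"model": "claude-haiku-4-5", "max_tokens": 256},
    # Hypothetical task: simple extraction on the large model with low effort.
    "extract_fields": {"model": "claude-opus-4-8", "max_tokens": 1024, "effort": "low"},
}


def request_kwargs(task: str) -> dict:
    """Build the model-related keyword arguments for one task's request."""
    route = ROUTES[task]
    kwargs = {"model": route["model"], "max_tokens": route["max_tokens"]}
    if "effort" in route:
        # Same shape as the output_config argument in effort_by_task.py.
        kwargs["output_config"] = {"effort": route["effort"]}
    return kwargs
```

The returned dict can be unpacked into the request call alongside the messages, so changing a task's model, cap, or effort stays a one-line edit to the table.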

Common stumbling blocks

  • Sending every task to one model — “just use the best model” is a demo-time habit. Spending 5x the unit price on a simple classification task adds up fast.
  • Downgrading a model without quality verification — cost shows up immediately; quality degradation shows up late. The order is eval-set comparison before downgrading, quality metric monitoring after.
  • Trying to shrink output with max_tokens — truncation is not savings, it is a defect. Direct length with the prompt and keep max_tokens as a safety net.

Wrapping up

In this post we worked through the cost levers in turn, from trimming input and output to the biggest one, model routing.

  • Measure with count_tokens before sending, and put the things that ride along with the input (conversation accumulation, RAG chunks, tool results) on a diet.
  • Output costs 5x the unit price. Direct length with the prompt.
  • The biggest lever is model routing by task difficulty. Set the quality bar by eval-set measurement, and tune once more with effort.

But there is one big chunk we have not touched yet: the system prompt and tool definitions that repeat identically on every request. In the next post, “LLM App Operations #3: Prompt Caching in Practice”, we make that repeated portion cost one tenth the price.
