LLM App Operations #1: Between Demo and Production — A Map of Operations

Thursday, June 25, 2026


In LLM App Development we built features, and in AI Agent Development and Advanced RAG we raised quality. But a feature that works and a service you can operate are different problems. As users grow, the questions begin: where is the cost leaking, why are responses occasionally slow, what should we do when the API returns a rate limit? This series covers those questions — the operations of an LLM app — across 7 parts.

The five axes of operations

The concerns of LLM app operations come down to five things. This is also the map of this series.

Axis	Question	Covered in
Cost	How much does one request cost, and where do we cut it?	Part 2 (accounting, routing), Part 3 (caching), Part 4 (batching)
Latency	How long does the user wait?	Part 3 (caching), Part 5 (timeouts)
Reliability	How do we hold up against limits and failures?	Part 5
Quality	Are answers getting better or worse?	Part 7 (evaluation as a routine)
Security	Who is trying to steer my app?	Part 6

This looks like ordinary backend operations, but the center of gravity is different. A typical API call costs effectively the same every time, while the cost of an LLM call depends on its token counts and can differ between requests by a factor of hundreds. Two other things are peculiar to LLMs: a response can legitimately take tens of seconds, and text embedded in the input can alter the app’s behavior. So the operational tooling has to be rebuilt for LLMs as well.

The starting point for everything: per-request instrumentation

Whichever of the five axes you improve, the prerequisite is the same: knowing how much you are spending right now, per request. Fortunately the raw material is in every response: the usage field.

usage_logging.py

import json
import logging
import time

import anthropic

logger = logging.getLogger("llm")
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_llm(feature: str = "unknown", **kwargs):
    """Call the Messages API and log usage, latency, and stop_reason as one JSON line."""
    # `feature` is a wrapper-only tag; it is kept out of the API request
    start = time.monotonic()
    response = client.messages.create(**kwargs)
    latency = time.monotonic() - start
    u = response.usage
    logger.info(json.dumps({
        "request_id": response._request_id,        # request ID assigned by the API
        "model": response.model,
        "input_tokens": u.input_tokens,
        "output_tokens": u.output_tokens,
        "cache_read": u.cache_read_input_tokens,    # used in earnest in Part 3
        "cache_write": u.cache_creation_input_tokens,
        "latency_s": round(latency, 2),
        "stop_reason": response.stop_reason,
        "feature": feature,
    }))
    return response

Routing every call through this single wrapper is the entire exercise of Part 1. There are three points worth noting.

Keep the request_id. The response’s _request_id is the tracing identifier on the API side. When you report an outage or an abnormal response, you need this value to connect the logs on both ends.
Attach a feature tag. “Total cost went up” is not information, but “the summarization feature’s cost went up” is. You need a tag on every call saying which feature it belongs to before you can split cost by feature.
Record stop_reason too. The rate of responses cut off at max_tokens and the refusal rate are early signals of quality problems.
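
To make the third point concrete, here is a minimal sketch of turning logged stop_reason values into a truncation rate and a refusal rate. It assumes each log line is a bare JSON object like the one the wrapper writes (no formatter prefix), and that a refusal shows up as the stop_reason value "refusal"; the helper name stop_reason_rates and the sample lines are hypothetical.

```python
import json
from collections import Counter

def stop_reason_rates(log_lines):
    """Return each stop_reason's share of all logged requests."""
    counts = Counter(json.loads(line)["stop_reason"] for line in log_lines)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {reason: n / total for reason, n in counts.items()}

# Hypothetical sample: four log lines shaped like the wrapper's output.
sample = [
    json.dumps({"feature": "summarize", "stop_reason": "end_turn"}),
    json.dumps({"feature": "summarize", "stop_reason": "max_tokens"}),
    json.dumps({"feature": "chat", "stop_reason": "end_turn"}),
    json.dumps({"feature": "chat", "stop_reason": "refusal"}),
]

rates = stop_reason_rates(sample)
print("truncation rate:", rates.get("max_tokens", 0.0))
print("refusal rate:", rates.get("refusal", 0.0))
```

Filtering the lines by feature before calling the helper gives the same rates per feature.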

Tokens are money: the structure of pricing

You need a conversion table that turns the tokens piling up in your logs into money. The structure to remember has two parts: pricing differs by model, and output costs more than input.

Model	Input (per 1M tokens)	Output (per 1M tokens)
claude-opus-4-8	$5.00	$25.00
claude-sonnet-4-6	$3.00	$15.00
claude-haiku-4-5	$1.00	$5.00

(These change over time, so treat the official price list as the source of truth.) The asymmetry, output priced at 5x input, is the foundation of operational intuition. For a typical RAG request with 10,000 input tokens and 500 output tokens, input is about 95% of the tokens but output is 20% of the cost. And if a request that now runs on opus can be handled by haiku, the cost drops to one fifth (one third if it now runs on sonnet). We pull these two levers (output length, model choice) in earnest in Part 2.
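
The arithmetic behind that example can be checked in a few lines. This is a sketch that uses only the prices from the table above and the 10,000 / 500 token split; the variable names are illustrative.

```python
# Dollars per 1M tokens (input, output), copied from the table above.
PRICE = {
    "claude-opus-4-8": (5.00, 25.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
}
input_tokens, output_tokens = 10_000, 500

# Share of tokens that are input: the same for every model.
token_share_input = input_tokens / (input_tokens + output_tokens)
print(f"input share of tokens: {token_share_input:.1%}")

costs = {}
for model, (inp, out) in PRICE.items():
    input_cost = input_tokens / 1e6 * inp
    output_cost = output_tokens / 1e6 * out
    costs[model] = input_cost + output_cost
    output_share = output_cost / costs[model]
    print(f"{model}: ${costs[model]:.4f} per request, output is {output_share:.1%} of cost")

# Relative cost of moving the same request down to haiku.
haiku_vs_opus = costs["claude-haiku-4-5"] / costs["claude-opus-4-8"]
haiku_vs_sonnet = costs["claude-haiku-4-5"] / costs["claude-sonnet-4-6"]
```

Because every row has the same 5x ratio, the output share of cost comes out the same for all three models; only the absolute cost changes.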

cost_per_request.py

PRICE = {  # dollars per 1M tokens (input, output)
    "claude-opus-4-8": (5.00, 25.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICE[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

Attach this function to your log pipeline and you get a “daily cost by feature” graph. It is the first panel of your operations dashboard.
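
One possible shape for that aggregation is sketched below. It assumes the log lines have already been parsed into dicts and that each record carries a date field, which the wrapper's JSON does not include by itself (it would come from the logging timestamp); daily_cost_by_feature and the sample records are hypothetical. PRICE and cost_usd are repeated so the sketch runs on its own, and like the function above it ignores cache token pricing.

```python
from collections import defaultdict

PRICE = {  # dollars per 1M tokens (input, output)
    "claude-opus-4-8": (5.00, 25.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICE[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

def daily_cost_by_feature(records):
    """Sum request cost into {(date, feature): dollars}."""
    totals = defaultdict(float)
    for r in records:
        key = (r["date"], r["feature"])
        totals[key] += cost_usd(r["model"], r["input_tokens"], r["output_tokens"])
    return dict(totals)

# Hypothetical parsed log records; "date" is an assumed field.
records = [
    {"date": "2026-06-24", "feature": "summarize", "model": "claude-sonnet-4-6",
     "input_tokens": 10_000, "output_tokens": 500},
    {"date": "2026-06-24", "feature": "summarize", "model": "claude-haiku-4-5",
     "input_tokens": 10_000, "output_tokens": 500},
    {"date": "2026-06-24", "feature": "chat", "model": "claude-haiku-4-5",
     "input_tokens": 2_000, "output_tokens": 300},
]

totals = daily_cost_by_feature(records)
for (date, feature), dollars in sorted(totals.items()):
    print(f"{date} {feature}: ${dollars:.4f}")
```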

The baseline: building something to compare against

Just as we built a baseline for quality improvement in Advanced RAG Part 1, operations needs a baseline too. Run the instrumentation for just a week and numbers like these are in your hands.

Mean and 95th-percentile cost per request, by feature
Mean and 95th-percentile latency per request
Daily token totals and cost
stop_reason distribution (truncation rate, refusal rate)
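
As a rough sketch, the first two numbers in this list could be computed from parsed log records as follows. The nearest-rank percentile is one of several common definitions and is chosen here only for simplicity; the helper names p95 and baseline and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def p95(values):
    """95th percentile by the nearest-rank method (values must be non-empty)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]

def baseline(records, metric):
    """Return {feature: {"mean": ..., "p95": ...}} for one numeric field."""
    by_feature = defaultdict(list)
    for r in records:
        by_feature[r["feature"]].append(r[metric])
    return {f: {"mean": mean(v), "p95": p95(v)} for f, v in by_feature.items()}

# Hypothetical sample: 18 fast requests and 2 slow ones for a single feature.
records = [{"feature": "summarize", "latency_s": 1.0} for _ in range(18)]
records += [{"feature": "summarize", "latency_s": 30.0} for _ in range(2)]

stats = baseline(records, "latency_s")
print(stats["summarize"])
```

The same call with a per-request cost field in place of latency_s gives the cost half of the baseline.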

Every technique in the rest of the series (caching, batching, routing) is judged by comparison against this baseline. It is the foundation that lets you say “input cost went down 41%” instead of “caching seems to have helped.”
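
A statement like that is just a percent change against the baseline figure. A minimal sketch, with made-up before and after numbers and a hypothetical helper name:

```python
def pct_change(baseline_value: float, current_value: float) -> float:
    """Signed percent change relative to the baseline (negative means a reduction)."""
    if baseline_value == 0:
        raise ValueError("baseline must be non-zero")
    return (current_value - baseline_value) / baseline_value * 100

# Hypothetical daily input cost in dollars, before and after a change.
before, after = 100.0, 59.0
change = pct_change(before, after)
print(f"input cost changed by {change:+.0f}%")
```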

Where people commonly trip up

Seeing cost for the first time on the monthly bill — the bill does not tell you where it leaked. You need per-request usage logging to trace the cause.
Call code scattered everywhere — if even one call site lacks instrumentation, that feature is out of sight. Gather all call paths into the wrapper function.
Looking only at averages — LLM cost and latency have long tails. Even when the average is stable, if the 95th percentile is swinging, some users are already having a bad experience.

Wrapping up

In this post we drew the map of operations and laid the foundation of instrumentation.

The concerns of operations are five axes: cost, latency, reliability, quality, and security. The biggest difference from an ordinary backend is that an LLM costs differently per request.
Gather every call into a wrapper and record usage, latency, stop_reason, and a feature tag per request.
Know the structure of pricing (differences between models, output at 5x input) and build a baseline with one week of instrumentation.

In the next post, “LLM App Operations #2: Cost — Token Accounting and Model Routing,” we start actually cutting cost on top of this instrumentation, by systematizing the biggest lever — model choice — by task difficulty.