LLM App Development #12: Cost, Evaluation, and Observability

Up through Part 11 we covered how to build features. To actually run the app you built, you need three more things: estimating and reducing cost, measuring answer quality, and seeing what the app is doing. This post lays out that operational foundation.

Gauging token cost

LLM cost scales with the number of tokens sent and received, with input and output tokens priced separately. To check how many tokens an input will use before you make the call, use the token counting feature.

count_tokens.py
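from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Read the document up front so the file handle is closed before the call.
document = Path("long_document.txt").read_text(encoding="utf-8")

# Count input tokens without generating a response.
result = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": document}],
)
print(result.input_tokens)  # input token count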

Token counts can vary by model, so specify the model you will actually use when measuring. After a response arrives, check the actual usage in response.usage: its input_tokens and output_tokens fields let you track what a single call cost.
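To turn those token counts into money, a small helper is enough. The sketch below is only an illustration: the Pricing class, the estimate_cost function, and the prices are all made up for the example, so substitute the published per-million-token prices for the model you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price per million tokens in USD, supplied by the caller."""
    input_per_mtok: float
    output_per_mtok: float


def estimate_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Return the estimated USD cost of one call from its token usage."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


# Made-up prices for illustration only; look up the real ones for your model.
example_pricing = Pricing(input_per_mtok=2.0, output_per_mtok=10.0)
cost = estimate_cost(input_tokens=12_000, output_tokens=800, pricing=example_pricing)
print(f"${cost:.4f}")
```

In a real app you would pass response.usage.input_tokens and response.usage.output_tokens instead of the literal numbers.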

Cutting cost with prompt caching

When the same content repeats at the front of every call, such as the shared documents in RAG or a long system prompt, caching can cut cost substantially. Once a part is cached, it is processed much more cheaply on the next call.

prompt_caching.py
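import anthropic

client = anthropic.Anthropic()

# Placeholder: load your fixed guidance or shared documents here.
large_shared_context = "..."

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_shared_context,  # the same large context every time
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": "Question"}],
)

print(response.usage.cache_creation_input_tokens)  # tokens written to the cache
print(response.usage.cache_read_input_tokens)  # tokens read from cache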

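Caching applies only when the front of the prompt is identical byte for byte. So put unchanging content (fixed guidance, shared documents) at the front, and put content that changes every time (the question, the current time) at the back. To check whether it works, look at usage.cache_read_input_tokens. The first call writes the cache, so a 0 there is expected. If the value stays 0 on repeated calls made shortly afterward, it is a sign that something changing every time is mixed into the front and breaking the cache, or that the cached part is shorter than the minimum length the model will cache.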
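One way to keep the front stable is to build every request through a single function that decides what goes where. The following is a minimal sketch, not a prescribed pattern: build_request is a hypothetical helper, and putting the current time into the user message is just one possible choice for volatile data.

```python
from datetime import datetime, timezone


def build_request(shared_context: str, question: str, now: datetime) -> dict:
    """Build keyword arguments for client.messages.create with a cache-friendly layout."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        # Stable prefix: identical on every call, so it can be cached.
        "system": [
            {
                "type": "text",
                "text": shared_context,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Volatile content (time, question) goes after the cached prefix.
        "messages": [
            {
                "role": "user",
                "content": f"Current time: {now.isoformat()}\n\n{question}",
            }
        ],
    }


first = build_request("Shared guidance...", "Question A", datetime(2025, 1, 1, tzinfo=timezone.utc))
second = build_request("Shared guidance...", "Question B", datetime(2025, 1, 2, tzinfo=timezone.utc))
print(first["system"] == second["system"])  # the cacheable part is unchanged
```

The result can be passed on as client.messages.create(**build_request(...)). Because the time only ever appears in the user message, two calls with different times still share the same system block.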

Controlling cost with the model

Let's revisit the three tiers from Part 2 from a cost perspective. You do not need to use the most capable model for every call. Splitting calls across tiers by the nature of the task can cut cost considerably.

  • simple classification, short extraction → the cheapest Haiku
  • most real-world work → the balanced Sonnet
  • demanding reasoning, long agent work → the most capable Opus

Even within one app you can use different models per step. For example, classify with Haiku and generate the final answer with Sonnet.
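Per-step model selection can be as simple as a lookup table. The sketch below assumes task labels of my own invention (classification, extraction, reasoning, agent), and the Haiku and Opus model IDs are deliberately left as placeholders to fill in with the IDs you actually use.

```python
# Placeholder model IDs: only the Sonnet ID comes from the examples in this post.
MODEL_BY_TIER = {
    "haiku": "<your-haiku-model-id>",
    "sonnet": "claude-sonnet-4-6",
    "opus": "<your-opus-model-id>",
}

# Hypothetical task labels mapped to tiers, following the list above.
TIER_BY_TASK = {
    "classification": "haiku",
    "extraction": "haiku",
    "reasoning": "opus",
    "agent": "opus",
}


def pick_model(task: str) -> str:
    """Return the model ID for a task, defaulting to the balanced tier."""
    tier = TIER_BY_TASK.get(task, "sonnet")
    return MODEL_BY_TIER[tier]


print(pick_model("classification"))
print(pick_model("final_answer"))  # unknown tasks fall back to Sonnet
```

Defaulting to the middle tier means a task you forgot to label is neither silently under-served nor billed at the top rate.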

Evaluating answer quality

How do you know an answer improved when you changed a prompt? Looking at a few by eye is not enough. One way to automate evaluation is to have another LLM call do the grading. This is called LLM-as-judge.

llm_judge.py
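import anthropic

client = anthropic.Anthropic()


def judge(question: str, answer: str) -> str:
    """Grade an answer with a second model call; expected to return 'good' or 'bad'."""
    prompt = f"""Evaluate whether the answer below is accurate and helpful for the question.
Reply with only one word: 'good' or 'bad'.

Question: {question}
Answer: {answer}"""

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    # Take the first text block and normalize case and whitespace for aggregation.
    text = next(b.text for b in response.content if b.type == "text")
    return text.strip().lower()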

Build a set of questions and expected answers (an eval set) in advance, and run this evaluation whenever you change a prompt or model. That lets you compare numerically whether the change raised or lowered quality. Receiving the grading result as structured output (Part 5) makes aggregation even easier.
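Running an eval set comes down to a loop and a ratio. The sketch below is one possible shape, with names of my own choosing: run_eval takes the answering function and the judging function as parameters, and the two fake functions are stand-ins so the example runs without any API call. Each case holds only a question here; an expected answer could be added and passed to the judge as well.

```python
from typing import Callable


def run_eval(
    eval_set: list[dict],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str], str],
) -> float:
    """Return the fraction of eval questions whose answer is judged 'good'."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    good = 0
    for case in eval_set:
        answer = answer_fn(case["question"])
        verdict = judge_fn(case["question"], answer).strip().lower()
        if verdict == "good":
            good += 1
    return good / len(eval_set)


# Stand-ins so the sketch runs offline; swap in real model calls.
def fake_answer(question: str) -> str:
    return "4" if question == "What is 2 + 2?" else "I don't know"


def fake_judge(question: str, answer: str) -> str:
    return "bad" if answer == "I don't know" else "good"


eval_set = [
    {"question": "What is 2 + 2?"},
    {"question": "What is the capital of France?"},
]
score = run_eval(eval_set, fake_answer, fake_judge)
print(f"pass rate: {score:.0%}")
```

Run it once with the old prompt and once with the new one, and compare the two pass rates instead of comparing impressions.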

Looking into behavior

Since an LLM app can give different answers to the same input, you need records to see what happened when a problem arises. At minimum, keep the input prompt, the answer received, and the tokens used. It also helps to record the request identifier, in case you report an issue to Anthropic.

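request_logging.py
import json
import logging

import anthropic

logging.basicConfig(level=logging.INFO)
client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Question"}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=messages,
)

# One JSON object per call keeps the records easy to aggregate later.
logging.info(json.dumps({
    "request_id": response._request_id,            # identifier for tracing issues
    "input_tokens": response.usage.input_tokens,
    "output_tokens": response.usage.output_tokens,
    "stop_reason": response.stop_reason,
}))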

With records like these you can watch the cost trend and track how often answers are cut off (a stop_reason of max_tokens) or whether certain inputs cause problems.
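As a rough illustration of what such tracking can look like, the sketch below aggregates records with the same four fields as the log entry above. The summarize function and the sample records are invented for the example.

```python
def summarize(records: list[dict]) -> dict:
    """Aggregate per-call log records into totals and a truncation rate."""
    total_in = sum(r["input_tokens"] for r in records)
    total_out = sum(r["output_tokens"] for r in records)
    # A stop_reason of "max_tokens" means the answer hit the output limit.
    truncated = [r["request_id"] for r in records if r["stop_reason"] == "max_tokens"]
    return {
        "calls": len(records),
        "input_tokens": total_in,
        "output_tokens": total_out,
        "truncated_rate": len(truncated) / len(records) if records else 0.0,
        "truncated_request_ids": truncated,
    }


# Made-up records for illustration.
records = [
    {"request_id": "req_a", "input_tokens": 900, "output_tokens": 1024, "stop_reason": "max_tokens"},
    {"request_id": "req_b", "input_tokens": 400, "output_tokens": 120, "stop_reason": "end_turn"},
]
summary = summarize(records)
print(summary)
```

A rising truncated_rate suggests max_tokens is set too low for the task, and the listed request IDs tell you which calls to inspect first.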

Cutting more with batches

For large jobs that are not urgent, the Batch API cuts token charges by 50%. Instead of getting real-time responses, you hand off many requests at once and receive the results later (usually within an hour).

batch.py
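import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

client = anthropic.Anthropic()
texts = ["First review", "Second review"]  # placeholder inputs

# Submit every request in one batch; custom_id links each result to its input.
batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id=f"item-{i}",
            params=MessageCreateParamsNonStreaming(
                model="claude-sonnet-4-6",
                max_tokens=256,
                messages=[{"role": "user", "content": text}],
            ),
        )
        for i, text in enumerate(texts)
    ]
)
print(batch.id)  # keep this ID to retrieve the results later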

It fits work that does not need an immediate answer, like classifying thousands of reviews or summarizing documents in bulk. It does not fit interactive calls where a user is waiting, but for background batch processing it saves a lot of cost.
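Receiving the results later means polling for completion and then reading the output. The sketch below shows one way to do that, assuming the Python SDK's batches.retrieve and batches.results methods and the processing_status and result.type fields as I understand them; verify the names against the SDK version you have installed. The wait_for_batch and collect_results helpers are hypothetical.

```python
import time


def wait_for_batch(client, batch_id: str, poll_seconds: float = 60.0, sleep=time.sleep):
    """Poll until the batch has finished processing, then return it."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        sleep(poll_seconds)


def collect_results(client, batch_id: str) -> dict:
    """Map each custom_id to its answer text, or None if that request did not succeed."""
    answers = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            blocks = entry.result.message.content
            answers[entry.custom_id] = next(
                (b.text for b in blocks if b.type == "text"), None
            )
        else:
            answers[entry.custom_id] = None  # errored, canceled, or expired
    return answers
```

With the batch created earlier, usage would be wait_for_batch(client, batch.id) followed by answers = collect_results(client, batch.id). Passing sleep as a parameter keeps the polling loop easy to test without real waiting.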

Where people commonly trip up

  • Not noticing the cache break: if the current time or a random value goes into the front of system, the cache breaks on every call. Check whether cache_read_input_tokens stays at 0.
  • Editing prompts without evaluation: if you change a prompt by gut feel without an eval set, you miss that one case improved while another got worse. Keep an eval set, even a small one.
  • Recording nothing: without records, reproducing and tracing a problem is hard. Keep at least minimal usage data and request identifiers.

Wrapping up

In this post we covered the cost, evaluation, and observability practices needed to operate an app.

  • Measure tokens in advance (count_tokens), cut the repeated front with caching, and pick the model tier that fits the task.
  • Use LLM-as-judge and an eval set to see numerically how a change affected quality.
  • Record usage and request identifiers to track cost and problems.

Now the foundations of both features and operations are in place. In the final post, “LLM App Development #13: A Real-World Project,” we tie all the pieces so far together and build, from start to finish, a Q&A bot that answers from internal documents.
