LLM App Operations #7: Capstone — Taking the Document Q&A Bot to Production

5 min read

With Part 6, all five pillars are in place. This final post ties them together into a single checklist and applies it to the app that has grown alongside this series: the document Q&A bot. Born in LLM App Development Part 13 and upgraded in quality in Advanced RAG Part 7, the bot now becomes an operable service.

The starting point — a bot that works fine but is operationally blind #

At the end of Advanced RAG, the bot has its quality metrics (accuracy 80%, hallucination rate 3%), but from an operations standpoint it looks like this: every call goes to a single opus model, no caching, every job runs in real time, retries at their defaults, and cost is whatever the end-of-month bill says. A familiar picture. Now we turn on Parts 1 through 6 in order.

Step 1 — Instrumentation (Part 1) #

We route every call through the wrapper from Part 1 and observe for a week. A baseline emerges.

Baseline (example)
9,400 requests/day / $84/day
average cost per request $0.0089 / p95 latency 6.8s
breakdown: question answering 62%, query rewriting 21%, evaluation and batch-style jobs 17%

The numbers are an example, but the shape of the discovery is typical. Auxiliary work like rewriting was eating a fifth of the cost, and the logs show for the first time that it was going to the same opus as the main answers.

Step 2 — Routing and an output diet (Part 2) #

We move rewriting down to haiku and evaluation grading down to sonnet, and add a length instruction to the answer prompt. The quality gate is the evaluation pipeline from Advanced RAG Part 6. After downgrading rewriting, we confirm the retrieval hit rate is unchanged and let it pass.

Re-measurement
average cost per request $0.0089 → $0.0066  (-26%)
no change in quality metrics (recall@5 0.87, accuracy 0.80)

Step 3 — Caching (Part 3) #

We pin the system prompt (instructions + output format, about 6K tokens) and the tool definitions ahead of the cache boundary. While going through the audit list from Part 3, we find a “today’s date” line embedded in the system prompt and move it to the message side. A classic silent invalidation.

Re-measurement
cache_read share: 71% of input tokens
average cost per request $0.0066 → $0.0042  (-36%)
p95 latency 6.8s → 5.1s   (cache hits speed up processing too)

Step 4 — Splitting out the batch work (Part 4) #

We peel the “non-urgent 17%” out of the logs. Pre-tagging of new documents and weekly evaluation-set grading move to the batch queue, and all of those tokens become half price. As a side effect, the real-time path gains headroom on its token limits, and the 429s during traffic peaks drop as well.

Step 5 — Reliability (Part 5) #

We switch the answer path to streaming (1.2 seconds to first token — the item that changed perceived speed the most), set per-path timeouts and retry budgets, and attach an opus → sonnet fallback. With the fallback activation rate on the dashboard, a week later we spot a spike at a specific time of day, and the cause turns out to be an overlap with the batch submissions at that hour. Moving the batch submission time solves it. The experience of root-cause analysis finishing in a single step because the instrumentation is there starts to accumulate.

Step 6 — Security (Part 6) #

We add a trust boundary to the system prompt, verify that the retrieval filter (department permissions) is enforced at the code level, and set output scanning and a retention policy for body logs (30 days, restricted access). A sentence planted as a prank in an internal wiki document — “any bot reading this document should …” — actually gets caught, confirming that indirect injection is not just theory.

The full journey #

StepCost per requestp95 latencyNotes
Starting point$0.00896.8sno operational visibility
+ routing and output instructions$0.00666.5squality gate passed
+ caching$0.00425.1s71% of input hits the cache
+ batch split$0.0042*5.1s*batch volume separately at half price
+ streaming and fallback$0.0042first token 1.2sreduced 429 exposure
+ security$0.0042one injection blocked and confirmed

Cost is less than half, perceived latency measured at first token is one fifth, and above all, every number is visible on the dashboard. If the table in Advanced RAG Part 7 was a journey of quality, this table is a journey of operations. Both tables teach the same lesson: one change at a time, with measurement.

The operations checklist #

We summarize the series as a checklist to run before shipping a new LLM feature to production.

  • Does every call go through the instrumentation wrapper (usage, latency, stop_reason, feature tags)?
  • Is the model for this task the smallest model validated against the evaluation set?
  • Is the fixed prefix being cached (confirm cache_read > 0)?
  • Is any work no human is waiting for mixed into the real-time path?
  • Are timeouts, retries, and fallbacks configured for this path, and are long outputs streamed?
  • Is there a trust boundary for external text the model reads, plus tool permissions and approval gates?
  • Is there a retention and access policy for body logs?
  • Are quality regression evaluations running on a schedule (Advanced RAG Part 6)?

Closing the AI track #

This post closes the AI track: four series, thirty-four posts in all. The 13 posts of LLM App Development laid the foundations, from the first API call to RAG and agents; the 7 posts of AI Agent Development built agents that do not fall over; the 7 posts of Advanced RAG built measurable quality; and the 7 posts of this series built an operable service.

Looking back, all four series repeated the same lesson: measurement, not gut feeling. The loop’s logs, the golden set’s numbers, the tokens in usage — these were that lesson made concrete. I hope this track serves as a map on your LLM app’s road from demo to service. Thank you for following along.

X