#Claude
34 posts
LLM App Operations #7: Capstone — Taking the Document Q&A Bot to Production
We tie the five pillars of this series into an operations checklist and apply it to the document Q&A bot. Turning on instrumentation, routing, caching, batching, reliability, and security one by one, we watch how per-request cost and stability change, and close out the AI track that spans four series.
LLM App Operations #6: Security — Prompt Injection and Data Boundaries
Prompt injection is an attempt to change an app's behavior through input text, and in the era of RAG and agents it rides in through documents and tool results. We cover layered defenses instead of a single line, minimizing tool permissions, output validation, and the data boundaries of logging.
LLM App Operations #5: Reliability — Rate Limits, Retries, Fallbacks
429 and 529 are not outages; they are daily life. We build a structure that keeps running: how rate limits work (RPM and token limits), retries that respect retry-after, timeouts and streaming, and fallbacks for when nothing else works (model downgrade, queuing, graceful failure).
LLM App Operations #4: Batching — Half Price for Non-Urgent Work
Are you still sending work that does not need an immediate answer through the real-time API? The Batches API processes bulk requests asynchronously and, in return, discounts every token by 50%. We cover picking batch-worthy work, submitting and collecting results, and operational patterns.
LLM App Operations #3: Prompt Caching in Practice
Caching the system prompt and tool definitions that repeat on every request cuts the input cost of that portion to one tenth. We cover the prefix-matching principle, stable prefix design, cache_control placement, and an audit for silent cache invalidation.
LLM App Operations #2: Cost — Token Accounting and Model Routing
The biggest lever for cutting cost is model choice. We cover the order of operations for lowering cost while protecting quality: measuring with count_tokens before sending, putting output on a diet, routing models by task difficulty, and tuning effort.
LLM App Operations #1: Between Demo and Production — A Map of Operations
An LLM app that works and an LLM app you can operate are different things. We draw a map of operations along five axes — cost, latency, reliability, quality, and security — and build per-request instrumentation, the starting point for everything.
Advanced RAG #7: Capstone Project — Upgrading the Document Q&A Bot
We upgrade the internal document Q&A bot from LLM App Development Part 13 step by step with the techniques from this series. From measuring the baseline through swapping the chunking strategy and adding hybrid search, reranking, and citations, we watch how the metrics move at every step.
Advanced RAG #6: Building a RAG Evaluation Pipeline
We grow Part 1's baseline into a full evaluation system. Retrieval is scored with recall@k and MRR, and generation with an LLM judge. A single evaluation script, which also measures hallucination rate, runs as the regression test for every change.
Advanced RAG #5: Reducing Hallucinations with Citations
We tackle generation failures, where answers go wrong even though the right chunks were provided. We implement prompts that keep answers inside the evidence, the right to answer "I do not know," and per-sentence source attribution with the citations feature in Claude.
Advanced RAG #4: Query Transformation and Reranking
User questions are not good search queries. We reinforce both ends of retrieval: query rewriting that folds in conversation context, multi-query that asks from several angles, and reranking that precisely narrows a wide pool of candidates.
Advanced RAG #3: Hybrid Search — Combining Vectors and Keywords
Semantic search is weak on product codes and proper nouns, and keyword search is weak on synonyms. We build BM25 keyword search and fuse it with vector search via RRF, implementing hybrid search where each side covers the weaknesses of the other.