AI

LLM App Operations #7: Capstone — Taking the Document Q&A Bot to Production
5 min read

We tie the five pillars of this series into an operations checklist and apply it to the document Q&A bot. Turning on instrumentation, routing, caching, batching, reliability, and security one by one, we watch how per-request cost and stability change, and close out the AI track that spans four series.

LLM App Operations #6: Security — Prompt Injection and Data Boundaries
6 min read

Prompt injection is an attempt to change an app's behavior through input text, and in the era of RAG and agents it rides in through documents and tool results. We cover layered defenses rather than a single line of defense, minimizing tool permissions, output validation, and the data boundaries of logging.

LLM App Operations #5: Reliability — Rate Limits, Retries, Fallbacks
6 min read

429 and 529 responses are not outages; they are daily life. We build a structure that keeps running: how rate limits work (RPM and token limits), retries that respect the retry-after header, timeouts and streaming, and fallbacks for when nothing else works (model downgrade, queuing, graceful failure).
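
As a rough illustration of "retries that respect retry-after", here is a minimal sketch. It is not the code from the post: `RetryableError`, `call_with_retries`, and all parameter names and defaults are hypothetical, and a real client would map its own 429/529 exceptions onto something like this. When the server supplies a retry-after value we wait that long; otherwise we fall back to exponential backoff with jitter.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical stand-in for a 429/529-style failure.

    retry_after is the server-suggested wait in seconds, or None if absent.
    """

    def __init__(self, status, retry_after=None):
        super().__init__(f"retryable status {status}")
        self.status = status
        self.retry_after = retry_after


def call_with_retries(send, max_attempts=4, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() and retry on RetryableError, honoring retry_after when present."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except RetryableError as err:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller trigger a fallback
            if err.retry_after is not None:
                delay = err.retry_after  # the server told us how long to wait
            else:
                # Exponential backoff (1s, 2s, 4s, ...) capped at max_delay,
                # scaled by random jitter so clients do not retry in lockstep.
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                delay *= random.uniform(0.5, 1.0)
            sleep(delay)


# Demo with a fake endpoint and a recording sleep, so nothing actually waits.
responses = iter([RetryableError(429, retry_after=2), RetryableError(529), "ok"])
recorded_sleeps = []


def fake_send():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


result = call_with_retries(fake_send, sleep=recorded_sleeps.append)
```

Passing `sleep` as a parameter is only a convenience for testing; in production the default `time.sleep` (or an async equivalent) would be used, and the final `raise` is where a fallback such as a model downgrade or queuing would take over.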

LLM App Operations #4: Batching — Half Price for Non-Urgent Work
5 min read

Are you still sending work that does not need an immediate answer through the real-time API? The Batches API processes bulk requests asynchronously in exchange for a 50% discount on every token. We cover picking batch-worthy work, submitting and collecting, and operational patterns.

LLM App Operations #3: Prompt Caching in Practice
6 min read

Caching the system prompt and tool definitions that repeat on every request cuts the input cost of that portion to one tenth. We cover the prefix-matching principle, stable prefix design, cache_control placement, and an audit for silent cache invalidation.

LLM App Operations #2: Cost — Token Accounting and Model Routing
6 min read

The biggest lever for cutting cost is model choice. We cover measuring with count_tokens before sending, putting output on a diet, designing model routing by task difficulty, and tuning effort, in the order that lowers cost while protecting quality.

LLM App Operations #1: Between Demo and Production — A Map of Operations
5 min read

An LLM app that works and an LLM app you can operate are different things. We draw a map of operations along five axes — cost, latency, reliability, quality, and security — and build per-request instrumentation, the starting point for everything.

Advanced RAG #7: Capstone Project — Upgrading the Document Q&A Bot
6 min read

Upgrade the internal document Q&A bot from LLM App Development Part 13 step by step with the techniques from this series. From measuring the baseline through swapping chunking, hybrid search, reranking, and citations, we watch how the metrics move at every step.

Advanced RAG #6: Building a RAG Evaluation Pipeline
6 min read

We grow Part 1's baseline into a full evaluation system. Retrieval is scored with recall@k and MRR, generation with an LLM judge, and a single evaluation script that also measures hallucination rate runs as the regression test for every change.
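
For readers unfamiliar with the retrieval metrics named here, the following is a small sketch of recall@k and MRR using their standard definitions. The function names and the toy document IDs are illustrative assumptions, not taken from the post's evaluation script.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Toy evaluation set: two queries with hand-labeled relevant documents.
runs = [
    (["a", "b", "c"], ["b"]),       # first relevant hit at rank 2
    (["x", "y", "z"], ["x", "z"]),  # first relevant hit at rank 1
]
recalls = [recall_at_k(ret, rel, k=2) for ret, rel in runs]
mrr = mean_reciprocal_rank(runs)
```

Recall@k rewards finding all of the evidence, while MRR rewards putting the first useful chunk near the top, so tracking both can show whether a change helped coverage, ordering, or neither.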

Advanced RAG #5: Reducing Hallucinations with Citations
5 min read

We tackle generation failures, where answers go wrong even though the right chunks were provided. We implement prompts that keep answers inside the evidence, the right to answer "I do not know," and per-sentence source attribution with the citations feature in Claude.

Advanced RAG #4: Query Transformation and Reranking
6 min read

User questions are not good search queries. We reinforce both ends of retrieval: query rewriting that folds in conversation context, multi-query that asks from several angles, and reranking that precisely narrows a wide pool of candidates.

Advanced RAG #3: Hybrid Search — Combining Vectors and Keywords
2 min read

Semantic search is weak on product codes and proper nouns, and keyword search is weak on synonyms. We build BM25 keyword search and fuse it with vector search via RRF, implementing hybrid search where each side covers the weaknesses of the other.
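
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion (RRF). It assumes each retriever has already produced a ranked list of document IDs; the function name, the example IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration rather than the post's implementation.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores 1 / (k + rank) for every list it appears in,
    so only rank positions matter, not the raw BM25 or cosine scores.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID to keep output deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from the two retrievers.
bm25_ranking = ["doc_sku_123", "doc_b", "doc_a"]    # keyword search
vector_ranking = ["doc_a", "doc_sku_123", "doc_c"]  # semantic search

fused = rrf_fuse([bm25_ranking, vector_ranking])
```

Because RRF works on ranks alone, it sidesteps the problem of normalizing BM25 scores against vector similarities, which live on unrelated scales. In this toy example the documents that both retrievers found rise above those that only one of them returned.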