LLM App Development #13: A Real-World Project — Internal Document Q&A Bot

Wednesday, June 10, 2026

5 min read

This is the final post in the series. It ties together the pieces covered so far into a Q&A bot that answers from internal documents: load company policies or a product manual, and employees can ask questions in natural language and get answers grounded in those documents. Most of the core techniques from this series go into this one app.

What we tie together

This bot is a combination of the pieces we built earlier.

Retrieval: split documents into chunks, embed them, and find the chunks closest to the question (Part 7, Part 8).
Grounded answers: answer only from the retrieved chunks, and say “I don’t know” when the answer is not in them (Part 4).
Streaming: show the answer as it is generated (Part 3).
Conversation memory: keep the history so follow-up questions work (Part 2, Part 9).

Preparation: indexing the documents

First, split the documents into chunks and embed them in advance. This indexing is done only when documents change, not on every question.

qa_index.py

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

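def chunk_text(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        # with overlap >= size the start position would never advance
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks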

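# split internal documents into chunks and embed them
documents = []
for name in ["policy.txt", "manual.txt"]:
    # UTF-8 is assumed here; the with-block closes each file after reading
    with open(name, encoding="utf-8") as fh:
        documents.append(fh.read())
chunks = [c for doc in documents for c in chunk_text(doc)]
# unit-length vectors make the dot product below equal to cosine similarity
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q                  # one similarity score per chunk
    ranked = np.argsort(scores)[::-1][:top_k]   # indices of the highest scores first
    return [chunks[i] for i in ranked]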
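
To see what the overlap does, here is a small sketch that runs the same sliding-window logic on a toy string. The tiny `size` and `overlap` values and the sample text are assumptions chosen only to make the output easy to read; they are not the settings used for the real index.

```python
def chunk_text(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Same sliding-window splitter as in qa_index.py."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# Toy values for illustration: 10-character windows that share 4 characters.
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, size=10, overlap=4)

for piece in pieces:
    print(piece)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
# yz
```

Each window starts 6 characters after the previous one (`size - overlap`), so a sentence cut at one boundary still appears whole in a neighboring chunk. Note the last piece, `yz`, is already fully contained in the chunk before it; with this loop a short redundant tail like that can appear at the end of a document.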

Building the bot

Now tie retrieval, the grounding prompt, streaming, and memory into a single function.

qa_bot.py

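import anthropic

from qa_index import retrieve  # importing also runs the indexing code in qa_index.py

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
messages = []  # conversation memory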

SYSTEM = "You are an assistant that answers questions about internal documents. Answer based only on the material provided, and if it is not in the material, reply 'Not found in the documents.'"

def ask(question: str) -> None:
    # 1) Retrieval: find chunks related to the question
    found = retrieve(question)
    context = "\n\n".join(f"<doc>{c}</doc>" for c in found)

    # 2) Grounding prompt: include the material and the question together
    user_content = f"<docs>\n{context}\n</docs>\n\nQuestion: {question}"
    messages.append({"role": "user", "content": user_content})

    # 3) Streaming: print the answer in real time while collecting it
    answer = ""
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=SYSTEM,
        messages=messages,
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            answer += text
    print()

    # 4) Memory: add the answer to the history
    messages.append({"role": "assistant", "content": answer})

# use interactively
ask("How many days of annual leave do I get?")
ask("Can I split those into half-days?")  # follow-up: the history tells the model what "those" means

Let's follow the flow. When a question arrives, we first retrieve the related chunks (1), then put that material and the question into the prompt together (2). The system prompt sets the rule “answer only from the material, and say so if the answer is not there,” which reduces hallucination but does not eliminate it. The answer is printed in real time via streaming while being collected (3), and once it finishes we add it to the history so the next question can build on it (4). This memory is why the model knows that “those” in the second question refers to annual leave. Retrieval itself only sees the current question, so a vague follow-up may retrieve weaker chunks; the chunks from earlier turns are still in the history, which helps in that case.
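
The four steps can also be traced offline. The sketch below replaces retrieval and the model call with stand-ins (`fake_retrieve` and `fake_model` are hypothetical helpers, and the leave policy text is made up), so it runs without an index or an API key and shows only how the history takes shape.

```python
# Stand-ins (hypothetical) so the flow runs without an index or an API key.
def fake_retrieve(question: str) -> list[str]:
    return ["Employees receive 15 days of annual leave per year."]  # made-up text

def fake_model(system: str, messages: list[dict]) -> str:
    return f"(answer to turn {len(messages) // 2 + 1})"

messages: list[dict] = []  # conversation memory

def ask_offline(question: str) -> str:
    found = fake_retrieve(question)                           # 1) retrieval
    context = "\n\n".join(f"<doc>{c}</doc>" for c in found)
    user_content = f"<docs>\n{context}\n</docs>\n\nQuestion: {question}"  # 2) grounding
    messages.append({"role": "user", "content": user_content})
    answer = fake_model("system prompt", messages)            # 3) model call (no streaming here)
    messages.append({"role": "assistant", "content": answer}) # 4) memory
    return answer

ask_offline("How many days of annual leave do I get?")
ask_offline("Can I split those into half-days?")

for m in messages:
    # show the role and the last line of each stored message
    print(m["role"], "->", m["content"].splitlines()[-1])
```

After two questions the history holds four entries that alternate between `user` and `assistant`. Each user entry carries its retrieved chunks as well as the question, so the history grows faster than the visible conversation suggests.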

Going further from here

This bot has the whole core structure, but to turn it into a real service you would add more of what we covered earlier.

Cost: when the system prompt and material repeat across requests, cut cost with prompt caching (Part 12). When the conversation grows long, compress the history with summaries (Part 9).
Scale: when the documents become numerous, use a vector database instead of comparing the question against every chunk (Part 7).
Quality: measure the effect of prompt changes with an eval set of questions and expected answers (Part 12).
Web: stream the answer to the browser with StreamingResponse instead of printing to the console (Part 3).
Extending behavior: if the bot must operate external systems beyond document search, extend it with tool calling and agents (Part 6, Part 10).
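
As one illustration of the quality item, here is a minimal sketch of an eval set and a pass-rate check. Everything in it is an assumption made for the example: the `EVAL_SET` cases, the substring-match scoring, and `stub_bot`, which stands in for the real bot so the code runs offline. Since `ask` in this post prints its answer and returns nothing, a real run would need a variant that returns the answer as a string.

```python
from typing import Callable

# Hypothetical eval set: each case pairs a question with a phrase the answer should contain.
EVAL_SET = [
    {"question": "How many days of annual leave do I get?", "expect": "15 days"},
    {"question": "What is the office dress code on Mars?", "expect": "Not found in the documents."},
]

def run_eval(answer_fn: Callable[[str], str], cases: list[dict]) -> float:
    """Return the fraction of cases whose answer contains the expected phrase."""
    passed = 0
    for case in cases:
        answer = answer_fn(case["question"])
        ok = case["expect"].lower() in answer.lower()
        print("PASS" if ok else "FAIL", "-", case["question"])
        passed += ok
    return passed / len(cases)

# Stand-in for the bot so the sketch runs offline.
def stub_bot(question: str) -> str:
    if "annual leave" in question:
        return "You get 15 days of annual leave per year."
    return "Not found in the documents."

score = run_eval(stub_bot, EVAL_SET)
print(f"pass rate: {score:.0%}")
```

Running the same set before and after a prompt change gives a number to compare instead of an impression. Substring matching is crude; it is used here only because it is easy to read, and including a question the documents cannot answer checks that the refusal rule still holds.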

Closing the series

At the very start, all we did was exchange a single sentence with one API key. From there we built the basics with message structure and parameters, streaming, prompts, and structured output; connected the app to the outside world with tool calling, RAG, memory, agents, and MCP; and laid the operational foundation with cost, evaluation, and observability. And in this post we tied all of that into a single working app.

The big picture of LLM app development is actually simple: gather good context, put it in the prompt, take the model’s answer, and connect it safely to code. Retrieval, memory, and structured output were each ultimately one part of this larger flow. By combining the pieces from this series, you can design an LLM app for your own problem.

This series used Claude as the reference, but most of the ideas covered carry over to other providers as is. Now decide what app you want to build, start from the smallest working version, and add the pieces one at a time. Thank you for following along.