LLM App Development #9: Conversation Memory and Context Management
As we saw in Part 2, the API stores no state, so to continue a conversation you send the full history every time. But as a conversation grows, this history keeps piling up, and two problems arise. The tokens you send on each call grow, raising cost, and eventually you hit the model's context limit, the cap on how much it can take in at once. In this post we look at how to manage this history.
The problem of piling-up history
At each turn of a conversation, a user message and an answer are added to messages. Ten exchanges make 20 messages, a hundred make 200. Every call resends the whole list, so sending the 100th question means sending all 99 prior exchanges along with it.
Two things cause problems here.
- Token cost — input tokens grow in proportion to the length of the history. The longer the conversation, the more each single answer costs.
- Context limit — each model has a cap on the tokens it can take at once. Once the history exceeds that limit, you can no longer send it.
So in a long conversation, you cannot leave the history as is; you must shrink it somehow. The methods are broadly trimming and summarizing.
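To make the cost growth concrete, here is a small sketch that adds up the input tokens sent over a whole conversation when the full history is resent on every call. The flat 50 tokens per message is an illustrative assumption, not a measured figure.

```python
def total_input_tokens(turns: int, tokens_per_message: int = 50) -> int:
    """Total input tokens sent over `turns` calls when the full history is resent each time."""
    total = 0
    history = 0  # number of messages stored before the current call
    for _ in range(turns):
        history += 1                           # the new user message
        total += history * tokens_per_message  # the whole history goes out as input
        history += 1                           # the answer is appended afterwards
    return total


for n in (10, 100):
    print(n, total_input_tokens(n))
```

Under this assumption, call number k sends 2k - 1 messages, so the total over N calls is N squared messages. A conversation ten times longer sends roughly a hundred times as many input tokens in total.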
Sliding window — keeping only the recent
The simplest method is to keep only the last few messages and drop the old ones. It is called a sliding window because you always look at only the most recent stretch, as if sliding a window along the conversation.
MAX_TURNS = 10  # keep only the last 10 turns (20 messages)

def trim(messages):
    # system is a separate parameter, so messages holds only user/assistant
    if len(messages) > MAX_TURNS * 2:
        return messages[-MAX_TURNS * 2:]
    return messages

It is easy to implement and the cost stays stable. The downside is that it forgets old conversation entirely. Even if “my name is Minsu” was said earlier, once it slides out of the window that fact is gone. So it fits light chatbots where only recent context matters.
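One detail worth checking in the trim function above: if it runs right after a new user message is appended, the list has an odd length and the slice can open with an assistant message. The sketch below is a variant that drops leading assistant messages so the window always opens on a user turn. It assumes the API expects the first message to have the user role, which you should confirm against the current API reference. The name trim_to_window is hypothetical.

```python
MAX_TURNS = 10  # keep at most the last 10 turns (20 messages)


def trim_to_window(messages, max_turns=MAX_TURNS):
    """Keep at most the last max_turns * 2 messages, opening on a user message.

    Assumes max_turns >= 1 and that each message is a dict with a "role" key.
    """
    window = messages[-max_turns * 2:]
    # Drop leading assistant messages so the window starts with a user turn.
    while window and window[0]["role"] != "user":
        window = window[1:]
    return window
```

The price is that the window is sometimes one message shorter than the maximum, which rarely matters for a light chatbot.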
Compressing with summaries
If dropping old conversation feels wasteful, summarize it instead of discarding it. You have Claude briefly summarize the earlier conversation into one chunk and place that summary at the front of the history.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def summarize(old_messages) -> str:
    # Flatten the old turns into plain "role: content" lines
    text = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Summarize the following conversation, keeping only the essentials needed to maintain context going forward:\n\n{text}",
        }],
    )
    # Return the first text block of the response
    return next(b.text for b in response.content if b.type == "text")

# summarize the old front, keep the recent conversation as is
summary = summarize(messages[:-10])
messages = [
    {"role": "user", "content": f"[Summary of earlier conversation] {summary}"},
] + messages[-10:]

This way, old information like the name “Minsu” survives in the summary. Tokens shrink too. In exchange, summarizing costs one extra call, and some detail may be lost in the process. It fits apps that continue a long consultation or task.
Letting the server compress for you
Instead of writing this compression yourself, you can have the Claude API handle it automatically on the server. When the context approaches the limit, the compaction feature summarizes the earlier content for you. It is currently offered in beta.
When you use compaction, append the whole response.content to the history rather than extracting and stacking only the answer text. If you store only the text, you lose the compaction state.

Whether to manage it yourself or leave it to the server depends on the app. If you want fine control over the behavior, write a sliding window or summary yourself; if you want to continue a long conversation conveniently, use compaction.
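The difference between storing the whole response.content and storing only the text can be sketched with plain dictionaries standing in for a real response. The "compaction" block type and its fields below are placeholders for illustration only, not the documented shape of the beta feature, and both helper names are hypothetical.

```python
# Stand-in for response.content: a list of content blocks.
# The "compaction" block and its fields are placeholders, not the real API shape.
fake_content = [
    {"type": "compaction", "content": "Summary of earlier turns..."},
    {"type": "text", "text": "Here is the answer."},
]


def append_full(history, content_blocks):
    """Store the assistant turn with every content block intact."""
    return history + [{"role": "assistant", "content": content_blocks}]


def append_text_only(history, content_blocks):
    """Store only the joined text; every non-text block is dropped."""
    text = "".join(b["text"] for b in content_blocks if b["type"] == "text")
    return history + [{"role": "assistant", "content": text}]


full = append_full([], fake_content)
text_only = append_text_only([], fake_content)
print(len(full[-1]["content"]))   # both blocks kept
print(text_only[-1]["content"])   # only the answer text remains
```

The text-only version is what many chat loops do by habit, which is why it is worth switching to the full-content form as soon as compaction is turned on.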
When to use which strategy
Which strategy to use depends on the situation.
- Sliding window — easy to implement, with stable cost. Fits light chatbots where only recent context matters. The downside is forgetting old information.
- Summary — keeps old information alive, but costs an extra summary call and loses some detail. Fits apps continuing a long consultation or task.
In practice the two are sometimes mixed. Keep the last few turns verbatim to preserve accuracy, and compress the older part into a single summary chunk. You get both the accuracy of recent conversation and the preservation of old context.
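A minimal sketch of this mixed approach follows. The summarizer is passed in as a function so the logic can be tried without an API call. The name compact_history and the default of 10 verbatim messages are assumptions chosen for illustration.

```python
from typing import Callable

KEEP_RECENT = 10  # number of most recent messages kept verbatim (assumed value)


def compact_history(messages, summarize: Callable[[list], str], keep_recent=KEEP_RECENT):
    """Summarize everything older than the last keep_recent messages.

    Assumes keep_recent >= 1. Short histories are returned unchanged,
    so no summary call is spent on them.
    """
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)
    summary_message = {
        "role": "user",
        "content": f"[Summary of earlier conversation] {summary}",
    }
    return [summary_message] + recent
```

In a real app you would pass a summarizer that calls the model. For a quick check, a stand-in such as lambda old: f"{len(old)} earlier messages" is enough to see the shape of the result.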
Either way, the yardstick is tokens. Measure how many tokens the history uses with the token counting covered in #12, and trim or summarize once it passes a set limit — that way you manage cost and the context limit together.
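The token-based trigger can be sketched as below. The counter here is a crude stand-in that assumes about four characters per token. In a real app you would swap in an actual token count from the API. Both function names are hypothetical.

```python
def rough_token_count(messages) -> int:
    """Crude estimate, assuming about 4 characters per token and string content."""
    return sum(len(m["content"]) for m in messages) // 4


def trim_to_budget(messages, max_tokens, count_tokens=rough_token_count):
    """Drop the oldest two messages at a time until the history fits the budget.

    The last two messages are always kept, even if they alone exceed the budget.
    """
    trimmed = list(messages)
    while len(trimmed) > 2 and count_tokens(trimmed) > max_tokens:
        trimmed = trimmed[2:]  # remove the oldest user/assistant pair
    return trimmed
```

The same threshold check could trigger a summary call instead of a plain drop. The point is that the decision is driven by measured tokens rather than by message count.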
How is this different from RAG
Since it can be confused with RAG from Part 8, let me note the difference. RAG finds and inserts relevant content from external documents, while the memory covered here manages the conversation so far. The two are used together: pull internal documents with RAG while compressing the growing conversation history with summaries.
Where people commonly trip up
- Piling up history indefinitely — without a shrinking mechanism, cost keeps rising and at some point the call fails due to the context limit. For conversations that may grow long, put a shrinking strategy in from the start.
- Trimming the system — system is a separate parameter, so it is not affected when you trim messages. If you put role guidance inside messages, it can get trimmed away, so keep guidance in system.
- Storing only text with compaction — when using compaction, putting only the text rather than the whole response.content into the history loses the compaction state.
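The point about keeping guidance in system can be sketched as a small request builder. The prompt text, the function name build_request, and the 20-message cap are all hypothetical choices for illustration.

```python
SYSTEM_PROMPT = "You are a helpful support assistant."  # hypothetical guidance text


def build_request(messages, max_messages=20):
    """Trim only the messages list; the system guidance travels separately."""
    return {
        "system": SYSTEM_PROMPT,
        "messages": messages[-max_messages:],
    }


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(30)
]
request = build_request(history)
print(len(request["messages"]), request["system"])
```

The resulting dictionary can be unpacked into the create call alongside the model and max_tokens arguments. However aggressively the messages are trimmed, the guidance is sent intact.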
Wrapping up
In this post we organized how to handle the history of a growing conversation.
- History cannot be kept piling up, because of token cost and the context limit.
- A sliding window keeps only the recent messages and is simple; summaries compress and preserve old information.
- Instead of managing it yourself, you can leave it to the server’s compaction feature.
So far we have covered single tool calls, retrieval, and memory. In the next post, “LLM App Development #10: Building an AI Agent,” we will tie these pieces together to build an agent that chooses its own tools and takes multiple steps to get work done.