AI Agent Development #4: Context Management for Long-Running Work
In Part 3 we built an agent that plans and verifies multi-step work. But once a task runs past a few dozen steps, a new problem appears: every turn of the loop appends to messages, and eventually you hit the limit of the context window. Trouble starts well before the limit, too. The longer the input, the more each call costs, and as stale information piles up, the model's judgment suffers. In this post we cover context management for surviving long-running work.
In Part 9 of LLM App Development we covered conversation memory for chatbots. That was about “remembering the conversation with the user”; this post is about “coping with the tool results that pile up within a single task.” Both are context management problems, but the target is different.
Seeing what occupies the context #
In an agent conversation, most of the bulk is neither user messages nor Claude’s answers but tool results. Dozens of search hits, whole file contents, and API response JSON pile up at every step. So the first principle of context management is simple. Manage the tool results and most of the problem is solved.
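Before optimizing anything, it helps to measure. The following is a minimal sketch, not part of the original agent code, that tallies roughly how many characters each kind of content contributes to messages. The function name context_breakdown is hypothetical, character counts are only a rough proxy for tokens, and blocks are assumed to be plain dicts (SDK objects are lumped under "other").

```python
import json
from collections import Counter


def context_breakdown(messages: list) -> Counter:
    """Tally approximate character counts per content category."""
    sizes = Counter()
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):
            # Plain string content counts as the speaker's text.
            sizes[f"{msg['role']}_text"] += len(content)
            continue
        for block in content:
            if not isinstance(block, dict):
                # Assumption: non-dict blocks (e.g. SDK objects) are sized via str().
                sizes["other"] += len(str(block))
                continue
            kind = block.get("type", "unknown")
            if kind == "text":
                sizes[f"{msg['role']}_text"] += len(block.get("text", ""))
            elif kind == "tool_result":
                body = block.get("content", "")
                sizes["tool_result"] += len(json.dumps(body, ensure_ascii=False))
            else:
                sizes[kind] += len(json.dumps(block, ensure_ascii=False))
    return sizes
```

Calling context_breakdown(messages).most_common() every few steps shows which category dominates, which in a typical agent run is the tool_result entry.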
Putting tool results on a diet — shrinking them on the way in #
The most effective method is to never produce big results in the first place. When you build a tool, cap the size of what it returns.
MAX_RESULT_CHARS = 4000

# do_search and format_results stand in for your own search backend and formatter.
def search_docs(query: str) -> str:
    results = do_search(query)
    text = format_results(results[:10])  # top 10 only
    if len(text) > MAX_RESULT_CHARS:
        # Say that the result was cut and what to do about it.
        text = text[:MAX_RESULT_CHARS] + "\n…(Results truncated. Try again with a narrower query.)"
    return text

What matters is writing into the result both the fact that you truncated it and what to do about it. It is the same error-message principle from Part 2. Claude reads that guidance, narrows the query, and tries again. For a file-reading tool, have it take a range instead of the whole file; for a listing tool, add pagination. For each tool, leave open a path to “fetch a little at a time.”
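As one illustration of “fetch a little at a time,” here is a rough sketch of a range-based file reader. The name read_file_range, its parameters, and the 200-line cap are all assumptions made for this example. It reads the whole file into memory for simplicity, which is fine for source files but not for huge logs.

```python
from pathlib import Path

MAX_LINES = 200  # assumed cap; tune it per tool


def read_file_range(path: str, start_line: int = 1, max_lines: int = MAX_LINES) -> str:
    """Return up to max_lines lines starting at start_line (1-indexed)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    start = max(start_line, 1) - 1
    # Never return more than the hard cap, whatever the caller asks for.
    count = max(1, min(max_lines, MAX_LINES))
    chunk = lines[start:start + count]
    end = start + len(chunk)
    text = "\n".join(chunk)
    if end < len(lines):
        # Tell the model that more exists and exactly how to fetch it.
        text += (
            f"\n(Showing lines {start + 1}-{end} of {len(lines)}. "
            f"Call again with start_line={end + 1} to read more.)"
        )
    return text
```

The trailing notice plays the same role as the truncation message in search_docs: it states that the result is partial and names the next call to make.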
Clearing out old tool results #
Even if you shrink them on the way in, they still grow as steps accumulate. The second method is to empty tool results that have served their purpose, leaving only a placeholder behind. A search result has often finished its job the moment Claude reads it and decides the next action.
def prune_tool_results(messages: list, keep_recent: int = 3) -> list:
    """Replace tool result bodies with a placeholder, except for the most recent N turns."""
    pruned = []
    for i, msg in enumerate(messages):
        # One turn is an assistant message plus the user message that answers it.
        old = i < len(messages) - keep_recent * 2
        if old and msg["role"] == "user" and isinstance(msg["content"], list):
            new_content = []
            for block in msg["content"]:
                if isinstance(block, dict) and block.get("type") == "tool_result":
                    # Keep the block (and its tool_use_id); empty only the body.
                    block = {**block, "content": "(old tool result has been cleared)"}
                new_content.append(block)
            msg = {**msg, "content": new_content}
        pruned.append(msg)
    return pruned

The point is to keep the structure of the conversation (the fact that a tool was called and a result came back) and empty only the body. If you delete the structure too, the API errors out when it validates the pairing of tool_use and tool_result. Run this cleanup before the call on every loop, and the context stays at roughly the size of the recent work.
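If you want to catch a broken pairing before the API does, a small local check can help. The sketch below is an illustration only: unpaired_tool_use_ids is a hypothetical helper, it assumes blocks are plain dicts, and it assumes each tool_use must be answered by a tool_result in the very next user message.

```python
def unpaired_tool_use_ids(messages: list) -> list:
    """Return ids of tool_use blocks with no tool_result in the next message."""
    missing = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant" or not isinstance(msg["content"], list):
            continue
        used = [
            b["id"] for b in msg["content"]
            if isinstance(b, dict) and b.get("type") == "tool_use"
        ]
        if not used:
            continue
        nxt = messages[i + 1] if i + 1 < len(messages) else None
        answered = set()
        if nxt and nxt["role"] == "user" and isinstance(nxt["content"], list):
            answered = {
                b.get("tool_use_id") for b in nxt["content"]
                if isinstance(b, dict) and b.get("type") == "tool_result"
            }
        # Any call without a matching result would fail validation.
        missing.extend(t for t in used if t not in answered)
    return missing
```

Run it on the pruned list right before the call. A result whose body was replaced by the placeholder still counts as paired; a result block that was deleted outright shows up in the returned list.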
Summary compression — the past as one chunk #
If the work is so long that clearing is not enough, summarize the past wholesale and replace it. You cut off the old stretch, have Claude summarize the progress so far in a separate call, and substitute that one summary chunk for the stretch. The context of the progress survives while the bulk shrinks dramatically.
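A hand-rolled version of this might look like the sketch below. It is only one way to do it: compress_history and the injected summarize callable are hypothetical names, and summarize stands in for the separate Claude call that turns the old stretch into text. The cut point is moved so the kept tail starts with an assistant message, which keeps every remaining tool_result next to its tool_use.

```python
from typing import Callable


def compress_history(
    messages: list,
    summarize: Callable[[list], str],
    keep_recent: int = 6,
) -> list:
    """Replace all but the most recent messages with one summary message."""
    cut = len(messages) - keep_recent
    # Move the cut forward until the tail starts with an assistant message,
    # so no tool_result is separated from the tool_use it answers.
    while 0 < cut < len(messages) and messages[cut]["role"] != "assistant":
        cut += 1
    if cut <= 0 or cut >= len(messages):
        return messages  # nothing old enough to compress
    summary = summarize(messages[:cut])
    head = {"role": "user", "content": f"Summary of progress so far:\n{summary}"}
    return [head] + messages[cut:]
```

Note that the original task statement sits in the old stretch, so the summarization prompt should ask for the goal as well as the progress.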
There is also a path where you do not implement this yourself. Turn on the API’s compaction feature (beta), and the server summarizes earlier content on its own as the context approaches the limit.
response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-8",
    max_tokens=16000,
    tools=tools,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)
messages.append({"role": "assistant", "content": response.content})

Just one thing to watch. The response comes with a compaction block carrying the summary mixed in, and you must put response.content back into the conversation whole. If you extract and stack only the text, the summary state silently disappears and the compression comes undone.
A file scratchpad — writing things down outside the context #
The last method flips the premise. Instead of piling important information into the context, have the agent write it to a file outside the context. Give the agent tools to read and write a notes file, and put the usage rules in the system prompt.
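The two note tools can be very small. Here is a minimal sketch under a few assumptions: the function names read_notes and append_note, the NOTE_TOOLS list, and the append-only bullet format are all made up for illustration, and the tool definitions use the usual name / description / input_schema shape.

```python
from pathlib import Path

NOTES_PATH = Path("notes.md")  # assumed location of the scratchpad file


def read_notes(path: Path = NOTES_PATH) -> str:
    """Return the notes file, or a hint when nothing has been written yet."""
    if not path.exists():
        return "(notes.md is empty. Nothing has been recorded yet.)"
    return path.read_text(encoding="utf-8")


def append_note(note: str, path: Path = NOTES_PATH) -> str:
    """Append one bullet to the notes file and confirm with a short result."""
    with path.open("a", encoding="utf-8") as f:
        f.write(f"- {note.strip()}\n")
    return "Saved to notes.md."


# Tool definitions to pass alongside the agent's other tools.
NOTE_TOOLS = [
    {
        "name": "read_notes",
        "description": "Read notes.md, the scratchpad of facts and remaining to-dos.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "append_note",
        "description": "Append one important fact or to-do to notes.md.",
        "input_schema": {
            "type": "object",
            "properties": {"note": {"type": "string"}},
            "required": ["note"],
        },
    },
]
```

Both tools return short strings, so they add almost nothing to the context themselves.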
SYSTEM = """...
- Record important facts learned during the work and remaining to-dos in notes.md.
- When resuming long-running work, read notes.md first before starting.
"""

This way, even if you aggressively clear or summarize tool results, the key information stays alive in the file. Think of the context as the screen you are looking at right now, and the file as a notebook. Files survive a process restart, so this also becomes the foundation for long work that spans sessions.
What to apply first #
You do not need to build all four. In order of effect versus cost, here is what I recommend.
- Tool result caps — a few lines of tool code, and the biggest effect. Always apply this.
- Clearing old results — add it if work spanning dozens of steps is common.
- Compaction — turn it on if the work actually reaches the context limit itself.
- Scratchpad — adopt it for work that spans sessions or runs for days.
Common context pitfalls #
- Stacking only extracted text — with or without compaction, response.content can contain blocks other than text. Preserve the content whole in the conversation.
- Breaking the tool_result pairing — if you delete the tool_result blocks themselves in the name of cleanup, the remaining tool_use blocks are left without a match and you get a 400 error. Empty only the body and keep the block.
- Truncating without saying so — if you cut a result silently, Claude believes that is all there is and proceeds. Write into the result the fact that it was truncated and how to fetch more.
Key takeaways #
In this post we covered four ways to cope with the context in long-running work.
- The main culprit for bulk is tool results. Making tools return less from the start is the cheapest fix.
- Empty the bodies of results that have served their purpose; if it grows longer still, replace the past with a summary or turn on server-side compaction.
- Have important facts written to a file scratchpad so they are preserved outside the context.
So far the story has been about growing a single agent. In the next post, “AI Agent Development #5: Dividing Work with Subagents,” we cover splitting work across multiple agents — which is also another answer to the context problem.