LLM App Development #3: Streaming Responses in Real Time
Through Part 2, every call waited until the response was fully generated and then received it all at once. That is fine for short answers, but a long one leaves the screen frozen for several seconds. This post covers streaming, where you receive tokens as they are generated and write them to the screen right away.
Why streaming #
messages.create returns the result only after Claude has finished the entire answer. The longer the answer, the longer you wait, and the user stares at a blank screen meanwhile.
Streaming works differently. As Claude generates tokens, you receive them and print them immediately, the way text trickles out in ChatGPT or the Claude web app. The total time to finish the answer is about the same, but because the first characters appear much sooner, the perceived delay drops sharply. To the user it feels like “answering now” rather than “frozen.”
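The difference between total time and time to the first visible character can be felt without calling the API at all. Below is a minimal simulation, not a measurement of Claude: `fake_generate`, `wait_for_all`, and `show_as_generated` are made-up names, and `time.sleep` stands in for generation time.

```python
import time

PIECES = ["Once ", "upon ", "a ", "time, ", "far ", "from ", "Earth, ", "a ", "probe ", "woke."]


def fake_generate(pieces, delay=0.01):
    """Hypothetical stand-in for a model: yields one piece at a time with a pause."""
    for piece in pieces:
        time.sleep(delay)  # pretend this is generation time
        yield piece


def wait_for_all(pieces):
    """Non-streaming style: nothing is visible until every piece has arrived."""
    start = time.perf_counter()
    text = "".join(fake_generate(pieces))
    first_visible = time.perf_counter() - start  # the first character shows up only now
    return text, first_visible


def show_as_generated(pieces):
    """Streaming style: the first piece is visible as soon as it arrives."""
    start = time.perf_counter()
    first_visible = None
    parts = []
    for piece in fake_generate(pieces):
        if first_visible is None:
            first_visible = time.perf_counter() - start
        parts.append(piece)
    total = time.perf_counter() - start
    return "".join(parts), first_visible, total


if __name__ == "__main__":
    _, blocked = wait_for_all(PIECES)
    _, first, total = show_as_generated(PIECES)
    print(f"all at once: first character after {blocked:.3f}s")
    print(f"streaming:   first character after {first:.3f}s, finished after {total:.3f}s")
```

Both functions take roughly the same time to finish; only the moment the first character becomes visible moves.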
Basic streaming #
Streaming uses messages.stream with a with statement instead of messages.create.
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {"role": "user", "content": "Tell me a short story about space."}
    ],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

stream.text_stream hands you the generated text pieces in order. You receive them in a loop and print right away. end="" on print keeps line breaks from sneaking in between pieces, and flush=True pushes to the screen immediately instead of buffering. The text then appears as a flow.
The answer itself is the same as with messages.create. Only the way you receive it differs. create gives it all at once after it is built; stream gives it bit by bit as it is built.
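To see what `end=""` changes, here is a small offline sketch that prints a hand-written list of pieces into an in-memory buffer instead of the terminal. The `render` helper and the sample pieces are made up for illustration; on an in-memory buffer `flush=True` has no visible effect, so only the `end` behavior shows up here.

```python
import io

# Hand-written pieces standing in for what text_stream might deliver.
pieces = ["Far ", "beyond ", "Mars", ", a ", "probe ", "woke."]


def render(pieces, **print_kwargs):
    """Print each piece into a buffer and return what a terminal would have shown."""
    buf = io.StringIO()
    for piece in pieces:
        print(piece, file=buf, **print_kwargs)
    return buf.getvalue()


default_output = render(pieces)                     # print's default end="\n"
stream_output = render(pieces, end="", flush=True)  # the settings used in the streaming loop

print(repr(default_output))
print(repr(stream_output))
```

With the default `end`, every piece lands on its own line, which breaks words and sentences apart wherever a piece happens to end.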
Getting the full response after it ends #
During streaming, only text pieces flow in. But sometimes you need the complete message after it ends, for example the number of tokens used. For that, use get_final_message.
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {"role": "user", "content": "Tell me a short story about space."}
    ],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    final = stream.get_final_message()

print(f"\n\nOutput tokens used: {final.usage.output_tokens}")

final has the same structure as the response from Parts 1 and 2. It holds the list of content blocks and information like usage as is. You stream to the screen in real time, and after it ends you can still have the complete response in hand.
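It may help to picture how text pieces and a complete message can both come out of one stream. The sketch below is a simplified stand-in, not the SDK's implementation: `TinyStream` is a hypothetical class, the events are hand-written dicts modeled on the event names the streaming API documents (`message_start`, `content_block_delta`, `message_delta`, and so on), the token counts are invented, and only text blocks are handled. Unlike the SDK's final message, which uses attribute access, this one is a plain dict.

```python
# Hand-written events; the usage numbers are invented for illustration.
EVENTS = [
    {"type": "message_start", "message": {"usage": {"input_tokens": 12, "output_tokens": 1}}},
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": ", space."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 5}},
    {"type": "message_stop"},
]


class TinyStream:
    """Hypothetical, simplified stream: yields text pieces while assembling the message."""

    def __init__(self, events):
        self._events = iter(events)
        self._message = None

    def _apply(self, event):
        """Fold one event into the message; return the text piece if the event carried one."""
        kind = event["type"]
        if kind == "message_start":
            self._message = {
                "role": "assistant",
                "content": [],
                "usage": dict(event["message"]["usage"]),
            }
        elif kind == "content_block_start":
            self._message["content"].append(dict(event["content_block"]))
        elif kind == "content_block_delta" and event["delta"]["type"] == "text_delta":
            self._message["content"][event["index"]]["text"] += event["delta"]["text"]
            return event["delta"]["text"]
        elif kind == "message_delta":
            self._message["usage"].update(event["usage"])
        return None

    @property
    def text_stream(self):
        for event in self._events:
            piece = self._apply(event)
            if piece:
                yield piece

    def get_final_message(self):
        # Drain any events not yet consumed, then return the assembled message.
        for event in self._events:
            self._apply(event)
        return self._message


stream = TinyStream(EVENTS)
pieces = list(stream.text_stream)
final = stream.get_final_message()
print(pieces, final["usage"])
```

The same events feed both views: the text deltas are handed out one by one, and everything else is quietly folded into the message that is available at the end.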
For finer control, you can iterate over stream directly instead of text_stream and check the event type (event.type). You can tell where a text block starts, whether it is a reasoning block, and so on. This finer control only becomes necessary at the tool-calling (#6) or agent (#10) stage; for now text_stream is enough.

Applying streaming to multi-turn conversations #
Putting streaming on top of the multi-turn conversation from Part 2 brings it close to a real chatbot. You stack the exchanged messages in a list while flowing the answer in real time.
import anthropic

client = anthropic.Anthropic()
messages = []


def chat(user_input: str) -> None:
    messages.append({"role": "user", "content": user_input})
    answer = ""
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=messages,
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            answer += text
    print()
    messages.append({"role": "assistant", "content": answer})


chat("Show me Python code that prints a triangle of stars.")
chat("Wrap that code in a function.")

The text flows to the screen in real time while you also collect the full answer in answer. When streaming ends, you add that answer to the list as an assistant message. This is what lets Claude remember the previous answer on the next turn. It is exactly the accumulation pattern from Part 2, with only the way of receiving changed to streaming.
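The accumulation logic can be checked without spending tokens by swapping the API call for a fake. This is a sketch under assumptions: `fake_stream`, `chat_turn`, and the `stream_fn` and `out` parameters are hypothetical names introduced here, and the fake reply text is invented.

```python
import io


def fake_stream(messages):
    """Hypothetical stand-in for the model: replies in pieces, noting how many user turns it saw."""
    turns = sum(1 for m in messages if m["role"] == "user")
    yield from ["Reply ", "to ", f"turn {turns}."]


def chat_turn(messages, user_input, stream_fn, out):
    """Append the user turn, stream the reply to `out`, then append the collected reply."""
    messages.append({"role": "user", "content": user_input})
    parts = []
    for text in stream_fn(messages):
        print(text, end="", flush=True, file=out)  # show each piece immediately
        parts.append(text)                         # and keep it for the history
    print(file=out)
    answer = "".join(parts)
    messages.append({"role": "assistant", "content": answer})
    return answer


history = []
screen = io.StringIO()  # stands in for the terminal
chat_turn(history, "Show me Python code that prints a triangle of stars.", fake_stream, screen)
chat_turn(history, "Wrap that code in a function.", fake_stream, screen)

for message in history:
    print(message["role"], "->", message["content"])
```

Passing the stream source in as a parameter is one possible design choice: the same function could be handed a wrapper around the real streaming call later, while tests keep using the fake.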
Streaming asynchronously #
In environments that handle many requests at once, like web servers, an async approach fits better. The SDK provides an AsyncAnthropic client, and its usage is almost the same as the synchronous version. with becomes async with, and the loop becomes async for.
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()


async def main():
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[
            {"role": "user", "content": "Tell me a short story about space."}
        ],
    ) as stream:
        async for text in stream.text_stream:
            print(text, end="", flush=True)


asyncio.run(main())

The structure matches the synchronous version line for line. For a web service with many concurrent connections, writing it async from the start pays off later.
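What async buys you shows up once two streams run at the same time. The sketch below uses no API calls: `fake_stream`, `consume`, and `serve_two` are hypothetical names, and `asyncio.sleep` stands in for waiting on the network. While one stream waits for its next piece, the event loop is free to advance the other.

```python
import asyncio


async def fake_stream(name, count, delay=0.01):
    """Hypothetical stand-in for an async text stream."""
    for i in range(count):
        await asyncio.sleep(delay)  # while this stream waits, other tasks can run
        yield f"{name}{i}"


async def consume(name, count, log):
    """Read one stream to the end, recording the arrival order of its pieces."""
    async for piece in fake_stream(name, count):
        log.append(piece)


async def serve_two():
    """Run two streams concurrently, as a server would for two users."""
    log = []
    await asyncio.gather(consume("A", 3, log), consume("B", 3, log))
    return log


log = asyncio.run(serve_two())
print(log)
```

In the recorded order, pieces from A and B are interleaved rather than one stream finishing before the other starts, which is the behavior a server with many open connections relies on.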
Flowing all the way to the web screen #
So far we printed to the console on the server. To flow all the way to the user’s browser in real time, you connect text_stream to a streaming response the server sends to the client. With FastAPI you use StreamingResponse.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic

app = FastAPI()
client = anthropic.AsyncAnthropic()


@app.get("/chat")
async def chat(q: str):
    async def generate():
        async with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=[{"role": "user", "content": q}],
        ) as stream:
            async for text in stream.text_stream:
                yield text

    return StreamingResponse(generate(), media_type="text/plain")

Each time generate yields a piece of text, that piece flows to the browser. Real services usually wrap it in SSE (Server-Sent Events) format, but the skeleton is this simple. The frontend receives this response piece by piece and appends it to the screen.
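For readers curious what the SSE wrapping might look like, here is a minimal sketch of the framing only, kept free of FastAPI so it runs on its own. `to_sse` and `sse_wrap` are hypothetical helpers, and the closing `done` event with a `[DONE]` payload is an invented convention, not something the SSE format requires. The one rule worth showing is that a piece containing a line break has to be split across several `data:` lines.

```python
import asyncio


def to_sse(text, event=None):
    """Format one piece of text as a Server-Sent Events message."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A payload containing newlines must be sent as one data: line per line of text.
    for line in text.split("\n"):
        lines.append(f"data: {line}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


async def sse_wrap(pieces):
    """Wrap an async iterator of text pieces in SSE framing."""
    async for piece in pieces:
        yield to_sse(piece)
    # Made-up convention so the client knows the stream ended on purpose.
    yield to_sse("[DONE]", event="done")


async def demo():
    async def pieces():
        # Hand-written pieces; the second one contains a line break.
        for piece in ["Hello", " world\nbye"]:
            yield piece

    return [chunk async for chunk in sse_wrap(pieces())]


chunks = asyncio.run(demo())
for chunk in chunks:
    print(repr(chunk))
```

In the endpoint above, this would amount to yielding `to_sse(text)` instead of `text` and setting `media_type="text/event-stream"`. Always writing `data: ` with a single space matters too: SSE clients strip one leading space, so a piece that itself begins with a space keeps it.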
Where people commonly trip up #
- Using it without with: messages.stream is a context manager. Use it with a with statement so the stream is cleaned up properly after it ends.
- The text does not appear in real time: Without flush=True, output may pile up in the terminal buffer and come out all at once. If pieces appear in clumps, check flush first.
- Using the async client in sync code: AsyncAnthropic only works in an async/await context. In a plain script, use the synchronous Anthropic, or call it inside an async function wrapped in asyncio.run, as in the example above.
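The last point can be reproduced without the SDK. In this sketch `fetch_reply` is a hypothetical stand-in for any awaitable client call; it shows what you get when an async function is called from plain synchronous code without being awaited, and how `asyncio.run` drives it to completion instead.

```python
import asyncio
import inspect


async def fetch_reply():
    """Hypothetical stand-in for an awaitable client call."""
    await asyncio.sleep(0)
    return "hello"


# Called without await, the function body has not run: this is a coroutine object.
not_a_reply = fetch_reply()
is_coroutine = inspect.iscoroutine(not_a_reply)
not_a_reply.close()  # discard it cleanly so Python does not warn that it was never awaited

# From synchronous code, asyncio.run executes the coroutine and returns its result.
reply = asyncio.run(fetch_reply())

print(is_coroutine, reply)
```

If a variable that should hold a response turns out to be a coroutine object, a missing `await` or a missing `asyncio.run` is the usual suspect.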
Wrapping up #
In this post we covered streaming, which receives and prints the response in real time.
- Streaming receives tokens as they are generated, cutting the perceived delay to the first characters.
- Use messages.stream with with and loop over text_stream.
- When putting it on a multi-turn conversation, add the answer you collected while streaming back as an assistant message.
- In async environments, use AsyncAnthropic with async with and async for.
- If you need the whole completed response afterward, use get_final_message.
- For the web, connect text_stream to a streaming response like StreamingResponse.
In the next post, “LLM App Development #4: Prompt Engineering in Practice,” we will cover how the quality of the answer changes depending on how you ask the same question, and how to write prompts that reliably draw out the result you want.