LLM App Development #3: Streaming Responses in Real Time


Through Part 2, every call waited until the response was complete and then returned it all at once. That is fine for short answers, but a long answer leaves the screen frozen for several seconds. This post covers streaming: receiving text as it is generated and showing it on screen right away.

Why streaming #

messages.create returns the result only after Claude has finished the entire answer. The longer the answer, the longer you wait, and the user stares at a blank screen meanwhile.

But streaming works differently. As Claude generates tokens one by one, you receive them and print them immediately. It is the way text trickles out in ChatGPT or the Claude web app. The total time to finish the answer is similar, but because the first characters appear much sooner, the perceived delay drops sharply. To the user it feels like “answering now” rather than “frozen.”
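To make the difference in perceived delay concrete without calling the API, here is a small simulation. fake_token_stream and measure are hypothetical names invented for this sketch, and the 20 ms delay is an arbitrary assumption, not a measurement of Claude. The point is only that the first piece arrives long before the last one.

```python
import time
from typing import Iterator, List, Optional, Tuple


def fake_token_stream(pieces: List[str], delay: float = 0.02) -> Iterator[str]:
    """Hypothetical stand-in for a model: emits one piece every `delay` seconds."""
    for piece in pieces:
        time.sleep(delay)
        yield piece


def measure(pieces: List[str], delay: float = 0.02) -> Tuple[Optional[float], float]:
    """Return (seconds until the first piece, seconds until the last piece)."""
    start = time.perf_counter()
    first_at = None
    for _ in fake_token_stream(pieces, delay):
        if first_at is None:
            # This is the moment a streaming UI could start showing text.
            first_at = time.perf_counter() - start
    total = time.perf_counter() - start
    return first_at, total


if __name__ == "__main__":
    first, total = measure(["Once ", "upon ", "a ", "time ", "in ", "space."])
    print(f"first piece after {first:.2f}s, full answer after {total:.2f}s")
```

A non-streaming call corresponds to waiting for the second number before showing anything; a streaming call can start printing at the first.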

Basic streaming #

Streaming uses messages.stream with a with statement instead of messages.create.

streaming.py
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {"role": "user", "content": "Tell me a short story about space."}
    ],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

stream.text_stream hands you the generated text pieces in order. You receive them in a loop and print right away. end="" on print keeps line breaks from sneaking in between pieces, and flush=True pushes to the screen immediately instead of buffering. The text then appears as a flow.
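The effect of end="" is easy to check offline. The sketch below prints a few made-up pieces into an in-memory buffer instead of the terminal; render and the sample pieces are hypothetical, introduced only for illustration.

```python
import io

pieces = ["Once ", "upon ", "a ", "time."]


def render(pieces, end):
    """Print each piece into an in-memory buffer and return what was written."""
    buffer = io.StringIO()
    for piece in pieces:
        print(piece, end=end, file=buffer, flush=True)
    return buffer.getvalue()


# With print's default end, every piece lands on its own line.
print(repr(render(pieces, end="\n")))
# With end="", the pieces join into one continuous line of text.
print(repr(render(pieces, end="")))
```

flush=True makes no visible difference with an in-memory buffer; it matters on a real terminal or pipe, where output may otherwise be held back.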

The answer itself is the same as with messages.create. Only the way you receive it differs. create gives it all at once after it is built; stream gives it bit by bit as it is built.

Getting the full response after it ends #

While streaming, text_stream delivers only the text pieces. Sometimes you also need the complete message once streaming ends, for example to read the number of tokens used. For that, use get_final_message.

streaming_final.py
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {"role": "user", "content": "Tell me a short story about space."}
    ],
) as stream:
    # Print each piece as it arrives.
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Once the stream is exhausted, fetch the assembled message.
    final = stream.get_final_message()
    print(f"\n\nOutput tokens used: {final.usage.output_tokens}")

final has the same structure as the response from Parts 1 and 2. It holds the list of content blocks and information like usage as is. You stream to the screen in real time, and after it ends you can still have the complete response in hand.
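As a rough sketch of what you can read from that final message, the helper below pulls out the text, the stop reason, and the token counts. summarize and fake_final are hypothetical names for this example, and the values inside fake_final are made up so the code runs without an API call; with a real stream you would pass the object returned by get_final_message instead.

```python
from types import SimpleNamespace


def summarize(final) -> dict:
    """Pull the commonly needed fields out of a completed message."""
    # Only text blocks carry a .text attribute, so filter on the block type.
    text = "".join(block.text for block in final.content if block.type == "text")
    return {
        "text": text,
        "stop_reason": final.stop_reason,
        "input_tokens": final.usage.input_tokens,
        "output_tokens": final.usage.output_tokens,
    }


# Stand-in shaped like a completed message (invented values, for offline use).
fake_final = SimpleNamespace(
    content=[SimpleNamespace(type="text", text="Once upon a time in space.")],
    stop_reason="end_turn",
    usage=SimpleNamespace(input_tokens=14, output_tokens=8),
)

print(summarize(fake_final))
```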

Note
For finer control, you can iterate over stream directly instead of text_stream and check each event's type (event.type). That tells you where a text block starts, whether a block is a thinking (reasoning) block, and so on. This level of control only becomes necessary at the tool-calling (#6) or agent (#10) stage; for now text_stream is enough.
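For the curious, event-level handling might look roughly like the sketch below. describe_events is a hypothetical helper, and the events fed to it here are hand-built stand-ins so the code runs offline. The event type names it checks (content_block_start, content_block_delta, content_block_stop, message_stop) follow the streaming event names of the Messages API; any other event types are simply ignored.

```python
from types import SimpleNamespace


def describe_events(events):
    """Walk an iterable of stream events; return (log lines, collected text)."""
    log, text = [], []
    for event in events:
        if event.type == "content_block_start":
            # content_block.type says what kind of block is beginning.
            log.append(f"block {event.index} starts: {event.content_block.type}")
        elif event.type == "content_block_delta" and event.delta.type == "text_delta":
            text.append(event.delta.text)
        elif event.type == "content_block_stop":
            log.append(f"block {event.index} ends")
        elif event.type == "message_stop":
            log.append("message complete")
    return log, "".join(text)


# Hand-built stand-ins for real events (assumed shapes, for illustration only).
fake_events = [
    SimpleNamespace(type="content_block_start", index=0,
                    content_block=SimpleNamespace(type="text")),
    SimpleNamespace(type="content_block_delta", index=0,
                    delta=SimpleNamespace(type="text_delta", text="Hello, ")),
    SimpleNamespace(type="content_block_delta", index=0,
                    delta=SimpleNamespace(type="text_delta", text="space.")),
    SimpleNamespace(type="content_block_stop", index=0),
    SimpleNamespace(type="message_stop"),
]

log, text = describe_events(fake_events)
print(log)
print(text)
```

With a real stream, the same function could be called as describe_events(stream) inside the with block.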

Applying streaming to multi-turn conversations #

Putting streaming on top of the multi-turn conversation from Part 2 brings it close to a real chatbot. You keep the exchanged messages in a list while streaming each answer in real time.

streaming_chat.py
import anthropic

client = anthropic.Anthropic()
messages = []

def chat(user_input: str) -> None:
    messages.append({"role": "user", "content": user_input})

    answer = ""
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=messages,
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            answer += text
    print()

    messages.append({"role": "assistant", "content": answer})

chat("Show me Python code that prints a triangle of stars.")
chat("Wrap that code in a function.")

The text flows to the screen in real time while you also collect the full answer in answer. When streaming ends, you add that answer to the list as an assistant message. This is what lets Claude remember the previous answer on the next turn. It is exactly the accumulation pattern from Part 2, with only the way of receiving changed to streaming.

Streaming asynchronously #

In environments that handle many requests at once, like web servers, an async approach fits better. The SDK provides an AsyncAnthropic client, and its usage is almost the same as the synchronous version. with becomes async with, and the loop becomes async for.

async_streaming.py
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def main():
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[
            {"role": "user", "content": "Tell me a short story about space."}
        ],
    ) as stream:
        async for text in stream.text_stream:
            print(text, end="", flush=True)

asyncio.run(main())

The skeleton matches the synchronous version step for step. For a web service with many concurrent connections, writing it async from the start pays off later.
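To see why async suits many simultaneous requests, the sketch below runs two streams at once with asyncio.gather. fake_stream is a hypothetical stand-in for stream.text_stream (it just sleeps between pieces), so the example runs without an API key; consume and the labels "A" and "B" are likewise invented for illustration.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(pieces: List[str], delay: float = 0.01) -> AsyncIterator[str]:
    """Hypothetical stand-in for stream.text_stream: yields a piece, then waits."""
    for piece in pieces:
        await asyncio.sleep(delay)
        yield piece


async def consume(label: str, pieces: List[str], order: List[str]) -> str:
    """Collect one stream's answer, recording which stream produced each piece."""
    answer = ""
    async for piece in fake_stream(pieces):
        order.append(label)
        answer += piece
    return answer


async def main():
    order: List[str] = []
    # Both coroutines make progress while the other one is waiting.
    answers = await asyncio.gather(
        consume("A", ["Hello ", "from ", "A"], order),
        consume("B", ["Hello ", "from ", "B"], order),
    )
    return answers, order


if __name__ == "__main__":
    answers, order = asyncio.run(main())
    print(answers)
    print(order)
```

The order list shows pieces from the two streams arriving interleaved rather than one stream finishing before the other starts, which is the behavior a server with many open connections relies on.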

Flowing all the way to the web screen #

So far we have printed to the console on the server. To stream all the way to the user’s browser in real time, connect text_stream to a streaming response that the server sends to the client. With FastAPI, use StreamingResponse.

fastapi_streaming.py
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic

app = FastAPI()
client = anthropic.AsyncAnthropic()

@app.get("/chat")
async def chat(q: str):
    async def generate():
        async with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=[{"role": "user", "content": q}],
        ) as stream:
            async for text in stream.text_stream:
                yield text

    return StreamingResponse(generate(), media_type="text/plain")

Each time generate yields a piece of text, that piece is sent to the browser. Real services usually wrap it in SSE (Server-Sent Events) format, but the skeleton is this simple. The frontend reads the response piece by piece and appends it to the screen.
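For reference, SSE wrapping could look something like the sketch below. sse_event and as_sse are hypothetical helpers, and the closing "done" event is a convention assumed for this example, not something SSE or the SDK requires. The framing itself (a "data:" field per line, a blank line to end each event) follows the Server-Sent Events format.

```python
import asyncio
from typing import AsyncIterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Format one Server-Sent Event; a blank line terminates the event."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload needs its own "data:" field.
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"


async def as_sse(pieces: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap each streamed piece as an SSE event, then signal completion."""
    async for piece in pieces:
        yield sse_event(piece)
    # Assumed convention: a final named event so the client knows to stop.
    yield sse_event("", event="done")


async def demo():
    async def fake_pieces():
        # Stand-in for text_stream so the sketch runs offline.
        for piece in ["Hello", " world\n"]:
            yield piece

    return [chunk async for chunk in as_sse(fake_pieces())]


if __name__ == "__main__":
    for chunk in asyncio.run(demo()):
        print(repr(chunk))
```

In the FastAPI example, this would mean returning StreamingResponse(as_sse(generate()), media_type="text/event-stream"), with the browser reading the events through EventSource or a fetch-based reader.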

Where people commonly trip up #

  • Using it without a with statement: messages.stream is a context manager. Use it in a with statement so the stream is cleaned up properly after it ends.
  • The text does not appear in real time: Without flush=True, output may pile up in the terminal buffer and come out all at once. If pieces appear in clumps, check flush first.
  • Using the async client in sync code: AsyncAnthropic only works in an async/await context. In a plain script, use the synchronous Anthropic, or call it inside an async function wrapped in asyncio.run, as in the example above.

Wrapping up #

In this post we covered streaming, which receives and prints the response in real time.

  • Streaming receives tokens as they are generated, cutting the wait before the first characters appear.
  • Use messages.stream with with and loop over text_stream.
  • When putting it on a multi-turn conversation, add the answer you collected while streaming back as an assistant message.
  • In async environments, use AsyncAnthropic with async with and async for.
  • If you need the whole completed response afterward, use get_final_message.
  • For the web, connect text_stream to a streaming response like StreamingResponse.

In the next post, “LLM App Development #4: Prompt Engineering in Practice,” we will cover how the quality of the answer changes depending on how you ask the same question, and how to write prompts that reliably draw out the result you want.
