LLM App Development #2: Understanding Messages and Parameters

6 min read

Part 1 got us through our first call. There we put just a single user message into messages and moved on. This time we take a proper look at the structure of messages and the main parameters you pass alongside it. You need these to convey context and instructions to Claude precisely.

A conversation is a list of messages

messages is, as the name says, a list of messages. Each message consists of a role that says who spoke and the content itself. There are two roles.

  • user — what the user said
  • assistant — Claude’s reply

In Part 1 we included only a user message. To continue a conversation, however, you have to tell Claude what was said before. This is where an important property of LLM APIs comes in: the API does not remember previous turns. Each call is independent, and the server does not keep what was said last time. So to continue a conversation, you resend the entire history on every call.

multi_turn.py
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        # Earlier turns, replayed so Claude can see them
        {"role": "user", "content": "My name is Minsu."},
        {"role": "assistant", "content": "Hello Minsu, nice to meet you."},
        # The new question for this call
        {"role": "user", "content": "What did I say my name was?"},
    ],
)

for block in response.content:
    if block.type == "text":
        print(block.text)

Notice that we put an assistant message into the list ourselves to replay the earlier answer. Claude reads the entire messages list before responding, so it can recall the name “Minsu” correctly. Had we sent only the last user message, Claude would have no way to know the name.

So when you build a chatbot, you accumulate the exchanged messages in a list and send the whole list on every call. As the conversation grows, so does the list, and since every message in it is counted as input tokens on each call, the longer it gets, the more each call costs. We will cover how to manage that in #9.
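To see why cost climbs, here is a rough back-of-the-envelope sketch. It assumes hypothetical per-turn token counts and ignores the system prompt and any per-message overhead, so treat the numbers as illustrative only. Each call's input is every earlier user and assistant message plus the new user message.

```python
def input_tokens_per_call(turns: list[tuple[int, int]]) -> list[int]:
    """Estimate the input tokens sent on each call (illustrative only).

    `turns` is a hypothetical list of (user_tokens, assistant_tokens) pairs,
    one per turn. Because the API is stateless, each call resends every
    earlier message plus the new user message.
    """
    per_call = []
    history = 0  # tokens in all messages exchanged before this turn
    for user_tokens, assistant_tokens in turns:
        per_call.append(history + user_tokens)  # full history + new question
        history += user_tokens + assistant_tokens  # the answer joins the history
    return per_call


# Ten identical turns: 50-token questions, 200-token answers (made-up sizes).
calls = input_tokens_per_call([(50, 200)] * 10)
print(calls[0], calls[-1], sum(calls))  # 50 2300 11750
```

Under these assumed sizes, the first call sends 50 input tokens and the tenth sends 2,300, so the total input across a conversation grows roughly with the square of the number of turns.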

Continuing the conversation in code

The example above wrote messages by hand. In a real chatbot you add to this list as the conversation goes. The pattern is simple. When the user speaks, append a user message; when you get the answer back, append it as an assistant message.

conversation_loop.py
import anthropic

client = anthropic.Anthropic()
messages = []

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=messages,
    )
    answer = next(b.text for b in response.content if b.type == "text")

    messages.append({"role": "assistant", "content": answer})
    return answer

print(chat("My name is Minsu."))
print(chat("What did I say my name was?"))  # remembers "Minsu"

The key part is always appending the answer (assistant) back into messages. Drop this one line and on the next call Claude does not know what it just said. Only the user inputs accumulate while the answers do not, so the conversation ends up one-sided.
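The chat() function above keeps its history in a module-level list, which is fine for a script. As a sketch of one alternative, the hypothetical Conversation class below holds its own history and receives the API call as a plain function, so each user can have a separate history and the append logic can be tested without a network call. It also records the user message only after the call succeeds, so a failed request does not leave a dangling user turn. The names Conversation, send, and fake_send are illustrative, not part of the SDK.

```python
from typing import Callable

Message = dict[str, str]


class Conversation:
    """Keeps the full message history for one chat (hypothetical helper)."""

    def __init__(self, send: Callable[[list[Message]], str]) -> None:
        # `send` takes the full message list and returns Claude's reply text.
        self.send = send
        self.messages: list[Message] = []

    def chat(self, user_input: str) -> str:
        pending = self.messages + [{"role": "user", "content": user_input}]
        answer = self.send(pending)  # the whole history goes out every call
        # Record both turns only once the call has succeeded.
        self.messages = pending + [{"role": "assistant", "content": answer}]
        return answer


# With the real SDK, `send` could wrap the call from conversation_loop.py:
# def send(messages):
#     response = client.messages.create(
#         model="claude-sonnet-4-6", max_tokens=1024, messages=messages)
#     return next(b.text for b in response.content if b.type == "text")


def fake_send(messages: list[Message]) -> str:
    # Stand-in for the API: reports how many messages it received.
    return f"saw {len(messages)} message(s)"


convo = Conversation(fake_send)
print(convo.chat("My name is Minsu."))            # saw 1 message(s)
print(convo.chat("What did I say my name was?"))  # saw 3 message(s)
```

The second call receives three messages (the first question, the recorded answer, and the new question), which is exactly the history the real API needs to answer correctly.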

Setting a role and rules with the system prompt

system is a parameter separate from messages, used to tell Claude its role and rules up front.

system_prompt.py
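import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    # Role and rules: a top-level parameter, not an entry in messages
    system="You are a friendly Python instructor for beginners. Always include a short code example in your answers.",
    messages=[
        {"role": "user", "content": "Show me how to sort a list."}
    ],
)
for block in response.content:
    if block.type == "text":
        print(block.text)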

The instructions you write in system apply to every answer in the conversation. Even as the user message changes each time, the system guidance stays, as long as you pass the same system value on every call: like the message history, it is not remembered between calls. In the example above, whatever the user asks, Claude answers in the voice of a friendly instructor, with example code.

It helps to separate the roles of system and user.

  • system — the role, tone, and output rules that persist throughout the conversation
  • user — the concrete request for this turn

In other words, you do not need to repeat “you are a Python instructor” in every user message. Write it once as the system value and send it with each call.
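Because the API does not remember anything between calls, the system prompt has to travel with every request, just like the history. One way to make that hard to forget is to build the request arguments in one place. The helper below is a hypothetical sketch: build_request and SYSTEM are illustrative names, and the function only returns the keyword arguments you would pass to client.messages.create, attaching the same system string each time.

```python
SYSTEM = (
    "You are a friendly Python instructor for beginners. "
    "Always include a short code example in your answers."
)


def build_request(messages: list[dict], max_tokens: int = 1024) -> dict:
    """Return keyword arguments for client.messages.create (hypothetical helper)."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": max_tokens,
        "system": SYSTEM,      # top-level parameter, resent on every call
        "messages": messages,  # only user/assistant turns go here
    }


history = [{"role": "user", "content": "Show me how to sort a list."}]
kwargs = build_request(history)
# response = client.messages.create(**kwargs)  # real call, needs an API key
print(kwargs["system"] == SYSTEM, len(kwargs["messages"]))  # True 1
```

Keeping the role and rules in one constant also means changing the instructor's tone is a one-line edit rather than a search through every call site.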

max_tokens — the cap on response length

max_tokens is the upper limit on how many tokens the response may generate. As we saw in Part 1, when the response reaches this limit, it is cut off mid-way.

Think of a token as a chunk of text, often a whole word or part of one. In English a word is roughly 1 to 2 tokens, while text in languages like Korean usually needs more tokens to say the same thing. You do not need to worry about exact conversion now; “longer means more tokens” is enough.

Setting it too small cuts the answer off, but setting it large does no harm. You are billed for the output tokens actually generated, and max_tokens is merely the ceiling on how far generation may go. So people usually give a generous 1024 or more, staying within the model's maximum output limit, since larger values are rejected.

response.stop_reason tells you whether the response was cut off.

  • end_turn — Claude finished what it had to say naturally.
  • max_tokens — it hit the limit and was cut off mid-way. Raise the value.
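In code, it is worth checking stop_reason before trusting the text. The sketch below joins the text blocks of a response and reports whether it was truncated. It reads only the content and stop_reason attributes already used in this post, so it should work on a real response object; to stay runnable without an API key, the usage lines build a stand-in response with SimpleNamespace. The function name read_reply is hypothetical.

```python
from types import SimpleNamespace


def read_reply(response) -> tuple[str, bool]:
    """Return (text, truncated) for a Messages API response (hypothetical helper)."""
    # Keep only text blocks; other block types have no .text to join.
    text = "".join(block.text for block in response.content if block.type == "text")
    truncated = response.stop_reason == "max_tokens"  # hit the max_tokens cap
    return text, truncated


# Stand-in for a response that ran out of room mid-answer.
fake = SimpleNamespace(
    content=[SimpleNamespace(type="text", text="To sort a list, call")],
    stop_reason="max_tokens",
)
text, truncated = read_reply(fake)
if truncated:
    print("Cut off; raise max_tokens and retry.")
```

Returning a flag instead of raising leaves the decision to the caller: a chatbot might show the partial text with a warning, while a batch job might retry with a higher limit.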

temperature — randomness in the answer

temperature controls how varied the answer is. The range is 0 to 1: near 0 you get similar, stable answers each time; near 1 you get more varied, creative ones. If you omit it, the default is 1.

temperature.py
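import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    temperature=0.0,  # near 0: stable, consistent wording
    messages=[
        {"role": "user", "content": "Write a one-line commit message for a bug fix."}
    ],
)
for block in response.content:
    if block.type == "text":
        print(block.text)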

Which value to use depends on the nature of the task.

  • Low (near 0) — when consistency and accuracy matter, like fact extraction, classification, or code generation
  • High (near 1) — when you want varied expression, like ad copy or brainstorming

One caveat: temperature=0 does not guarantee a character-for-character identical answer every time. It means “more stable,” not “exactly the same.”
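One way to apply this guideline is to choose the temperature from the kind of task instead of scattering numbers through your code. The task names and values below are illustrative assumptions, not recommended settings; the only hard constraint the sketch checks is the 0 to 1 range described above, which matters when someone supplies their own table.

```python
# Hypothetical mapping from task type to temperature, following the guideline above.
TASK_TEMPERATURE: dict[str, float] = {
    "extraction": 0.0,
    "classification": 0.0,
    "code": 0.0,
    "ad_copy": 1.0,
    "brainstorming": 1.0,
}


def temperature_for(task: str, table: dict[str, float] = TASK_TEMPERATURE) -> float:
    """Look up a temperature for a task type and check it is within [0, 1]."""
    value = table[task]  # KeyError for unknown task types
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"temperature must be between 0 and 1, got {value}")
    return value


# Passed alongside the other arguments, for example:
# client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
#                        temperature=temperature_for("code"), messages=[...])
print(temperature_for("code"), temperature_for("brainstorming"))  # 0.0 1.0
```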

Note
temperature is available on Sonnet and Haiku. The most capable tier, the latest Opus models (such as claude-opus-4-8), manage randomness internally and do not accept this parameter; passing it raises an error. On this series’ default model, claude-sonnet-4-6, it works fine.

Where people commonly trip up

These are problems you often run into when working with messages.

  • The first message must be user — Putting assistant as the first item in messages raises an error. A conversation always starts with the user.
  • Claude does not remember what it just said — Usually you did not send the prior conversation. The API stores no state, so in an ongoing conversation you must include all the messages up to that point.
  • The system guidance does not take effect — Check that the role guidance is in the separate system parameter rather than inside a user message, and that you pass system on every call, since the API does not remember it between calls.
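Some of these mistakes can be caught before any request goes out. The function below is a hypothetical pre-flight check, a sketch that encodes only the rules described in this post: the list starts with a user message, every role is one of the two roles (user or assistant), and system guidance is not slipped in as a message with a "system" role. It also guards against an empty list, and it returns a list of problems instead of raising, so you can log or display them.

```python
def check_messages(messages: list[dict]) -> list[str]:
    """Return human-readable problems found in a messages list (hypothetical helper)."""
    if not messages:
        return ["messages is empty; there is nothing to send"]
    problems = []
    # Rule 1: a conversation starts with the user.
    if messages[0].get("role") != "user":
        problems.append("the first message must have role 'user'")
    # Rule 2: only the two roles belong in messages.
    for i, message in enumerate(messages):
        role = message.get("role")
        if role == "system":
            problems.append(f"message {i}: move system guidance to the system parameter")
        elif role not in ("user", "assistant"):
            problems.append(f"message {i}: unknown role {role!r}")
    return problems


bad = [
    {"role": "system", "content": "You are a Python instructor."},
    {"role": "user", "content": "Show me how to sort a list."},
]
for problem in check_messages(bad):
    print(problem)
```

Running it on the list above reports two problems: the list does not start with a user message, and message 0 belongs in the system parameter.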

Wrapping up

In this post we organized the structure of messages and the core parameters you pass with a call.

  • A conversation is a list of messages with a role, and since the API stores no state, multi-turn means resending the full history every time.
  • system is the role and rules that persist throughout; user is this turn’s request.
  • max_tokens is the cap on response length; check stop_reason for whether it was cut off.
  • temperature controls randomness in the answer. 0 is stable; higher is more varied.

In the next post, “LLM App Development #3: Streaming Responses in Real Time,” we will cover streaming: instead of waiting for the whole response to arrive at once, you display it on screen as it is generated. This greatly cuts the perceived wait for long answers.
