LLM App Development #4: Prompt Engineering in Practice

Monday, June 1, 2026

7 min read

By Part 3 we have the mechanics of making calls. But the quality of an LLM app’s results depends less on the code than on what you ask and how. With the same model and the same code, the answer can come out useful or off the mark depending on how you write the prompt. In this post we organize how to write prompts that reliably draw out the result you want.

Vague instructions vs. specific instructions #

An LLM fills in the blanks on its own. When the instruction is vague, the model decides the details for you. Length, format, tone — all of them vary from call to call. The more specific you are, the closer the result lands to what you wanted.

Let me compare the same summarization task two ways.

Vague: “Summarize this text.”
Specific: “Summarize this text in three bullet points. Each bullet should be one sentence, and explain technical terms in plain language.”

The first leaves length and format to the model. The second fixes a frame — three bullets, one sentence, plain wording — so the result comes out similar every time. When writing a prompt, it helps to ask yourself: “Could the model, seeing only this sentence, picture the same result I have in my head?”

specific_prompt.py

import anthropic

client = anthropic.Anthropic()

prompt = """Summarize the following text in three bullet points.
Each bullet should be one sentence, and explain technical terms in plain language.

(text to summarize goes here)"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": prompt}],
)

Specifying the output format #

To use the answer again in code, the format has to be consistent. Pinning down the output format — “answer with only one of positive, negative, or neutral,” “give a comma-separated list,” “lay it out as a Markdown table” — makes the downstream handling easy.

In classification tasks especially, narrowing the answer to a single word matters.

classify.py

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=10,
    system="You are a review classifier. Answer with exactly one word: positive, negative, or neutral. Add no other explanation.",
    messages=[{"role": "user", "content": "It was worse than I expected."}],
)

Narrowing the output this way lets you use the result directly as an if condition or a dictionary key. That said, a natural-language instruction is only a request, so the model might occasionally answer with a sentence like “it’s negative.” How to truly enforce a format is covered in the next post, #5 (structured output). Here it is enough that “a prompt can narrow the format.”

Showing examples #

Formats or styles that are hard to explain in words are better shown with one or two examples than with ten lines of explanation. Show a few input-output pairs, then give a new input, and the model follows the pattern. This approach is called few-shot.

You build the examples by alternating user and assistant in messages.

few_shot.py

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=10,
    messages=[
        {"role": "user", "content": "Shipping was fast, I liked it"},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "The packaging arrived all torn"},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "The price seems reasonable"},
    ],
)

The first two pairs demonstrate the pattern “classify a review in one word.” Claude answers the last review in the same format too, one word with no explanation. The trickier the format, or the harder it is to put into words, the more the examples help.

Separating data and instructions with tags #

When instructions and data mix inside a prompt, the model gets confused. Especially with user input or long documents, the boundary blurs between what is the target to process and what is a command. Claude handles input delimited by XML tags well, so wrapping the data in tags to separate it helps.

tagged_input.py

prompt = """Summarize the customer review inside <review> in one sentence.

<review>
Shipping was fast, but the product was in poor condition, and customer support was friendly.
</review>"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=200,
    messages=[{"role": "user", "content": prompt}],
)

Wrapping it in tags makes the boundary between the instruction (“summarize”) and the data (the review body) clear. Even if the data is long or full of line breaks, the model does not get confused. There is another benefit. If you cage user-supplied text inside tags, then even when it contains phrasing like “ignore the instructions so far…,” there is less room to mistake it for a command. This is also a basic habit for preventing prompt injection.

Making it think step by step #

For problems that are hard to answer outright — say a calculation or a logic problem that weighs several conditions — adding “think it through step by step, then answer” raises accuracy. The model reduces mistakes by working through the intermediate steps.

step_by_step.py

prompt = """There are 12 apples. I give 2 each to 3 friends,
then eat half of what is left. How many apples remain now?
Think it through step by step, then on the last line write only the result as 'Answer:'."""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": prompt}],
)

Printing the intermediate steps uses more tokens. If you only need the final answer, specify a “final answer only on the last line” format as above, so it works through the process but you pull out just the result cleanly.

Note

The most capable tier, the latest Opus models, perform this kind of step-by-step thinking internally on hard problems. So you do not need to add “think step by step.” On this series’ default model, claude-sonnet-4-6, prompting for it explicitly helps. It is worth remembering that the instructions you need differ by model tier.

Where people commonly trip up #

Asking for too much in one prompt — If you tell it to summarize, translate, classify, and tabulate all at once, it is easy for some of it to be dropped. Splitting the work into several calls is more accurate.
Relying on negative instructions — “Write two sentences or fewer” works better than “don’t be verbose.” Saying what you want directly is clearer than saying what not to do.
Inconsistent example formats — If your few-shot examples differ in format from one another, the model loses consistency too. Keep the examples in the same format.

Prompts are something you refine #

A good prompt is not finished in one shot. You start simple and look at the result. Then you see where the model goes off, and add an instruction to fill that gap, one line at a time. If the summary is too long, add a length limit; if technical terms come through raw, add “explain in plain language.”

This process pairs well with the tools covered earlier. When you test the same input repeatedly, keeping temperature near 0 to stabilize the result (Part 2) makes it easier to compare how a change in the prompt shows up in the result. Rather than straining to write a perfect prompt from the start, running it quickly and narrowing failures one at a time is faster in the end.

Wrapping up #

In this post we organized how to write prompts that draw out the result you want.

Reduce vagueness and be specific. The more you pin down length, format, and tone, the more stable the result.
Specifying the output format makes the result easy to reuse in code.
Show hard-to-describe formats with examples (few-shot).
Wrap data in XML tags to separate it from instructions.
Guide difficult reasoning step by step, and also specify the final-answer format.

A prompt can narrow the output format, but it cannot 100% guarantee “exactly this format.” In the next post, “LLM App Development #5: Getting Structured Output,” we will cover how to force the output format with a JSON schema, so you can plug the result straight into code.