LLM App Operations #4: Batching — Half Price for Non-Urgent Work
The savings up through Part 3 all happened inside real-time requests. This time we change the question: does this request really need an answer right now? Work like classifying documents piled up overnight, generating weekly reports, or scoring an evaluation set is perfectly fine getting results a few hours later. The Batches API takes that kind of work asynchronously, and in exchange gives a 50% discount on every input and output token. It is half price with no strings attached, so if you have work that qualifies, it is the easiest savings available.
What kind of work belongs in a batch
There is one criterion: latency requirements. If a person is waiting in front of a screen, it is real-time. Otherwise, it is a batch candidate.
| Keep real-time | Send to batch |
|---|---|
| Chatbot responses, agent tasks | Classification, summarization, embedding preprocessing of accumulated documents |
| The “Summarize” button a user just clicked | Nightly bulk reports, weekly digests |
| Query transformation at search time | Periodic scoring of the Advanced RAG Part 6 evaluation set |
| | Pre-labeling and tagging of new data |
Know the contract terms of batching as well. A batch can hold up to 100,000 requests. Most batches finish within an hour, but processing can take up to 24 hours, and any request still unprocessed at that point comes back as expired. Results are kept for 29 days. Because of that “usually fast, but worst case 24 hours” spread, work with a tight deadline is not batch material.
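The per-batch request limit means a large backlog has to be split before submission. The following is a minimal sketch of that splitting step; the helper name `chunked` and the constant `MAX_REQUESTS_PER_BATCH` are illustrative, not part of any SDK, and the sketch ignores any payload-size limit the API may also enforce.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

MAX_REQUESTS_PER_BATCH = 100_000  # per-batch request limit quoted in the text


def chunked(items: Iterable[T], size: int = MAX_REQUESTS_PER_BATCH) -> Iterator[list[T]]:
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    # islice pulls up to `size` items per pass; an empty list ends the loop.
    while chunk := list(islice(it, size)):
        yield chunk


# 250,000 pending items become three separate submissions.
sizes = [len(c) for c in chunked(range(250_000))]
print(sizes)  # [100000, 100000, 50000]
```

Each chunk would then be passed to its own batch submission, so no single batch exceeds the limit.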
Submit, wait, collect
The unit of a batch is a bundle of individual requests, each tagged with a custom_id. The flow has three steps: submit → poll → collect results.
```python
import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Step 1: submit. CLASSIFY_SYSTEM and pending_docs are defined elsewhere in the app.
batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id=f"doc-{doc.id}",  # the key to get results back
            params=MessageCreateParamsNonStreaming(
                model="claude-haiku-4-5",  # Part 2's routing applies here too
                max_tokens=256,
                system=CLASSIFY_SYSTEM,
                messages=[{"role": "user", "content": doc.text}],
            ),
        )
        for doc in pending_docs
    ]
)
print(batch.id, batch.processing_status)
```

```python
import time

# Step 2: poll until the whole batch has finished processing.
while True:
    b = client.messages.batches.retrieve(batch.id)
    if b.processing_status == "ended":
        break
    time.sleep(60)
```

```python
# Step 3: collect. save_classification and retry_later are the app's own helpers.
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        msg = result.result.message
        text = next((blk.text for blk in msg.content if blk.type == "text"), "")
        save_classification(result.custom_id, text)  # link back to the source via custom_id
    elif result.result.type == "errored":
        retry_later(result.custom_id)  # server errors are resubmission targets
    elif result.result.type == "expired":
        retry_later(result.custom_id)  # not processed before the batch expired
```

Three operational points.
- custom_id is your lifeline. Results come back in no particular order relative to submission. The custom_id is the only thread connecting a result to its source data, so generate it with a rule that safely produces the same ID on re-runs (based on a document ID, for example).
- Success and failure are per request. The batch does not succeed or fail as a whole; results come back as a mix of succeeded, errored, and expired. From day one, build the flow that picks out the failed items and resubmits them in the next batch.
- It stacks with the other savings. The request params are the same as a regular call, so Part 2’s model routing applies as-is, and when requests sharing the same system prompt are grouped together, Part 3’s caching kicks in too (cache hits inside a batch are best-effort). Haiku plus batching gets you to less than a tenth of the unit cost of real-time Opus.
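The per-request handling described in these points can be sketched as a small partitioning step. This is an illustration under assumptions: it only relies on each result exposing `custom_id` and `result.type`, as in the collection loop earlier, and it uses `SimpleNamespace` stand-ins instead of real SDK objects so it runs without an API key. The names `partition_results`, `ids_to_resubmit`, and `RETRYABLE` are hypothetical. Treating every errored item as retryable is a simplification; a request that failed because it was invalid will fail again, so a real pipeline would probably inspect the error and cap retries.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Iterable

RETRYABLE = {"errored", "expired"}  # assumption: both go into the next batch


def partition_results(results: Iterable[Any]) -> dict[str, list[str]]:
    """Group custom_ids by per-request result type."""
    groups: dict[str, list[str]] = defaultdict(list)
    for r in results:
        groups[r.result.type].append(r.custom_id)
    return dict(groups)


def ids_to_resubmit(groups: dict[str, list[str]]) -> list[str]:
    """Return the custom_ids that should be re-enqueued, in a stable order."""
    return sorted(cid for kind in RETRYABLE for cid in groups.get(kind, []))


def fake_result(custom_id: str, kind: str) -> SimpleNamespace:
    """Stand-in with the same two attributes the sketch reads (not an SDK type)."""
    return SimpleNamespace(custom_id=custom_id, result=SimpleNamespace(type=kind))


# Results arrive in arbitrary order; only custom_id ties them to the source.
demo = [
    fake_result("doc-3", "expired"),
    fake_result("doc-1", "succeeded"),
    fake_result("doc-2", "errored"),
]
groups = partition_results(demo)
print(ids_to_resubmit(groups))  # ['doc-2', 'doc-3']
```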
Operational pattern: put a queue in between
When you turn batching from a one-off script into an always-on operation, the standard shape is a queue.
```
Event occurs → enqueue into a job queue (table)
   └─ (cron, e.g. hourly) gather what has accumulated and submit a batch
        └─ (cron) collect finished batches, apply results, re-enqueue failures
```

The application only inserts a “classify this” row into a table; submission and collection are handled by periodic jobs. The advantage of this structure is natural buffering: even when traffic surges, it does not touch the real-time path’s limits (the topic of Part 5), and you can control batch sizes by how much you accumulate before each submission. If you also log the usage of batch results in the same shape as Part 1’s instrumentation wrapper, batch work shows up in your per-feature cost dashboard on the same basis.
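One way to picture the queue table is the following sketch, which uses an in-memory SQLite database. The schema, the status values, and the function names are all assumptions made for illustration, and the actual batch submission and collection calls are left out so that only the state transitions are shown.

```python
import sqlite3

# Hypothetical schema: one row per unit of work, keyed by its custom_id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    custom_id TEXT PRIMARY KEY,
    payload   TEXT NOT NULL,
    status    TEXT NOT NULL DEFAULT 'pending',
    batch_id  TEXT,
    result    TEXT
)
"""


def enqueue(db, custom_id, payload):
    """App side: insert a row. Re-enqueueing the same ID is a no-op."""
    db.execute(
        "INSERT OR IGNORE INTO jobs (custom_id, payload) VALUES (?, ?)",
        (custom_id, payload),
    )


def claim_pending(db, limit):
    """Submit cron: read up to `limit` pending rows, oldest first."""
    return db.execute(
        "SELECT custom_id, payload FROM jobs "
        "WHERE status = 'pending' ORDER BY rowid LIMIT ?",
        (limit,),
    ).fetchall()


def mark_submitted(db, custom_ids, batch_id):
    """Submit cron: record which batch each row went into."""
    db.executemany(
        "UPDATE jobs SET status = 'submitted', batch_id = ? WHERE custom_id = ?",
        [(batch_id, cid) for cid in custom_ids],
    )


def apply_result(db, custom_id, ok, result=None):
    """Collect cron: store a success, or put a failure back in the queue."""
    if ok:
        db.execute(
            "UPDATE jobs SET status = 'done', result = ? WHERE custom_id = ?",
            (result, custom_id),
        )
    else:
        db.execute(
            "UPDATE jobs SET status = 'pending', batch_id = NULL WHERE custom_id = ?",
            (custom_id,),
        )


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
for i in range(3):
    enqueue(db, f"doc-{i}", f"text {i}")
enqueue(db, "doc-0", "duplicate")  # ignored: same custom_id

rows = claim_pending(db, limit=2)  # batch size is controlled here
mark_submitted(db, [cid for cid, _ in rows], batch_id="batch-demo")
apply_result(db, "doc-0", ok=True, result="invoice")
apply_result(db, "doc-1", ok=False)  # failure goes back to pending

status = dict(db.execute("SELECT custom_id, status FROM jobs"))
print(status)
```

Keying the table by `custom_id` is a deliberate choice in this sketch: it makes enqueueing idempotent and lets the collection job update rows directly from the IDs in the batch results.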
Rethinking real-time as batch
Beyond moving work that is already batch-shaped, sometimes you rethink the product design itself to route work through the batch path. For example, if “summarize immediately on upload” can become “notify when the summary is ready,” that feature’s cost is cut in half and it becomes resilient to traffic spikes. Not every feature can do this, but when you look at the top cost features in your Part 1 logs, occasionally asking “does this really have to be synchronous?” costs nothing.
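To judge whether such a redesign is worth it for a given feature, a back-of-the-envelope calculation is usually enough. The sketch below applies the 50% batch discount to a feature's token volume. The function name `feature_cost` is hypothetical, and the prices and volumes in the example are placeholders rather than real list prices; substitute the numbers from your own logs and the current price sheet.

```python
def feature_cost(
    requests: int,
    in_tokens: int,
    out_tokens: int,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
    batch: bool = False,
) -> float:
    """Estimated spend for a feature; prices are per million tokens."""
    per_request = in_tokens * price_in_per_mtok + out_tokens * price_out_per_mtok
    total = requests * per_request / 1_000_000
    return total * 0.5 if batch else total  # batch path: 50% off all tokens


# Placeholder inputs: 200,000 requests, 2,000 input and 300 output tokens each,
# at made-up prices of 1.0 and 5.0 per million tokens.
realtime = feature_cost(200_000, 2_000, 300, 1.0, 5.0)
batched = feature_cost(200_000, 2_000, 300, 1.0, 5.0, batch=True)
print(realtime, batched)  # 700.0 350.0
```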
Where people commonly trip
- Putting deadline-bound work into a batch: a batch can take up to 24 hours, and anything still unprocessed at that point expires. If a job that “must be done by 9 AM” goes into a batch the night before, you will occasionally break the promise. If there is a deadline, build in slack or keep it real-time.
- Forgetting to collect results: if you submitted but the collection job is dead, you pay the cost and the results vanish after 29 days. Whether the collection cron is alive is itself a monitoring target.
- Making custom_id a sequence number: an ID that gets a different number on re-runs makes it impossible to link results back to their sources. Use IDs derived deterministically from the source data.
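For the last point, a deterministic ID can be derived by hashing a stable key from the source data. The following is a minimal sketch; the function name `make_custom_id`, the `task` prefix, and the character and length check are assumptions chosen to be conservative, so verify the actual custom_id constraints against the API reference. When the source already has a stable, short identifier, a plain prefix such as `doc-{doc.id}` serves the same purpose without hashing.

```python
import hashlib
import re

# Assumption: a conservative character set and length; check the API reference.
_ALLOWED = re.compile(r"^[A-Za-z0-9_-]{1,64}$")


def make_custom_id(task: str, source_key: str) -> str:
    """Derive the same ID for the same (task, source_key) on every run."""
    digest = hashlib.sha256(source_key.encode("utf-8")).hexdigest()[:32]
    custom_id = f"{task}-{digest}"
    if not _ALLOWED.match(custom_id):
        raise ValueError(f"custom_id not in the expected format: {custom_id!r}")
    return custom_id


first = make_custom_id("classify", "s3://bucket/reports/2024-05.pdf")
again = make_custom_id("classify", "s3://bucket/reports/2024-05.pdf")
other = make_custom_id("classify", "s3://bucket/reports/2024-06.pdf")
print(first == again, first == other)  # True False
```

Because a hashed ID is not readable on its own, the mapping from ID back to the source has to be stored somewhere, for example in the job table at enqueue time.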
Wrapping up
In this post we moved non-real-time work onto the half-price path.
- The criterion is latency requirements. Work no human is waiting on is a batch candidate, and every token gets a 50% discount.
- custom_id design, per-request success handling, and automated collection are the three pillars of batch operations.
- It stacks with routing and caching, and a queue-backed pipeline is the standard shape for always-on operation.
That completes the cost trilogy (routing, caching, batching). The next topic is not money but staying up. In the next post, “LLM App Operations #5: Reliability — Rate Limits, Retries, Fallbacks,” we build structures that hold up against limits and failures.