LLM App Operations #5: Reliability — Rate Limits, Retries, Fallbacks

If everything up to Part 4 was about money, this one is about not stopping. In LLM API operations, 429 (rate limit) and 529 (overloaded) are not outages; they are daily life. The busier your traffic, the more often you meet them. The goal of reliability design is to keep the user experience from collapsing in the face of these everyday rejections.

How rate limits work — what actually hits the limit #

A rate limit is not a single number. Requests per minute (RPM) and tokens per minute (input and output separately) are each enforced, and exceeding any one of them gets you a 429. Two corollaries matter in operations.

  • You can hit the limit on tokens even with few requests. Long-context RAG and agent requests can hit the token limit with just a few overlapping calls. This is why the instrumentation from Part 1 tracks request counts and token volume separately.
  • Limits are per model. The routing from Part 2 is also a reliability mechanism. Traffic going to haiku does not consume opus’s limit, so routing spreads load across separate limit pools.
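
To make the "few requests, many tokens" case concrete, here is a minimal client-side sketch that tracks both dimensions over a rolling 60-second window. The class name `MinuteBudget`, its parameters, and the limit values are hypothetical, and it tracks a single token count for brevity (the real limits separate input and output). The provider's actual accounting may differ, so treat this as an approximation for instrumentation, not a replica of server behavior.

```python
import time
from collections import deque


class MinuteBudget:
    """Tracks requests and tokens sent in the last 60 seconds (hypothetical helper)."""

    def __init__(self, rpm_limit: int, tpm_limit: int, clock=time.monotonic):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.clock = clock          # injectable so the window can be tested
        self.events = deque()       # one (timestamp, tokens) entry per request sent

    def _expire(self) -> None:
        # Drop entries that have aged out of the 60-second window.
        cutoff = self.clock() - 60.0
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def blocked_by(self, tokens: int):
        """Return "requests", "tokens", or None if the call fits the window."""
        self._expire()
        if len(self.events) + 1 > self.rpm_limit:
            return "requests"
        if sum(t for _, t in self.events) + tokens > self.tpm_limit:
            return "tokens"
        return None

    def record(self, tokens: int) -> None:
        self.events.append((self.clock(), tokens))
```

With a request limit of 50 and a token limit of 100,000, three 30,000-token calls already leave no room for a fourth, even though the request count is nowhere near its cap.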

A 429 response carries a retry-after header that says how many seconds to wait before trying again. Ignoring it and retrying immediately only produces more 429s and keeps you pinned at the limit.
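
If you write your own retry layer, the wait calculation might look like the sketch below. It assumes the response headers are available as a plain dict with lowercase keys and that retry-after holds a number of seconds; the function name `retry_delay` and the `base` and `cap` defaults are illustrative choices, not values from any SDK. When the header is missing or not numeric, it falls back to exponential backoff with full jitter.

```python
import random


def retry_delay(headers: dict, attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    raw = headers.get("retry-after")
    if raw is not None:
        try:
            # The server told us how long to wait, so honor it.
            return max(0.0, float(raw))
        except ValueError:
            pass  # not a plain number of seconds; fall through to backoff
    # No usable header: exponential backoff, capped, with full jitter.
    backoff = min(cap, base * (2 ** attempt))
    return random.uniform(0, backoff)
```

The jitter matters when many workers hit the limit at the same moment: without it they all wake up together and collide again.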

Retries — what the SDK gives you and what to handle yourself #

The fundamentals are exactly as we saw in AI Agent Development Part 1. The SDK automatically retries 429 and 5xx errors with exponential backoff (2 retries by default), adjustable via max_retries. From an operations perspective, there are three things to add.

retry_config.py
import anthropic

client = anthropic.Anthropic(
    max_retries=4,          # be generous on latency-tolerant paths like overnight batches
    timeout=60.0,           # the default (10 minutes) is too long for production; set per path
)

  • The retry budget differs per path. In a chatbot where a user is waiting, 4 retries (tens of seconds) are pointless. Keep real-time paths short (1-2 retries) and background paths long.
  • Distinguish the errors that must not be retried. 400 (bad request) and 401 (authentication) come back the same no matter how many times you resend. The SDK distinguishes these on its own, but when you layer your own retry logic on top, putting 4xx into the retry loop is a common mistake.
  • Retries are a cost too. If you log the retry count as in Part 1, you can catch the case where “cost went up but traffic stayed flat” and the culprit is a retry storm.
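
The three points above can be sketched as one small custom retry layer. Everything here is hypothetical scaffolding: the path names in `RETRY_BUDGET`, the `ApiError` class standing in for whatever your HTTP layer raises, and the backoff numbers. The shape is what matters: a per-path budget, a check that keeps 4xx out of the loop, and a retry count returned for logging.

```python
import time

# Hypothetical per-path budgets: short where a user waits, long in the background.
RETRY_BUDGET = {"chat": 1, "batch": 4}


class ApiError(Exception):
    """Stand-in for an HTTP error raised by your client layer (assumed shape)."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    # 429 and 5xx may succeed later; 400/401 and other 4xx will not.
    return status == 429 or status >= 500


def call_with_budget(path: str, send, sleep=time.sleep):
    """Call send() with the path's retry budget. Returns (result, retries_used)."""
    budget = RETRY_BUDGET[path]
    retries = 0
    while True:
        try:
            return send(), retries
        except ApiError as err:
            if not is_retryable(err.status) or retries >= budget:
                raise
            sleep(min(8.0, 0.5 * 2 ** retries))  # capped exponential backoff
            retries += 1
```

Logging `retries_used` alongside the token counts is what lets you tell a retry storm apart from real traffic growth.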

Timeouts and streaming — reliability for long responses #

LLM response time scales with output length, so tens of seconds is normal for long generations. Two kinds of accidents happen here: setting the timeout too short and cutting off a healthy response, or setting it too long and waiting endlessly on a dead connection. The standard fix is to switch long-response paths to streaming.

streaming_path.py
with client.messages.stream(
    model="claude-opus-4-8",
    max_tokens=16000,
    messages=messages,
) as stream:
    for text in stream.text_stream:
        push_to_user(text)            # visible to the user from the first token
    response = stream.get_final_message()

Streaming is usually introduced as a UX feature that reduces perceived latency (time to first token), but from an operations perspective it is a reliability feature. Because a liveness signal arrives with every token, the “is it dead or just slow” ambiguity disappears. The SDK’s requirement to stream requests with a large max_tokens exists for the same reason (the timeout risk of long silent connections). Make streaming the default on any path where output can run long.
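
One way to act on that liveness signal is an idle timeout: instead of a single deadline for the whole response, fail when the gap between chunks grows too long. The sketch below assumes the stream is exposed as an async iterator; the helper name `with_idle_timeout` is hypothetical, and the synchronous `text_stream` loop shown above would need a different mechanism, such as a read timeout on the HTTP client.

```python
import asyncio


async def with_idle_timeout(chunks, idle_seconds: float):
    """Yield items from an async iterator; raise asyncio.TimeoutError
    if the wait for the next item exceeds idle_seconds."""
    iterator = chunks.__aiter__()
    while True:
        try:
            # The clock restarts for every chunk, so a long but lively
            # response is never cut off.
            item = await asyncio.wait_for(iterator.__anext__(), timeout=idle_seconds)
        except StopAsyncIteration:
            return  # the stream ended normally
        yield item
```

A healthy 40-second generation passes through untouched, while a connection that goes silent is detected within `idle_seconds` rather than at the end of a long overall timeout.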

Fallbacks — the staged retreat when nothing else works #

For situations retries cannot solve (persistent 429s, 529s, outages), design the retreat staircase in advance. Try from the top and step down when it fails.

  1. Model downgrade — When the primary model is blocked, send the same request to a model one tier smaller. Quality drops a little, but the service continues. In effect, you add a fallback column to the routing table from Part 2.
  2. Queuing — For requests that are less time-sensitive, accept them with “this will be processed shortly” and put them on a queue. The batch pipeline from Part 4 becomes your buffer as is.
  3. Graceful failure — If it ultimately fails, fail fast and clearly. “We’re receiving a lot of requests right now, please try again shortly” is a better experience than a 30-second spinner.
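
fallback.py
import logging

import anthropic

logger = logging.getLogger(__name__)

# Each model maps to the next tier down; the smallest tier has no fallback.
FALLBACK = {"claude-opus-4-8": "claude-sonnet-4-6",
            "claude-sonnet-4-6": "claude-haiku-4-5"}

def call_with_fallback(model: str, **kwargs):
    # call_llm is the application's own API wrapper, defined elsewhere.
    try:
        return call_llm(model=model, **kwargs)
    except (anthropic.RateLimitError, anthropic.InternalServerError):
        fallback = FALLBACK.get(model)
        if fallback is None:
            raise                      # nothing left to step down to
        logger.warning("fallback: %s -> %s", model, fallback)
        return call_llm(model=fallback, **kwargs)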

Logging fallback activations matters. A fallback is symptom relief, not a cure, so frequent activations are a signal to address the root cause (requesting a limit increase, spreading traffic, cutting tokens with caching).
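
The full staircase, including the queuing and graceful-failure steps, could be sketched as follows. All names here are illustrative: `Unavailable` stands in for whatever errors remain after retries are exhausted, `call` is your own API wrapper, and `queue` is any object with `append` (in practice, the batch pipeline's intake).

```python
class Unavailable(Exception):
    """Stand-in for a persistent 429/529 after retries are exhausted (assumed)."""


BUSY_MESSAGE = "We're receiving a lot of requests right now, please try again shortly."


def serve(request, models, call, queue, deferrable: bool) -> dict:
    # Step 1: model downgrade. Try each tier from the top.
    for model in models:
        try:
            return {"status": "done", "model": model, "text": call(model, request)}
        except Unavailable:
            continue
    # Step 2: queuing. Only for requests that can wait.
    if deferrable:
        queue.append(request)
        return {"status": "queued"}
    # Step 3: graceful failure. Fail fast with a clear message.
    return {"status": "failed", "message": BUSY_MESSAGE}
```

Returning the model that actually served the request makes it straightforward to count fallback activations downstream.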

The side that avoids creating load — concurrency limits #

The last piece points the other way. It is about controlling the rate at which you send, not what you receive. If a traffic spike translates into an API call spike one-for-one, you ram into the limit yourself. Put a concurrency cap (a semaphore) on the call path, and the spike briefly waits in line while 429s on the API side go down. This is especially needed for code that generates concurrent calls internally, like the parallel subagents in Part 5 of the agent series or batch evaluation runs. Traffic that flows steadily within the limit also achieves higher total throughput than traffic that surges and stalls in cycles.
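
A minimal sketch of such a cap with asyncio, assuming each job is a zero-argument coroutine function that performs one API call. The helper name `run_capped` and the limit value are hypothetical; a suitable limit depends on your rate limits and typical request size.

```python
import asyncio


async def run_capped(jobs, limit: int):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    gate = asyncio.Semaphore(limit)

    async def guarded(job):
        # Jobs beyond the cap wait here instead of hitting the API.
        async with gate:
            return await job()

    # gather returns results in the same order as the input jobs.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

A burst of 100 evaluation calls with `limit=8` then reaches the API as a steady stream of at most 8 concurrent requests rather than a single spike.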

Where people commonly trip up #

  • Treating 429 as an outage — Rate limits are something you design for, not something you page on. But do watch the trend of the occurrence rate. A rising trend is a signal to raise limits or do reduction work.
  • Using the same timeout on every path — Classification (done in a second) and long generation (30 seconds is normal) cannot share a timeout. Put per-path timeouts in the routing table from Part 2 as well.
  • Building a fallback and forgetting it — Running quietly downgraded for weeks makes degraded quality the default. Put the fallback activation rate on a dashboard.

Wrapping up #

In this post we built a structure that does not stop.

  • Rate limits are enforced separately on request counts and token counts. Leave retry-after-respecting retries to the SDK, and set the retry budget per path.
  • On long-response paths, streaming is a reliability mechanism. Set timeouts per path as well.
  • Design fallbacks as a staircase of model downgrade → queuing → graceful failure, and monitor the activation rate. Concurrency limits on the sending side reduce limit collisions in the first place.

With cost and reliability in place, the remaining threats come from outside. In the next post, “LLM App Operations #6: Security — Prompt Injection and Data Boundaries,” we cover attempts to steer your app through its inputs.
