Chapter 12

Health Checks

This chapter walks through how Kubernetes decides whether a container is alive and ready to receive traffic. It covers the division of roles among the three probes (liveness, readiness, startup); the httpGet, tcpSocket, and exec check methods; tuning parameters such as initialDelaySeconds, periodSeconds, and failureThreshold; the cascading failure that follows from putting an external dependency in liveness; and graceful shutdown with terminationGracePeriodSeconds and the preStop hook.

Through Chapter 11, resources.requests / limits, the focus was on how much resource to give a Pod. With CPU and memory requests and limits, the scheduler and cgroups determine the conditions under which that Pod runs. But having enough resources is a different question from whether the container is actually doing work. The process may be running yet deadlocked inside, and a container that has just started may not have filled its DB connection pool yet, so it should not receive traffic. This chapter walks through how Kubernetes answers those two questions, “is it alive?” and “is it ready to receive traffic?”, and the three probe types behind those answers.

By the end of this chapter, you should be able to look at the three probe blocks in a Pod manifest, along with terminationGracePeriodSeconds and preStop, and state in one line which operational failure each of them prevents.

Why split into three probes

When you first meet probes, a natural question arises: “Isn’t it enough to check only whether the container is alive?” Operationally the question is not that simple, because two different meanings are packed into the single word “alive.” A process can be up and healthy at the OS level while it has not yet finished warming its cache, so requests sent to it would fail and users would typically see 502s. For that container, the answer to “should it be restarted?” is “no,” while the answer to “is it OK to send traffic?” is “not yet.” The two answers differ.

Kubernetes separates those two answers into different checks: liveness and readiness. It adds one more layer, startup, for apps that are slow to start. The roles of the three probes are summarized below.

Probe	The question it asks	K8s’s action on failure	Scope of effect
liveness	Is this container alive	Restart that container	That one container
readiness	Is this Pod ready to receive traffic	Remove that Pod from the Service Endpoints	Traffic routing
startup	Has this container finished starting	Terminate that container (and restart it per restartPolicy)	The container startup phase

The decisive difference among the three probes is what a failure causes. A liveness failure restarts the container, a readiness failure cuts off traffic, and a startup failure kills the container before liveness and readiness checks have even begun. If you write a manifest without understanding that difference, you will quickly run into incidents such as “the container is alive but 502s show up” or “a healthy app falls into an infinite restart loop.”

A container restart and a Pod recreation are different things

Let’s clear up one common source of confusion. A liveness failure results in a container restart, not a Pod recreation. The Pod stays in place, and only the container inside it is terminated and started again within the same Pod. The signal is the RESTARTS column of kubectl get pods climbing from 1 to 2 to 3. The Pod itself does not move to another node or receive a new IP. A readiness failure, by contrast, does not touch the container: it keeps running, but the Pod is removed from the Endpoints list described in Chapter 5, Service, so traffic does not reach it.

The three check methods: httpGet / tcpSocket / exec

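All three probes can use any one of the same three check methods. Each suits a different scenario and has a different cost. (Kubernetes also has a built-in grpc method, stable since v1.27; this chapter focuses on the three below.)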

Method	Behavior	Suitable workload	Cost
httpGet	HTTP GET to the specified path / port. Success if the response is 200 ~ 399	HTTP servers (most web · API)	Low
tcpSocket	Attempts a TCP connection to the specified port. Success if connected	Non-HTTP servers (DB, some gRPC, Redis)	Very low
exec	Runs a command inside the container. Success if exit 0	Workloads needing a check via an arbitrary script	High (forks a new process)

httpGet: the most common choice

For most web · API servers, httpGet is the first candidate.

httpGet probe — excerpt

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
      - name: X-Probe
        value: kubelet
  initialDelaySeconds: 10
  periodSeconds: 10

The path /healthz is a convention in the Kubernetes ecosystem; names like /health, /healthz, /ping, and /-/healthy also show up in project code. A response code from 200 to 399 counts as a success; any other code, such as 4xx or 5xx, counts as a failure. The response body is not inspected.

The advantage of httpGet is that the app code can express its own state directly. It can distinguish between meanings such as “the process is up,” “the DB connection pool is healthy,” and “the cache is warm” by returning 200 or 503.

tcpSocket: just needs the port open

For non-HTTP servers, tcpSocket is the natural choice.

tcpSocket probe — excerpt

readinessProbe:
  tcpSocket:
    port: 5432
  initialDelaySeconds: 5
  periodSeconds: 10

Non-HTTP servers like PostgreSQL, MySQL, and Redis are common targets. Kubernetes attempts a TCP three-way handshake to that port: success means OK, failure means not OK. However, a successful TCP connection does not mean the server can actually process queries. For example, PostgreSQL may already be listening even though startup is not finished. For a database workload’s readiness, running a command such as pg_isready with exec is more accurate than using tcpSocket.

exec: a check via an arbitrary command

When a check can only be expressed as a command, use exec.

exec probe — excerpt

readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pg_isready -h 127.0.0.1 -p 5432
  initialDelaySeconds: 5
  periodSeconds: 10

exec forks a new process inside the container, runs the command, and succeeds if the exit code is 0. It is the most flexible option, but also the most expensive. Forking is not free, and if the command goes through sh and then launches a client binary, every check becomes heavier. Even a check that runs once a minute can add up when hundreds of containers are involved. The operational rule of thumb is to prefer httpGet first, and fall back to tcpSocket or exec only when necessary.
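
As a sketch of trimming that cost, the exec probe from the excerpt above can call the client binary directly instead of going through /bin/sh, which saves one process per check. This assumes pg_isready is on the container’s PATH; the timeoutSeconds value is illustrative.

```yaml
readinessProbe:
  exec:
    # Each list item is one argv entry, so no shell is started.
    command:
      - pg_isready
      - -h
      - 127.0.0.1
      - -p
      - "5432"          # quoted: command entries must be strings
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3     # illustrative value
```

The shell form is still needed when the check relies on pipes, variables, or several commands chained together.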

Common parameters: time and thresholds

The three probes share the same time parameters. We organize them in one table.

Field	Meaning	Default
`initialDelaySeconds`	The time to wait after the container starts until the first check	0
`periodSeconds`	The check interval	10
`timeoutSeconds`	The upper bound of time one check waits for a response	1
`failureThreshold`	How many consecutive failures count as a final failure	3
`successThreshold`	How many consecutive successes count as a final success (fixed at 1 for liveness / startup)	1

These five values largely determine a probe’s timing. For example, with periodSeconds: 10 and failureThreshold: 3, three consecutive checks spaced 10 seconds apart must fail, so failures have to persist for roughly 20 to 30 seconds before Kubernetes treats the probe as failed. timeoutSeconds: 1 means that if one check does not get a response within 1 second, that round counts as a failure.

The defaults are often too aggressive to use in operations. In particular, timeoutSeconds: 1 can fail when a GC runs a little long or the node’s load briefly spikes. If that default is used for liveness, a temporary response delay can trigger a container restart. The CPU throttling described in Chapter 11 also increases response latency at the same time, so the resource model and probe parameters interact. In operational manifests it is usually safer to raise timeoutSeconds to 3 ~ 5 seconds and set failureThreshold to around 3 ~ 5 as well.
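
To make the interaction of these values concrete, here is a sketch of a liveness probe with the relaxed timeout suggested above. The arithmetic in the comments is an approximation that assumes checks fire exactly every periodSeconds and a hung app makes each check run into its timeout.

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10      # one check every 10 s
  timeoutSeconds: 3      # a check slower than 3 s counts as a failure
  failureThreshold: 3    # 3 consecutive failures trigger the restart
  # Rough time from "app hangs" to "restart is triggered":
  #   hang just before a check: (3 - 1) x 10 s + 3 s = 23 s
  #   hang just after a check:   3 x 10 s + 3 s      = 33 s
```

Raising failureThreshold to 5 with the same period would stretch that window to roughly 43 to 53 seconds, which is the trade-off between tolerance for brief stalls and detection speed.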

liveness probe: is it alive

The role of the liveness probe is to find a container that is dead but pretending not to be. Typical targets are states where the process is up but deadlocked and cannot respond to any request, or where a memory leak has pushed response time toward infinity. When liveness fails, Kubernetes terminates the container (SIGTERM first, then SIGKILL if it does not exit in time) and brings it back up according to the Pod’s restartPolicy. Because restartPolicy defaults to Always, and Always is the only value a Deployment’s Pod template accepts, automatic restart applies to nearly all workloads.

liveness probe — Deployment excerpt

spec:
  template:
    spec:
      containers:
        - name: web
          image: myapp:1.4.0
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3

The meaning of this manifest is as follows.

For 30 seconds after the container starts, it doesn’t check (initialDelaySeconds).
After that, it calls /healthz every 10 seconds (periodSeconds).
If one call doesn’t respond within 3 seconds, that round is treated as a failure (timeoutSeconds).
After 3 consecutive failures (failureThreshold), it sees a liveness failure and restarts the container.

When a check truly fails and the container is restarted, traces remain in the events of kubectl describe pod and the RESTARTS count of kubectl get pods.

after a liveness failure — kubectl describe pod

Events:
  Type     Reason     Age   From     Message
  ----     ------     ----  ----     -------
  Warning  Unhealthy  2m    kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
  Normal   Killing    2m    kubelet  Container web failed liveness probe, will be restarted
  Normal   Pulled     2m    kubelet  Container image "myapp:1.4.0" already present on machine
  Normal   Created    2m    kubelet  Created container web
  Normal   Started    2m    kubelet  Started container web

The event pair Liveness probe failed followed immediately by Container ... will be restarted is the standard pattern. If you see that trace often, you should suspect the liveness probe — you need to determine whether the container is really failing often, or whether the probe is too aggressive and is killing a healthy container. The full diagnostic tree is organized in Chapter 27, kubectl debugging patterns.

What to put in liveness

This is the part where most operational mistakes happen. To state the conclusion first: the liveness probe should look only at the process’s own state. You must not put external dependencies (DB, cache, other microservices) in liveness.

The reason is cascading failure. If the DB goes down briefly and all app containers fail liveness at the same time, they can all restart at once; even after the DB recovers, the apps may take a long time to come back. In a worse case, the restarted app still cannot reach the DB, liveness fails again, and the restart loop continues forever. Use liveness for the container’s internal state only, and reserve external dependencies for readiness — it is safer to keep that separation from the start.

The /healthz endpoint usually checks only this much:

The app process can produce a response (it reached the HTTP handler).
Its own internal deadlock detection is OK.

The operational standard is to never put a DB ping or an external service call in this endpoint.

readiness probe: is it ready to receive traffic

The readiness probe is the gate for traffic routing. Unlike liveness, a readiness failure does not kill the container; instead, the Pod is removed from the Service’s Endpoints list. As a result, no new requests reach that Pod.

readiness probe — Deployment excerpt

spec:
  template:
    spec:
      containers:
        - name: web
          image: myapp:1.4.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
            successThreshold: 1

It is common to keep /readyz as a separate endpoint from /healthz, because the two endpoints look at different things.

/healthz (liveness) — its own process’s state only
/readyz (readiness) — its own process + DB ping + cache connection + the state of the external services it depends on

A Pod that fails readiness is not killed; it stays up, and only its traffic is cut off for a while. While the DB connection cannot be established, readiness is false and no traffic arrives; when the DB recovers, readiness returns to true and traffic flows in again. The model absorbs a temporary failure without restarting the container.
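
A minimal sketch of that separation in a manifest, assuming the app exposes both /healthz and /readyz on port 8080 as in the excerpts above. The timing values are illustrative, not recommendations.

```yaml
containers:
  - name: web
    image: myapp:1.4.0
    ports:
      - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz   # own process only; failure restarts the container
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /readyz    # also checks dependencies; failure only stops traffic
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
```

With this shape, a DB outage shows up as READY 0/1 with RESTARTS unchanged, rather than as a wave of container restarts.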

What removal from Endpoints looks like

Let’s look briefly at how Endpoints (or its successor object, EndpointSlice) changes when a readiness failure happens.

Service and Endpoints

kubectl get svc web
kubectl get endpoints web

when readiness is all normal

NAME   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
web    ClusterIP   10.96.123.45    <none>        80/TCP    1d

NAME   ENDPOINTS                                       AGE
web    10.244.1.10:8080,10.244.1.11:8080,10.244.2.5:8080   1d

Having all three Pod IPs in Endpoints is the normal state. When one Pod fails readiness, only its IP is removed from Endpoints.

when one Pod's readiness is false

NAME   ENDPOINTS                                       AGE
web    10.244.1.10:8080,10.244.2.5:8080                1d

The Service no longer sends traffic to that Pod. In kubectl get pods, the Pod shows READY 0/1 while its STATUS is still Running.

kubectl get pods — on readiness failure

NAME           READY   STATUS    RESTARTS   AGE
web-7c4d-aa1   1/1     Running   0          1d
web-7c4d-bb2   0/1     Running   0          1d
web-7c4d-cc3   1/1     Running   0          1d

The 0/1 in the READY column is the key: the Pod has one container, and zero of them are ready. RESTARTS does not increase in this state.

When there are several containers in a Pod

If a Pod has several containers and one of them fails readiness, the whole Pod becomes not ready and is removed from Endpoints. Even if two containers are fine, one container failing readiness is enough to stop all traffic to the Pod. This is intended behavior: the Pod is the unit of routing in K8s, and if one part of it is not ready, it is safer not to send traffic to it at all.
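
As an illustration, here is a two-container Pod spec in which each container carries its own readiness probe. The log-forwarder sidecar, its image, and its port are hypothetical names chosen for the example.

```yaml
spec:
  containers:
    - name: web
      image: myapp:1.4.0
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8080
    - name: log-forwarder               # hypothetical sidecar
      image: example/log-forwarder:1.0  # hypothetical image
      readinessProbe:
        tcpSocket:
          port: 24224                   # hypothetical port
# If either probe fails, the Pod shows READY 1/2 and leaves Endpoints,
# even though the other container is healthy.
```

This is worth keeping in mind when adding a readiness probe to a sidecar: a flaky sidecar check takes the main container out of rotation too.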

startup probe: the guardian of apps slow to start

The third probe, startup, is a relatively recent addition: it was introduced as alpha in v1.16, became beta in v1.18, and has been stable since v1.20. The problem it solves is clear: apps that are slow to start.

Workloads such as Java / Spring Boot, Rails, or services that load a large ML model into memory commonly take more than 60 seconds to start. Consider what happens if such an app has only a liveness probe and no startup probe. Suppose the app needs 60 seconds to start but liveness has initialDelaySeconds: 10. From the 10-second mark K8s calls /healthz; the app cannot respond yet, so failures accumulate and the container is killed before it ever finishes starting. K8s brings it back up, the same thing happens again, and the container falls into an endless restart loop.

The workaround is to set initialDelaySeconds to something large, such as 90 or 120 seconds, but that creates a new problem. The delay is a fixed blind spot after every container start, restarts included: if the app hangs or deadlocks within those first 90 seconds, nothing detects it, and the value has to be sized for the slowest start you ever expect. Stretching initialDelaySeconds to cover startup time is paid for with slower failure detection.

The startup probe separates the two phases cleanly. Until startup succeeds, liveness and readiness are not run; once startup succeeds, the startup probe never runs again and liveness / readiness run on their normal cycle.

startup probe — excerpt

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30

The manifest above allows up to 5 minutes (10 seconds × 30 rounds) for startup. If /healthz returns 200 even once within those 5 minutes, startup is considered successful and the startup probe stops running; from then on liveness / readiness run on their usual cycle. If it has not succeeded once by the time 5 minutes have passed, the startup probe has failed: the container is terminated and brought back up per restartPolicy.

The key formula is simple: failureThreshold × periodSeconds is the maximum time allowed for startup. If Spring Boot averages 60 seconds and occasionally takes 90 seconds, a setting such as failureThreshold: 12 × periodSeconds: 10 = 120 seconds is common.
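
Written out as a manifest, the 120-second budget from the formula could look like the sketch below. It pairs the startup probe with a liveness probe that needs no large initialDelaySeconds; the liveness timing values are illustrative.

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 12   # 12 x 10 s = up to 120 s to start
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  # No large initialDelaySeconds needed: liveness begins only after
  # the startup probe has succeeded once.
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
```

An instance that starts in 40 seconds is covered by liveness from that point on, instead of waiting out a fixed delay sized for the slowest start.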

Three common operational accident scenarios

With the three-probe model in hand, here are three accidents that come up often in operations. Avoiding just these three removes a large share of health-check-related incidents.

Accident 1: liveness only, no readiness

This is the most common first accident. If the manifest has only a liveness probe and no readiness probe, K8s considers the Pod ready as soon as its container starts. The Pod is added to the Service’s Endpoints immediately and traffic arrives.

The trouble is the moment right after the container starts. The process is up and listening, but the DB connection pool is not filled yet or the cache has not been preloaded. Requests that arrive at that moment fail with 502s and users are affected. If you see a short burst of 502s on every rolling update, suspect a missing readiness probe first.

The fix is simple: add a readiness probe, and have its /readyz endpoint check the DB ping and cache state and answer 200 or 503 accordingly. The Pod then stays out of Endpoints until it is truly ready to receive traffic.

Accident 2: liveness too aggressive

The second accident lies in the liveness parameters themselves. If you go to production with the default timeoutSeconds: 1, the health check fails to answer within 1 second whenever the DB briefly slows down or a GC pause runs long. After 3 consecutive failures the container is restarted, and the freshly restarted container hits another long GC, answers late again, and fails again.

Once this cycle starts it is hard to stop; the same pattern repeats until an operator raises timeoutSeconds in the manifest. It is safer to start a production liveness probe at something like timeoutSeconds: 3 ~ 5 and failureThreshold: 3 ~ 5.

Accident 3: putting a DB ping in liveness

The third accident happens when a manifest is written without the separation of roles in mind. If /healthz includes a DB ping and returns 503 when it fails, then even a momentary DB outage makes every app container fail liveness at the same time and restart at the same time.

Even if the DB recovers in 30 seconds, the apps cannot come back for a good while: for an app whose own startup takes 30 seconds or more, the 503s last that much longer. Worse, if the DB wobbles again while the apps are coming back, they die again, and the restarts cascade.

The rule fits in one line: liveness checks the process itself, readiness checks external dependencies. External dependencies such as a DB, a cache, or other microservices belong in readiness. When the DB goes down, readiness becomes false and traffic is cut off; when the DB recovers, readiness returns to true and traffic flows again. Because the container never dies, no cascading failure occurs.

Probes and graceful shutdown

Graceful shutdown adds one more layer on top of the probe model. To keep in-flight requests from turning into 502s when a Pod is terminated, traffic has to be cut off first and the container killed afterwards. K8s carries out termination in the following steps.

1. The Pod enters the Terminating state.
2. K8s removes that Pod’s IP from Endpoints (traffic cutoff begins).
3. At the same time, it sends SIGTERM to the container.
4. It waits up to terminationGracePeriodSeconds (default 30 seconds) for the container to exit.
5. If the container has not exited by then, it is force-terminated with SIGKILL.

The subtle part is that steps 2 and 3 happen almost simultaneously. The Endpoints update takes time to propagate through the control plane to kube-proxy on each node, whereas SIGTERM arrives immediately. This opens a time window in which the container has already begun shutting down but the Endpoints update has not fully spread, so a few last requests are still routed to the Pod. Those requests hit a terminating container and end up as 502s.

Filling the time window with the preStop hook

The tool that closes this gap is the lifecycle.preStop hook. K8s runs it before sending SIGTERM, and the usual content is a short sleep that buys time for the Endpoints update to finish spreading.

preStop hook — excerpt

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: web
          image: myapp:1.4.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]

The flow of the manifest above is as follows.

1. The Pod begins terminating, and K8s removes it from Endpoints.
2. K8s runs the preStop hook, which sleeps for 10 seconds.
3. During those 10 seconds the Endpoints update spreads across the whole cluster, so no new traffic comes in.
4. When preStop ends, K8s sends SIGTERM to the container.
5. The container finishes its in-flight requests and terminates cleanly.
6. If termination does not finish, SIGKILL follows once terminationGracePeriodSeconds (60 seconds) has elapsed.

terminationGracePeriodSeconds includes the preStop time as well. In the example above, 10 of the 60 seconds go to preStop and the remaining 50 are left for shutdown after SIGTERM. If you lengthen preStop to 20 seconds, the time after SIGTERM shrinks to 40 seconds, so the two values have to be adjusted together.
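
A sketch of adjusting the two values together: if preStop grows to 20 seconds and the app should still have 50 seconds after SIGTERM, the grace period has to grow to 70. As in the excerpt above, this assumes the image contains /bin/sh and sleep.

```yaml
spec:
  template:
    spec:
      # 20 s preStop + 50 s for the app after SIGTERM = 70 s in total
      terminationGracePeriodSeconds: 70
      containers:
        - name: web
          image: myapp:1.4.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 20"]
```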

The full operational procedure for safe termination during a node upgrade, including PodDisruptionBudget, is covered in Chapter 30, Upgrade strategy.

Having the app handle SIGTERM directly

There is another way to get a similar effect: write the app so that its readiness endpoint starts failing once it receives SIGTERM. As soon as SIGTERM arrives, /readyz begins returning 503, and at the next readiness check K8s takes the Pod out of Endpoints. For this to help, the app must keep serving for a short while after SIGTERM instead of closing its listener at once; during that time it finishes its in-flight requests and then exits.

This approach can give a clean graceful shutdown even without a preStop hook. It rests on one premise, though: the app’s SIGTERM handler must actually run. If the container’s PID 1 ignores SIGTERM or does not pass it on, the readiness-false handling never happens. Running the image’s ENTRYPOINT through an init tool such as tini or dumb-init prevents this problem.
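
The init-tool pattern can also be expressed in the Pod spec instead of the Dockerfile. The sketch below assumes the image already ships tini at /usr/bin/tini and a runnable jar at /app/app.jar; both paths are hypothetical and depend on how the image was built.

```yaml
containers:
  - name: web
    image: myapp:1.4.0
    # Hypothetical paths: adjust to where the image actually keeps
    # tini and the application.
    # tini runs as PID 1 and forwards SIGTERM to the java process.
    command: ["/usr/bin/tini", "--"]
    args: ["java", "-jar", "/app/app.jar"]
```

Setting command here overrides the image’s ENTRYPOINT, so this is only appropriate when the original entrypoint does nothing else that the container depends on.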

A combined manifest

Here’s an example that gathers the three probes and graceful shutdown into one manifest. It assumes a Java Spring Boot app.

full-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-api
  template:
    metadata:
      labels:
        app: order-api
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: order-api
          image: myorg/order-api:2.3.0
          ports:
            - name: http
              containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              memory: "1Gi"
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 10
            failureThreshold: 18
            timeoutSeconds: 3
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
            successThreshold: 1
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]

The intent behind each part of the manifest is as follows.

startup probe: allows up to 180 seconds (10 seconds × 18 rounds) for startup, which is Spring Boot’s average startup time plus headroom.
liveness probe: runs only after startup succeeds. /actuator/health/liveness looks only at the process’s own state (no DB ping).
readiness probe: /actuator/health/readiness looks at the DB connection pool and external dependencies, provided the readiness health group is configured to include them (by default it reports only the application’s own readiness state). If the DB goes down briefly, readiness becomes false and traffic is cut off, while the container stays alive.
preStop sleep 10 seconds + terminationGracePeriodSeconds 60 seconds: leaves a large enough time window for graceful shutdown.

Because the actuator in Spring Boot 2.3 and later provides separate liveness and readiness endpoints out of the box, settings like these are relatively easy to apply. In other frameworks too, splitting the check into two endpoints at the code level (own state / external dependencies) is the operational standard.
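
On the application side, a minimal sketch of the matching Spring Boot configuration might look like this. It assumes the app has a DataSource, so that the db health indicator exists, and uses the property names from recent Spring Boot versions; check them against the version you run.

```yaml
# application.yaml (sketch, not taken from the order-api project)
management:
  endpoint:
    health:
      probes:
        enabled: true      # expose /actuator/health/liveness and /readiness
      group:
        readiness:
          # readinessState is the app's own state; db adds the DataSource check
          include: readinessState,db
```

The liveness group is left at its default on purpose, so that a DB outage affects readiness only.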

Docker HEALTHCHECK and K8s probes

Let’s sort out how Docker’s HEALTHCHECK instruction relates to K8s probes.

HEALTHCHECK in a Dockerfile

HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost:8080/healthz || exit 1

With this instruction, when you start a container directly with docker run, the Docker daemon runs the check. A (healthy) / (unhealthy) marker appears in the STATUS column of docker ps, and features such as Docker Compose’s depends_on.condition: service_healthy read the same value.

K8s ignores HEALTHCHECK. It looks only at the livenessProbe / readinessProbe / startupProbe in the Pod manifest. If you run the same image on K8s, the Dockerfile’s HEALTHCHECK has no effect, and the probes must be written separately in the manifest. The two models look alike but operate at different layers: Docker checks a single container it runs itself, while K8s checks its own workloads.

If the image is used in both environments, it is fine to define a check with the same intent in both places, as long as it is clear that the probe in the K8s manifest is the one that actually runs on K8s. A more detailed mapping between docker-compose’s healthcheck and K8s probes is given in Appendix A, From docker-compose to k8s.
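
For comparison, the Docker-side check can also be declared in a Compose file rather than in the Dockerfile. In the sketch below the worker service and its image are hypothetical, and the check assumes curl is installed in the web image, as the HEALTHCHECK example above does.

```yaml
# docker-compose.yaml (sketch)
services:
  web:
    image: myapp:1.4.0
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
      interval: 30s
      timeout: 3s
      retries: 3
  worker:
    image: example/worker:1.0      # hypothetical service
    depends_on:
      web:
        condition: service_healthy # waits for the healthcheck above
```

None of this carries over to K8s: interval, timeout, and retries correspond roughly to periodSeconds, timeoutSeconds, and failureThreshold, but they have to be restated as probes in the Pod manifest.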

Exercises

1. Using the full-deployment.yaml from the main text, write up as notes a scenario in which you deliberately make /actuator/health/liveness return 503. List in time order the steps K8s goes through (probe failure count → Killing → Pulled → Created → Started) and how the RESTARTS count changes along the way, and check the result against the model in the liveness probe section.
2. Compare two manifests: one that deliberately puts a DB ping in liveness, and one that puts the same DB ping in readiness only. For a scenario in which the DB goes down for 30 seconds and then recovers, describe in one paragraph how each behaves (cascading failure vs. a temporary traffic cutoff only), and note how this relates to Accident 3.
3. Starting from this chapter’s terminationGracePeriodSeconds: 60 + preStop: sleep 10 combination, use the time model from the preStop hook section to reason about what happens when preStop is changed to sleep 70. Describe in one paragraph how much time the container has left to finish its own work after SIGTERM, and how terminationGracePeriodSeconds should be adjusted along with it.

In one line: liveness asks “is it alive,” and a failure restarts the container; readiness asks “is it ready for traffic,” and a failure removes the Pod from Endpoints; startup is the guardian of apps that are slow to start. Own process only for liveness, external dependencies for readiness: this separation prevents cascading failure. terminationGracePeriodSeconds and the preStop hook create the time window for graceful shutdown, and K8s ignores Docker’s HEALTHCHECK.

Next chapter

So far we have covered a Pod’s resource model (Chapter 11) and its health checks (this chapter). A number like replicas: 3 was written into the manifest by hand. But traffic on a production cluster swings widely by time of day and day of week, and having a person adjust replicas by hand each time is not sustainable.

Chapter 13, Autoscaling, covers the three mechanisms that fill that gap. The HPA (Horizontal Pod Autoscaler) is the controller that automatically raises and lowers replicas based on CPU, memory, or custom metrics. The VPA (Vertical Pod Autoscaler) works along a different axis and automatically adjusts a Pod’s requests / limits. The Cluster Autoscaler sits one level higher and adds nodes when there are not enough for Pods to be scheduled on. And because the HPA’s CPU-based calculation sets aside Pods that are not yet ready, the readiness model established in this chapter reappears at the starting point of the next one.