K8s Intermediate #5: Health Checks — liveness / readiness / startup probes

This is the fifth post in the K8s Intermediate series. Through #4, the focus was the model for how much resource to give a Pod: CPU and memory requests and limits determine how the scheduler places the Pod and how cgroups constrain it. But sufficient resources don’t mean the container is actually doing work. The process can be up but deadlocked inside, or the container may have just started with its DB connection pool not yet filled, so it shouldn’t take traffic. This post walks through how K8s answers two questions, “is it alive” and “is it ready to take traffic”, and the three kinds of probes that back those answers.

This series is K8s Intermediate, 7 posts.

Why split into three probes #

When seeing probes for the first time, a natural question arises — “isn’t checking just whether the container is alive enough?” The reason this question isn’t simple from an operational view is that the single phrase “alive” mixes two different meanings. There’s a state where the process is up and fine at the OS level, but the cache inside isn’t filled and any traffic immediately gets a 502. The answer to “should it be restarted?” for that container is “no,” and the answer to “should it receive traffic?” is “not yet.” The two answers differ.

K8s separates these two answers into different objects — liveness and readiness. And it adds one more guardian layer for slow-to-start apps — startup. Putting the three probes’ roles in one table:

| Probe | Question asked | K8s action on failure | Scope of impact |
| --- | --- | --- | --- |
| liveness | Is this container alive | Restart that container | That single container |
| readiness | Is this Pod ready to take traffic | Remove that Pod from Service Endpoints | Traffic routing |
| startup | Has this container finished starting | Terminate the container (restart per restartPolicy) | Container startup phase |

The decisive difference among the three probes is what failure leads to. Liveness failure causes container restart, readiness failure causes traffic blocking, startup failure causes startup-phase termination. Without knowing this difference in outcomes, writing manifests leads directly to incidents like “the container is alive but throwing 502s” or “a perfectly healthy app fell into an infinite restart loop.”

Container restart and Pod recreation are different #

One often-confused point to flag in advance. The result of liveness failure is container restart, not Pod recreation. The Pod stays alive, and only the container inside is terminated and started again in the same Pod. The RESTARTS column in kubectl get pods going up 1, 2, 3 is the signal. The Pod itself doesn’t move to another node or get a new IP. Meanwhile, readiness failure doesn’t touch the container — it stays alive as is, only excluded from the Service’s Endpoints list so traffic doesn’t come in.

Three check methods — httpGet / tcpSocket / exec #

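All three probes choose from the same set of check methods. This post covers the three classic ones; each fits different scenarios and has different costs. (Kubernetes also has a native grpc check, stable since 1.27, which is not covered here.)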

| Method | Behavior | Suitable workload | Cost |
| --- | --- | --- | --- |
| httpGet | HTTP GET to the specified path/port. Success on 200~399 response | HTTP server (most web/API) | Low |
| tcpSocket | TCP connection attempt on the specified port. Success on connection | Non-HTTP server (DBs, some gRPC, Redis) | Very low |
| exec | Execute a command inside the container. Success on exit 0 | Workloads needing arbitrary script checks | High (forks new process) |

httpGet — the most common choice #

For most web/API servers, httpGet is the first candidate.

httpGet probe — excerpt
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
      - name: X-Probe
        value: kubelet
  initialDelaySeconds: 10
  periodSeconds: 10

The path /healthz is a convention in the K8s ecosystem — in project code you’ll often see names like /health, /healthz, /ping, /-/healthy. Response codes in the 200~399 range are judged success, 4xx and 5xx are failure. The response body isn’t inspected.

The strength of httpGet is that the app code can express its own state directly. Instead of just “the process is up,” it can split into 200/503 by meanings like “DB connection pool is healthy” or “cache is filled.”
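
As a minimal sketch of that idea, the Python standard-library handler below answers /healthz with 200 or 503 depending on an in-process flag. The names AppState, health_status, HealthHandler, and make_server are hypothetical, and a real service would do this in its own web framework. http_probe_succeeds only mirrors the 200~399 rule described above; it is not kubelet code.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppState:
    """Hypothetical in-process flag that the app itself keeps up to date."""
    healthy = True


def health_status() -> int:
    """200 when the app considers itself healthy, 503 otherwise."""
    return 200 if AppState.healthy else 503


def http_probe_succeeds(status_code: int) -> bool:
    """Mirror the httpGet rule: 200 through 399 passes, anything else fails."""
    return 200 <= status_code < 400


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /healthz is served in this sketch; every other path is a 404.
        status = health_status() if self.path == "/healthz" else 404
        body = b"ok\n" if status == 200 else b"not ok\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep probe traffic out of stderr in this sketch.
        pass


def make_server(port: int = 8080) -> HTTPServer:
    """Build (but do not start) the server; call serve_forever() to run it."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Flipping AppState.healthy to False is all it takes for the next check to see a 503, which is the point: the app decides what “healthy” means, and the probe only reads the status code.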

tcpSocket — just the port being open #

For non-HTTP servers, tcpSocket is a natural choice.

tcpSocket probe — excerpt
readinessProbe:
  tcpSocket:
    port: 5432
  initialDelaySeconds: 5
  periodSeconds: 10

PostgreSQL, MySQL, Redis, and other non-HTTP servers are common targets. K8s attempts a TCP connection (the 3-way handshake) to that port: if the connection opens the check passes, otherwise it fails. Note that TCP connectivity doesn’t mean the server can actually process queries. Even a Postgres instance that has just started listening but hasn’t finished startup will accept TCP connections. So for database readiness, running pg_isready via exec is more accurate than tcpSocket.
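
To make that limitation concrete, here is roughly what a TCP-level check amounts to, sketched with Python’s standard socket module. tcp_check is a hypothetical helper, not kubelet code. It reports only whether the connection could be opened and says nothing about whether the server can answer a query.

```python
import socket


def tcp_check(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        # create_connection performs the handshake; the socket is closed
        # again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Refused, unreachable, or timed out: all count as a failed check.
        return False
```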

exec — checks via arbitrary command #

Checks that can only be expressed as a specific command use exec.

exec probe — excerpt
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pg_isready -h 127.0.0.1 -p 5432
  initialDelaySeconds: 5
  periodSeconds: 10

exec starts a new process inside the container to run the command, and exit code 0 means success. It is the most flexible method but also the most expensive. Spawning a process is not cheap, and if the command goes through sh and then launches a client binary, each check is heavy. Even at one check per minute, the load adds up across hundreds of containers. The usual order is to consider httpGet first and fall back to tcpSocket or exec only when an HTTP endpoint is unavailable.
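
The exit-code contract can be sketched in a few lines of Python. This is an illustration only: the kubelet runs the command through the container runtime, not through subprocess, and exec_check is a hypothetical name. What the sketch does show is that every single check costs at least one process spawn.

```python
import subprocess
from typing import Sequence


def exec_check(command: Sequence[str], timeout_s: float = 1.0) -> bool:
    """Run a command; exit code 0 within the timeout counts as success."""
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung command or a missing binary is treated as a failed check.
        return False
    return result.returncode == 0
```

With the manifest above, the equivalent call would be exec_check(["/bin/sh", "-c", "pg_isready -h 127.0.0.1 -p 5432"]), which spawns sh and then pg_isready on every round.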

Common parameters — time and thresholds #

The three probes share the same time parameters. In one table:

| Field | Meaning | Default |
| --- | --- | --- |
| initialDelaySeconds | Time to wait after container start before the first check | 0 |
| periodSeconds | Check interval | 10 |
| timeoutSeconds | Upper bound on time waited for a single check’s response | 1 |
| failureThreshold | How many consecutive failures count as final failure | 3 |
| successThreshold | How many consecutive successes count as final success (fixed at 1 for liveness/startup) | 1 |

These five values largely decide one probe’s behavior. For example, with periodSeconds: 10 and failureThreshold: 3, K8s treats the probe as failed only after three consecutive failed checks, which takes roughly 30 seconds. timeoutSeconds: 1 means a single check that doesn’t respond within 1 second counts as a failed round.
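
The “consecutive” part is easy to misread, so here is a small model of it in Python. ProbeTracker is a hypothetical class that only imitates the counting rule described above (a single success resets the failure streak); it is not how the kubelet is implemented.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Hypothetical model of how consecutive results flip a probe's verdict."""
    failure_threshold: int = 3
    success_threshold: int = 1
    passing: bool = True
    _fail_streak: int = 0
    _ok_streak: int = 0

    def record(self, ok: bool) -> bool:
        """Feed one check result and return the current verdict."""
        if ok:
            self._ok_streak += 1
            self._fail_streak = 0
            if self._ok_streak >= self.success_threshold:
                self.passing = True
        else:
            self._fail_streak += 1
            self._ok_streak = 0
            if self._fail_streak >= self.failure_threshold:
                self.passing = False
        return self.passing


def worst_case_detection_seconds(period_s: int, failure_threshold: int) -> int:
    """Upper bound used in the text: periodSeconds x failureThreshold."""
    return period_s * failure_threshold
```

Two failures followed by one success leave the verdict untouched; only three failures in a row flip it, which with a 10-second period is the 30 seconds mentioned above.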

The defaults are often too aggressive to use as-is in operation. In particular, timeoutSeconds: 1 causes failures even when GC takes slightly longer or node load briefly spikes. Leaving that default in a liveness probe means transient response delays translate directly into container restarts. In operational manifests, raising timeoutSeconds to 3–5 seconds and setting failureThreshold to about 3–5 is almost always safer.

liveness probe — is it alive #

The role of the liveness probe is to find containers that are dead but pretending not to be. A process that is up but deadlocked and unable to respond to any request, or one where a memory leak has stretched response times without bound: these are the targets. When liveness fails, K8s sends SIGTERM, then SIGKILL if the container has not exited within the grace period, and restarts the container according to the Pod’s restartPolicy. Pods managed by a Deployment always use restartPolicy: Always, so almost every workload gets automatic restart.

liveness probe — Deployment excerpt
spec:
  template:
    spec:
      containers:
        - name: web
          image: myapp:1.4.0
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3

The meaning of this manifest:

  • After the container starts, don’t check for 30 seconds (initialDelaySeconds).
  • Then call /healthz every 10 seconds (periodSeconds).
  • Treat a single call as a failed round if it doesn’t respond within 3 seconds (timeoutSeconds).
  • After 3 consecutive failures (failureThreshold), see liveness as failed and restart the container.
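
The four bullets above translate into a simple timeline. The sketch below is idealized: it assumes a container that never answers and ignores the scheduling jitter the kubelet adds in practice. The function names are hypothetical.

```python
from typing import List


def failing_check_times(initial_delay_s: int, period_s: int,
                        failure_threshold: int) -> List[int]:
    """Idealized start times (seconds after container start) of the checks
    that lead to a restart when every check fails."""
    return [initial_delay_s + i * period_s for i in range(failure_threshold)]


def restart_decision_second(initial_delay_s: int, period_s: int,
                            failure_threshold: int, timeout_s: int) -> int:
    """Latest moment the final failing check can resolve: its start time
    plus the full timeout, for a container that never answers."""
    times = failing_check_times(initial_delay_s, period_s, failure_threshold)
    return times[-1] + timeout_s
```

For the manifest above, failing_check_times(30, 10, 3) gives [30, 40, 50] and restart_decision_second(30, 10, 3, 3) gives 53, so a container that hangs from the very start is restarted a little under a minute in.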

When checks really fail and the container restarts, traces remain in the events of kubectl describe pod and the RESTARTS count of kubectl get pods.

After liveness failure — kubectl describe pod
Events:
  Type     Reason     Age   From     Message
  ----     ------     ----  ----     -------
  Warning  Unhealthy  2m    kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
  Normal   Killing    2m    kubelet  Container web failed liveness probe, will be restarted
  Normal   Pulled     2m    kubelet  Container image "myapp:1.4.0" already present on machine
  Normal   Created    2m    kubelet  Created container web
  Normal   Started    2m    kubelet  Started container web

The standard pattern is a Liveness probe failed event followed immediately by Container ... will be restarted. Pods where this pattern appears frequently deserve a closer look at the liveness probe itself: you need to determine whether the container is genuinely hanging, or whether the probe is too aggressive and is killing healthy containers.

What should go in liveness #

This is where operational incidents happen most. To put the conclusion first — liveness probe should look only at its own process state. Don’t put external dependencies (DB, cache, other microservices) into liveness.

The reason is cascading failure. If the DB briefly goes down and all app containers’ liveness fails simultaneously, they all restart at the same time. Even after the DB recovers, the apps may not come back up for some time. In worse cases, the restarted app can’t reach the DB again, fails liveness again, and falls into an infinite restart loop. Liveness for internal process state, external dependencies for readiness — locking this separation in from the start is safer.

The /healthz endpoint usually only checks:

  • The app process can produce a response (it reached the HTTP handler).
  • The internal deadlock detection is OK.

As an operational rule, never put DB pings or external service calls into this endpoint.
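
One way to implement “internal deadlock detection” without touching any external dependency is a heartbeat: the app’s main work loop records a timestamp, and /healthz only checks how old that timestamp is. The sketch below assumes such a loop exists; Heartbeat, beat, and max_stall_s are hypothetical names.

```python
import time
from typing import Optional


class Heartbeat:
    """Hypothetical in-process watchdog: the main work loop calls beat()."""

    def __init__(self, max_stall_s: float = 30.0):
        self.max_stall_s = max_stall_s
        self._last_beat = time.monotonic()

    def beat(self) -> None:
        """Called by the work loop on every iteration."""
        self._last_beat = time.monotonic()

    def healthz_status(self, now: Optional[float] = None) -> int:
        """200 while the loop is still beating, 503 once it has stalled."""
        now = time.monotonic() if now is None else now
        stalled_for = now - self._last_beat
        return 200 if stalled_for <= self.max_stall_s else 503
```

Nothing in healthz_status talks to a DB or another service, so a DB outage cannot turn into a liveness failure.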

readiness probe — is it ready to take traffic #

The role of the readiness probe is the gate for traffic routing. Unlike liveness, readiness doesn’t kill the container — instead it removes that Pod from the Service’s Endpoints list. As a result, no new requests come into that Pod.

readiness probe — Deployment excerpt
spec:
  template:
    spec:
      containers:
        - name: web
          image: myapp:1.4.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
            successThreshold: 1

It’s a common pattern to put /readyz as a separate endpoint from /healthz. The two endpoints look at different things.

  • /healthz (liveness) — only the state of the own process
  • /readyz (readiness) — own process + DB ping + cache connection + state of dependent external services

A Pod that fails readiness doesn’t die but stays alive, with traffic only briefly cut. While the DB connection is temporarily down, readiness becomes false and traffic is blocked; when the DB recovers, readiness returns to true and traffic flows in again. A model that absorbs transient failures without container restart.
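
A /readyz handler along these lines can be sketched as a function that runs a set of dependency checks and collapses them into one status code. readyz_status and the check names are hypothetical; real checks would ping the DB or cache with a short timeout.

```python
from typing import Callable, Dict, Tuple


def readyz_status(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[int, Dict[str, bool]]:
    """Run each dependency check; 200 only if all pass, otherwise 503."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises (for example a connection error)
            # counts as "not ready" rather than crashing the handler.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

Returning the per-check results alongside the status makes it easy to put them in the response body, so an operator can see which dependency is holding the Pod out of Endpoints.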

Verifying the shape of being removed from Endpoints #

A short look at how Endpoints (or its successor object EndpointSlice) changes when readiness fails.

Service and Endpoints
kubectl get svc web
kubectl get endpoints web

When all readiness is normal
NAME   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
web    ClusterIP   10.96.123.45    <none>        80/TCP    1d

NAME   ENDPOINTS                                       AGE
web    10.244.1.10:8080,10.244.1.11:8080,10.244.2.5:8080   1d

All three Pod IPs being in Endpoints is the normal state. When one Pod’s readiness drops to failure, only that IP is removed from Endpoints.

When one Pod's readiness is false
NAME   ENDPOINTS                                       AGE
web    10.244.1.10:8080,10.244.2.5:8080                1d

The Service doesn’t send traffic to that Pod. In kubectl get pods, that Pod shows as READY 0/1, with the container still Running.

kubectl get pods — on readiness failure
NAME           READY   STATUS    RESTARTS   AGE
web-7c4d-aa1   1/1     Running   0          1d
web-7c4d-bb2   0/1     Running   0          1d
web-7c4d-cc3   1/1     Running   0          1d

The 0/1 in the READY column is the key. It means one container is up but 0 of them are ready, and RESTARTS doesn’t increase in this state.

When a Pod has multiple containers #

When a Pod has multiple containers and one of their readiness is false, the entire Pod’s ready becomes false and it’s removed from Endpoints. Even if two containers are healthy, just one container’s readiness not coming up means traffic doesn’t enter the entire Pod. This is intended behavior — the Pod is K8s’s routing unit, and if one piece inside isn’t ready, not sending traffic to that Pod is safer.
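
The rule is an all-of condition, which a few lines of Python make explicit. This is a model of the behavior described above, not controller code; the container names web and sidecar are made up for the example, and the IPs are the ones from the Endpoints output earlier.

```python
from typing import Dict, List


def pod_is_ready(container_ready: Dict[str, bool]) -> bool:
    """A Pod is ready only when every one of its containers is ready."""
    return all(container_ready.values())


def service_endpoints(pods: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the Pod IPs that would stay in Endpoints.

    `pods` maps a Pod IP to its per-container readiness flags.
    """
    return [ip for ip, containers in pods.items() if pod_is_ready(containers)]


pods = {
    "10.244.1.10": {"web": True, "sidecar": True},
    "10.244.1.11": {"web": True, "sidecar": False},  # one container not ready
    "10.244.2.5": {"web": True, "sidecar": True},
}
remaining = service_endpoints(pods)  # 10.244.1.11 is dropped
```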

startup probe — guardian for slow-starting apps #

The third probe, startup, is a relatively new addition: alpha in 1.16, beta in 1.18, and stable in 1.20. The problem it solves is clear: slow-starting apps.

Java/Spring Boot, Rails, workloads that load big ML models into memory often take more than 60 seconds to start. Consider what happens when only a liveness probe (no startup probe) is configured on such an app: if the app takes 60 seconds to start and liveness has initialDelaySeconds: 10 — from the 10-second mark K8s starts calling /healthz, the app can’t yet respond, failures accumulate, and the container eventually dies. K8s brings it back up and the same thing repeats, falling into an infinite restart loop.

The workaround of setting initialDelaySeconds to something large like 90 or 120 seconds creates a new problem. The delay applies after every container start, restarts included, so a container that hangs shortly after coming up stays unguarded for the full 90 seconds, and a startup that occasionally runs longer than the fixed delay still gets killed. The other workaround, inflating failureThreshold or periodSeconds on liveness itself, does slow down failure detection during normal operation. Either way, one probe is being asked to cover two very different time scales.

The startup probe cleanly resolves this separation. Until startup succeeds, liveness and readiness are inactive, and once startup succeeds, startup doesn’t run again — liveness/readiness then operate on their normal cadence.

startup probe — excerpt
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30

The manifest above means up to 5 minutes (10 seconds × 30 times) is allowed for startup. If /healthz responds with 200 even once within 5 minutes, startup is judged successful and startup probe doesn’t run again. From then on, liveness/readiness operate on their normal cadence. If 5 minutes pass without a single success, startup is judged failed, the container is terminated, and restartPolicy brings it back up.

The key formula is simple. failureThreshold × periodSeconds is the maximum time allowed for startup. If Spring Boot takes 60 seconds on average and occasionally 90, the typical math is failureThreshold: 12 × periodSeconds: 10 = 120 seconds.
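
The same arithmetic as a small Python helper, in case it is useful for sizing manifests. The function names and the headroom_s parameter are hypothetical; the formula is the one stated above.

```python
import math


def startup_budget_seconds(period_s: int, failure_threshold: int) -> int:
    """Maximum time allowed for startup: failureThreshold x periodSeconds."""
    return period_s * failure_threshold


def failure_threshold_for(worst_case_startup_s: int, period_s: int,
                          headroom_s: int = 0) -> int:
    """Smallest failureThreshold whose budget covers worst case plus headroom."""
    return math.ceil((worst_case_startup_s + headroom_s) / period_s)
```

startup_budget_seconds(10, 30) gives the 300 seconds of the excerpt above, and failure_threshold_for(90, 10, headroom_s=30) gives 12, matching the Spring Boot example.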

Three common operational incident scenarios #

With the three-probe model in hand, here are three incidents commonly encountered in operation. Avoiding just these three eliminates a large share of health-check-related incidents.

Incident 1 — liveness only, no readiness #

The most common first incident. When the manifest has only liveness and no readiness, K8s judges that Pod ready as soon as the container starts. The Pod is immediately added to the Service’s Endpoints and traffic comes in.

The problem is the moment the container has just started. The process is up and is listening, but the DB connection pool isn’t yet filled, or the cache isn’t preloaded. Traffic that comes in at that moment throws 502, affecting users. If you see brief 502 bursts on every rolling update, a missing readiness probe should be the first thing to suspect.

The fix is simple — add a readiness probe and have the /readyz endpoint inside split 200/503 by checking DB ping and cache state. Then the Pod doesn’t enter Endpoints until it’s truly ready to take traffic.

Incident 2 — liveness too aggressive #

The second incident lies in the liveness parameters themselves. With the default timeoutSeconds: 1, a brief DB slowdown or a longer GC pause makes the health check miss the 1-second window. After 3 consecutive failures a container restart triggers, the freshly restarted container is slow again while it warms up and runs GC, and it fails again.

Once this cycle starts, it’s hard to break. The same pattern repeats until the operator raises timeoutSeconds in the manifest. Starting with liveness values around timeoutSeconds: 3–5 and failureThreshold: 3–5 is safer.

Incident 3 — DB ping put into liveness #

The third incident happens when manifests are written without understanding the model separation. With /healthz checking even the DB ping, all app containers’ liveness fails simultaneously when the DB briefly goes down, and they all enter restart together.

Even if the DB recovers in 30 seconds, the apps may not come back for a while — if the apps themselves take 30+ seconds to start, the outage stretches even longer. Worse, once the apps do come back, if the DB wobbles again, they die again and fall into a cascading failure cycle.

The rule is one line: liveness for the process itself, readiness for external dependencies. External dependencies like the DB, cache, and other microservices belong in readiness. When the DB goes down, readiness becomes false and traffic is blocked; when the DB recovers, readiness returns to true and traffic flows again. The containers are never killed, so the restart cascade does not happen.

probes and graceful shutdown #

The topic stacked on top of the probe model is graceful shutdown. To prevent in-flight requests from becoming 502 when a Pod terminates, traffic must be cut first, then the container killed. K8s progresses through these steps:

  1. Pod enters Terminating state.
  2. K8s removes the Pod’s IP from Endpoints (traffic cut starts).
  3. At the same time, sends SIGTERM to the container.
  4. Waits up to terminationGracePeriodSeconds (default 30s) for the container to terminate.
  5. If it doesn’t die after that, force-terminates with SIGKILL.

The subtle part here is that steps 2 and 3 happen almost simultaneously. Endpoints updates take time to propagate through the K8s control plane to each node’s kube-proxy, but SIGTERM arrives instantly. As a result, a window opens where the container that received SIGTERM has just begun terminating, but the Endpoints update hasn’t fully propagated, and a few last requests still arrive at that Pod. Those requests hit the terminating container and become 502s.

Filling the window with a PreStop hook #

The tool to fill this gap is the lifecycle.preStop hook. K8s runs this command before sending SIGTERM, and a short sleep here buys time for the Endpoints update to propagate.

preStop hook — excerpt
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: web
          image: myapp:1.4.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]

The flow of the manifest above:

  1. Pod terminate starts → K8s removes from Endpoints.
  2. K8s runs the preStop hook → sleeps for 10 seconds.
  3. During those 10 seconds, the Endpoints update fully propagates across the cluster — no new traffic comes in.
  4. When preStop ends, K8s sends SIGTERM to the container.
  5. The container processes in-flight requests inside and terminates cleanly.
  6. If termination doesn’t complete, SIGKILL after terminationGracePeriodSeconds (60s).

terminationGracePeriodSeconds includes preStop’s time. That is, in the example above, of the 60 seconds, 10 are spent on preStop and the remaining 50 are spent on post-SIGTERM termination. Setting preStop to 20 seconds reduces post-SIGTERM time to 40, so both values must be adjusted together.
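
Because the two values share one budget, it can help to check them together. The sketch below does only that subtraction; the function names and the drain_s parameter (how long the app is expected to need for in-flight requests) are hypothetical.

```python
def post_sigterm_budget_seconds(termination_grace_s: int, pre_stop_s: int) -> int:
    """Seconds left between SIGTERM and SIGKILL once preStop has used its share."""
    return max(termination_grace_s - pre_stop_s, 0)


def shutdown_fits(termination_grace_s: int, pre_stop_s: int, drain_s: int) -> bool:
    """True if the app's expected drain time fits in the post-SIGTERM budget."""
    return drain_s <= post_sigterm_budget_seconds(termination_grace_s, pre_stop_s)
```

With the values from the manifest, post_sigterm_budget_seconds(60, 10) is 50, and raising preStop to 20 leaves 40. With the default grace period of 30 seconds and a 10-second preStop, an app that needs 25 seconds to drain no longer fits.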

Having the app handle SIGTERM directly #

Another path to the same effect exists: handling SIGTERM in the app itself. On SIGTERM the app keeps its listener open, starts answering /readyz with 503, finishes in-flight requests over a short drain period, and only then exits. Note that K8s already begins removing a Terminating Pod from Endpoints on its own, regardless of the readiness result, so the part that actually closes the window is the app continuing to serve during the drain period instead of exiting the moment SIGTERM arrives.

This approach achieves clean graceful shutdown without a PreStop hook. The precondition is that the app-level SIGTERM handler must work correctly — the PID 1 problem and init tools covered in Docker Advanced #6 become relevant again at this point. If the container’s PID 1 ignores SIGTERM, the readiness-to-false logic never runs either.
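
A minimal sketch of the app-side half of this pattern in Python. DrainState is a hypothetical class; a real service would wire readyz_status into its /readyz handler and stop its server after in-flight requests finish. The handler only flips a flag, which keeps the signal handler itself trivial.

```python
import signal
import threading


class DrainState:
    """Hypothetical shutdown flag shared by the SIGTERM handler and /readyz."""

    def __init__(self):
        self._draining = threading.Event()

    def handle_sigterm(self, signum, frame) -> None:
        # Only flip the flag here; the server loop decides when to exit.
        self._draining.set()

    def is_draining(self) -> bool:
        return self._draining.is_set()

    def readyz_status(self) -> int:
        """503 once SIGTERM has been received, 200 before that."""
        return 503 if self.is_draining() else 200

    def install(self) -> None:
        """Register the handler; must be called from the main thread."""
        signal.signal(signal.SIGTERM, self.handle_sigterm)
```

This only works if SIGTERM actually reaches the process that called install(), which is exactly the PID 1 concern mentioned above.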

Combined manifest #

An example bringing the three probes and graceful shutdown together in one manifest. Assuming a Java Spring Boot app.

full-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-api
  template:
    metadata:
      labels:
        app: order-api
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: order-api
          image: myorg/order-api:2.3.0
          ports:
            - name: http
              containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              memory: "1Gi"
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 10
            failureThreshold: 18
            timeoutSeconds: 3
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
            successThreshold: 1
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]

The intent written in the manifest, line by line:

  • startup probe: allow up to 180 seconds for startup (10 × 18). Spring Boot’s average startup time + headroom.
  • liveness probe: starts working after startup succeeds. /actuator/health/liveness only looks at the own process state (no DB ping).
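  • readiness probe: /actuator/health/readiness is where the DB connection pool and external dependencies are checked. By default Spring Boot’s readiness group contains only the app’s own readiness state, so indicators such as db have to be added to that group in the app’s configuration. When the DB is briefly down, readiness becomes false and traffic is blocked, and the container stays alive.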
  • preStop sleep 10s + terminationGracePeriodSeconds 60s: secure enough window for graceful shutdown.

Spring Boot 2.3+’s actuator provides liveness and readiness endpoints separately as standard, so applying this kind of configuration is relatively easy. In other frameworks too, it’s the operational standard pattern to have the same separation (own state / external dependency) carved into two endpoints at code level.

Docker HEALTHCHECK and K8s probes #

Briefly cleaning up the relationship between Docker’s HEALTHCHECK instruction (touched on in Docker Advanced #6) and K8s probes.

HEALTHCHECK in Dockerfile
HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost:8080/healthz || exit 1

With this instruction, the Docker daemon runs the check when a container is started directly, for example via docker run. The STATUS column of docker ps shows (healthy) / (unhealthy), and Docker Compose’s depends_on.condition: service_healthy reads this value too.

K8s ignores this HEALTHCHECK value. What K8s sees is only livenessProbe / readinessProbe / startupProbe in the Pod manifest. Putting the same image into K8s, the Dockerfile’s HEALTHCHECK is simply ignored, and probes must be written separately in the manifest. The two models look similar but operate at different layers — the check of one container is Docker’s responsibility, and the check of a K8s workload is K8s’s.

If the image is going to be used in both, having the same intent of check written in both places is fine — but it must be clear that the probe in the K8s manifest is the check that actually runs.

Summary #

The flow held in this post:

  • Three probes split by role — liveness is “is it alive” → on failure, container restart; readiness is “is it ready to take traffic” → on failure, removed from Endpoints; startup is the startup-phase guardian → liveness/readiness inactive until success.
  • Three check methods: httpGet (most common, success on 200~399), tcpSocket (non-HTTP servers), exec (most flexible but fork cost). Prefer httpGet when possible.
  • Common parameters: initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold, successThreshold. Defaults (especially timeoutSeconds: 1) are too aggressive for operation and should be raised.
  • Liveness for the own process, external dependencies for readiness — putting DB pings into liveness leads to cascading failure and infinite restart loops.
  • startup probe: guardian for slow-starting apps (Spring Boot, Rails). failureThreshold × periodSeconds is the startup allowance time. Stable from 1.20.
  • graceful shutdown — buy Endpoints update time with terminationGracePeriodSeconds (default 30s) and a sleep in the preStop hook, then process in-flight requests after SIGTERM. The app code pattern of dropping readiness to false on SIGTERM has the same effect.
  • Docker HEALTHCHECK is ignored by K8s — what K8s sees is only the manifest’s probes. The two models operate at different layers.

Once this model is in hand, you can read at a glance the operational scenarios that the three probe blocks, terminationGracePeriodSeconds, and preStop in a Pod manifest are guarding against.

Next — Autoscaling (HPA / VPA / Cluster Autoscaler) #

What we’ve covered so far is one Pod’s resource model (#4) and that Pod’s health judgment (this post). Numbers like replicas: 3 were written directly into the manifest by hand. But traffic in an operational cluster swings significantly by time of day and day of week, and manually adjusting replicas each time is not sustainable.

#6 Autoscaling — HPA / VPA / Cluster Autoscaler walks through three objects that fill that gap in one cycle. HPA (Horizontal Pod Autoscaler) is a controller that automatically scales replicas up and down by CPU/memory/custom metrics. VPA (Vertical Pod Autoscaler) is a different-axis model that automatically adjusts a single Pod’s requests / limits itself. Cluster Autoscaler is a one-level-higher object that automatically adds nodes themselves when there aren’t enough nodes for Pods to schedule on. And since HPA’s input metrics are ultimately gathered only from Pods whose readiness is true, the readiness model covered in this post reappears as the starting point of the next one.
