K8s Intermediate #5: Health Checks — liveness / readiness / startup probes
The fifth post in the K8s Intermediate series. Up through #4, we built the model of how much resource to give a Pod: CPU and memory requests and limits determine where the scheduler places the Pod and how the cgroup constrains it. But sufficient resources don’t mean the container is actually doing useful work. The process can be running yet deadlocked inside, or the container may have just started while its DB connection pool is still filling, so it shouldn’t take traffic yet. This post walks through how K8s answers two questions, “is it alive?” and “is it ready to take traffic?”, and the three kinds of probes that ground those answers, all in one pass.
K8s Intermediate is a seven-post series:
- #1 StatefulSet / DaemonSet / Job / CronJob — Controllers beyond Deployment
- #2 PV / PVC / StorageClass — the persistent data model
- #3 Ingress and Ingress Controller — the external entry point
- #4 resources.requests / limits — Pod resource requests and limits
- #5 Health checks — liveness / readiness / startup probes ← this post
- #6 Autoscaling — HPA / VPA / Cluster Autoscaler
- #7 RBAC / NetworkPolicy / ResourceQuota — security and resource policy
Why split into three probes
When you first meet probes, a natural question arises: “isn’t it enough to check whether the container is alive?” From an operational view the question isn’t simple, because the single word “alive” mixes two different meanings. A process can be up and healthy at the OS level while its cache isn’t filled yet, so any traffic it receives immediately gets a 502. For that container, the answer to “should it be restarted?” is “no,” and the answer to “should it receive traffic?” is “not yet.” The two answers differ.
K8s separates these two answers into different probes, liveness and readiness, and adds a third layer, startup, to protect slow-starting apps. The three probes’ roles in one table:
| Probe | Question asked | K8s action on failure | Scope of impact |
|---|---|---|---|
| liveness | Is this container alive | Restart that container | That single container |
| readiness | Is this Pod ready to take traffic | Remove that Pod from Service Endpoints | Traffic routing |
| startup | Has this container finished starting | Terminate the container (restart per restartPolicy) | Container startup phase |
The decisive difference among the three probes is what failure leads to. Liveness failure causes container restart, readiness failure causes traffic blocking, startup failure causes startup-phase termination. Without knowing this difference in outcomes, writing manifests leads directly to incidents like “the container is alive but throwing 502s” or “a perfectly healthy app fell into an infinite restart loop.”
Container restart and Pod recreation are different
One often-confused point up front: a liveness failure results in a container restart, not Pod recreation. The Pod stays, and only the container inside it is terminated and started again in the same Pod. The signal is the RESTARTS column of kubectl get pods climbing 1, 2, 3. The Pod itself doesn’t move to another node or get a new IP. A readiness failure, by contrast, doesn’t touch the container at all. The container keeps running as is and is only excluded from the Service’s Endpoints list, so no traffic reaches it.
Three check methods: httpGet / tcpSocket / exec
Every probe uses one of the same three check methods. Each fits different scenarios and has a different cost. (Recent Kubernetes versions also provide a native grpc check, which this post doesn’t cover.)
| Method | Behavior | Suitable workload | Cost |
|---|---|---|---|
| httpGet | HTTP GET to the specified path/port. Success on 200~399 response | HTTP server (most web/API) | Low |
| tcpSocket | TCP connection attempt on the specified port. Success on connection | Non-HTTP server (DBs, some gRPC, Redis) | Very low |
| exec | Execute a command inside the container. Success on exit 0 | Workloads needing arbitrary script checks | High (forks new process) |
httpGet: the most common choice
For most web/API servers, httpGet is the first candidate.
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
      - name: X-Probe
        value: kubelet
  initialDelaySeconds: 10
  periodSeconds: 10
```

The path /healthz is a convention in the K8s ecosystem. In project code you’ll also often see names like /health, /ping, or /-/healthy. Response codes from 200 to 399 count as success, and 4xx and 5xx count as failure. The response body isn’t inspected.
The strength of httpGet is that the app code can express its own state directly. Instead of only reporting “the process is up,” it can return 200 or 503 based on conditions such as “the DB connection pool is healthy” or “the cache is filled.”
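To make the 200~399 rule concrete, here is a minimal Python sketch of an httpGet-style check that uses only the standard library’s http.client. It is not the kubelet’s code. The function names are hypothetical, redirects are not followed, and the response body is never read, which mirrors the rule that only the status code matters.

```python
import http.client
from typing import Dict, Optional


def is_probe_success(status: int) -> bool:
    """httpGet rule: any status from 200 through 399 passes; 4xx/5xx (and anything else) fail."""
    return 200 <= status < 400


def http_get_probe(host: str, port: int, path: str = "/healthz",
                   timeout: float = 1.0,
                   headers: Optional[Dict[str, str]] = None) -> bool:
    """Hypothetical httpGet-style check: a single GET; only the status code decides."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path, headers=headers or {})
        status = conn.getresponse().status  # the body is never inspected
        return is_probe_success(status)
    except (OSError, http.client.HTTPException):
        # Connection refused, timed out (socket.timeout is an OSError), or malformed response
        return False
    finally:
        conn.close()
```

For example, http_get_probe("127.0.0.1", 8080, headers={"X-Probe": "kubelet"}) roughly mirrors the manifest above, with the 1-second default timeout.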
tcpSocket: just the port being open
For non-HTTP servers, tcpSocket is a natural choice.
```yaml
readinessProbe:
  tcpSocket:
    port: 5432
  initialDelaySeconds: 5
  periodSeconds: 10
```

PostgreSQL, MySQL, Redis, and other non-HTTP servers are common targets. K8s attempts a TCP connection (the three-way handshake) to that port. If the connection is established, the check passes; otherwise it fails. TCP connectivity doesn’t mean the server can actually process queries. Even a Postgres instance that has just started listening but hasn’t finished startup will accept TCP connections. So for database readiness, running pg_isready via exec is more accurate than tcpSocket.
exec: checks via an arbitrary command
Checks that can only be expressed as a specific command use exec.
```yaml
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pg_isready -h 127.0.0.1 -p 5432
  initialDelaySeconds: 5
  periodSeconds: 10
```

exec starts a new process inside the container to run the command, and exit code 0 means success. It is the most flexible method and also the most expensive. Each check goes through the container runtime to spawn that process. When the command runs through sh and then launches a client binary, every check pays for two processes. At the default 10-second period, this adds up across the many containers on a node. The standard operational order is to try httpGet first and fall back to tcpSocket or exec only when an HTTP endpoint isn’t available.
Common parameters: time and thresholds
The three probes share the same time parameters. In one table:
| Field | Meaning | Default |
|---|---|---|
| initialDelaySeconds | Time to wait after the container starts before the first check | 0 |
| periodSeconds | Interval between checks | 10 |
| timeoutSeconds | Maximum time to wait for a single check’s response | 1 |
| failureThreshold | Number of consecutive failures that count as a final failure | 3 |
| successThreshold | Number of consecutive successes that count as a final success (must be 1 for liveness/startup) | 1 |
These five values determine how a probe behaves. For example, with periodSeconds: 10 and failureThreshold: 3, K8s declares the probe failed only after three consecutive failed checks. That is up to about 30 seconds after the container stops responding. With timeoutSeconds: 1, a check that gets no response within 1 second counts as a failed check.
The defaults are often too aggressive to use as-is in production. In particular, timeoutSeconds: 1 produces failures when a GC pause runs slightly long or node load briefly spikes. Leaving that default on a liveness probe turns transient response delays directly into container restarts. In production manifests, raising timeoutSeconds to 3–5 seconds and setting failureThreshold to about 3–5 is usually the safer choice.
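Here is a toy model of how timeoutSeconds and failureThreshold combine. It makes two simplifying assumptions: checks fire exactly every periodSeconds starting at initialDelaySeconds, and a check passes only if the response arrives within timeoutSeconds. The function name and the latency numbers are hypothetical.

```python
from typing import Optional, Sequence


def failed_at(latencies: Sequence[float], timeout_seconds: float = 1.0,
              period_seconds: float = 10.0, failure_threshold: int = 3,
              initial_delay_seconds: float = 0.0) -> Optional[float]:
    """Return the time (seconds after container start) at which the probe is
    declared failed, or None if failureThreshold is never reached.

    latencies[i] is the response time of the i-th check (use float("inf") for a hang).
    """
    consecutive_failures = 0
    for i, latency in enumerate(latencies):
        check_time = initial_delay_seconds + i * period_seconds
        passed = latency < timeout_seconds  # too slow counts the same as no answer
        consecutive_failures = 0 if passed else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return check_time
    return None


# A short GC-heavy stretch: three consecutive responses take 1.1 to 1.5 seconds.
gc_spike = [0.05, 1.2, 1.5, 1.1, 0.05, 0.05]
print(failed_at(gc_spike, timeout_seconds=1))  # 30.0: the defaults turn the spike into a restart
print(failed_at(gc_spike, timeout_seconds=3))  # None: a 3-second timeout rides it out
```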
liveness probe: is it alive
The liveness probe’s job is to find containers that look alive but are effectively dead. Its targets are a process that is running but deadlocked and unable to answer any request, or one whose memory leak has stretched response times without bound. When liveness fails, the kubelet terminates the container (SIGTERM, then SIGKILL if it doesn’t exit within the grace period) and restarts it according to the Pod’s restartPolicy. Deployments only allow restartPolicy: Always, which is also the default, so Deployment-managed containers are always restarted automatically.
```yaml
spec:
  template:
    spec:
      containers:
        - name: web
          image: myapp:1.4.0
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
```

The meaning of this manifest:
- After the container starts, don’t check for 30 seconds (initialDelaySeconds).
- Then call /healthz every 10 seconds (periodSeconds).
- Treat a call that doesn’t respond within 3 seconds as a failed check (timeoutSeconds).
- After 3 consecutive failures (failureThreshold), consider liveness failed and restart the container.
When checks actually fail and the container restarts, the evidence shows up in the Events of kubectl describe pod and in the RESTARTS count of kubectl get pods.
```text
Events:
  Type     Reason     Age   From     Message
  ----     ------     ----  ----     -------
  Warning  Unhealthy  2m    kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
  Normal   Killing    2m    kubelet  Container web failed liveness probe, will be restarted
  Normal   Pulled     2m    kubelet  Container image "myapp:1.4.0" already present on machine
  Normal   Created    2m    kubelet  Created container web
  Normal   Started    2m    kubelet  Started container web
```

The typical signature is a “Liveness probe failed” warning immediately followed by “Container ... will be restarted.” If this trace shows up often for a Pod, suspect the liveness probe and work out which case applies. Either the container is genuinely failing, or the probe is too aggressive and is killing a healthy container.
What should go in liveness
This is where operational incidents happen most often. The conclusion first: a liveness probe should look only at its own process state. Don’t put external dependencies (DB, cache, other microservices) into liveness.
The reason is cascading failure. If the DB briefly goes down and every app container’s liveness fails at the same time, they all restart at once. Even after the DB recovers, the apps may not be back for some time. In worse cases, a restarted app still can’t reach the DB, fails liveness again, and falls into an infinite restart loop. Internal process state goes in liveness and external dependencies go in readiness. Locking in this separation from the start is safer.
The /healthz endpoint usually checks only that:
- The app process can produce a response (the request reached the HTTP handler).
- Internal deadlock detection, if the app has any, reports OK.
As an operational rule, never put DB pings or external service calls into this endpoint.
readiness probe: is it ready to take traffic
The readiness probe is the gate for traffic routing. Unlike liveness, a failing readiness probe doesn’t kill the container. Instead, K8s removes that Pod from the Service’s Endpoints list, so no new requests reach the Pod through the Service.
```yaml
spec:
  template:
    spec:
      containers:
        - name: web
          image: myapp:1.4.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
            successThreshold: 1
```

It’s a common pattern to expose /readyz as an endpoint separate from /healthz. The two endpoints look at different things:
- /healthz (liveness): only the state of the process itself
- /readyz (readiness): the process itself + DB ping + cache connection + state of dependent external services

A Pod that fails readiness doesn’t die. It stays running, and only its traffic is cut for a while. While the DB connection is temporarily down, readiness becomes false and traffic is blocked. When the DB recovers, readiness returns to true and traffic flows in again. This model absorbs transient failures without restarting containers.
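Here is a minimal Python sketch of the /healthz vs /readyz split, using only the standard library’s http.server. The dependency_status dict is a hypothetical stand-in for whatever a real app would update from its DB ping and cache checks. The point is only which endpoint looks at what; get_status plays the kubelet’s role.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical dependency state; a real app would refresh this from DB/cache checks.
dependency_status = {"db": False, "cache": False}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":
            # Liveness: reaching this handler proves the process can respond. No DB call.
            self._reply(200, b"ok")
        elif self.path == "/readyz":
            # Readiness: own process plus external dependencies.
            ready = all(dependency_status.values())
            self._reply(200 if ready else 503, b"ready" if ready else b"not ready")
        else:
            self._reply(404, b"not found")

    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep periodic probe traffic out of the logs


def start_server(port: int = 0) -> ThreadingHTTPServer:
    """Serve in a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def get_status(port: int, path: str) -> int:
    """Plays the kubelet's role: GET the path and return only the status code."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=3)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

With the DB marked down, /readyz answers 503 while /healthz keeps answering 200. The Pod would be taken out of Endpoints but never restarted.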
Verifying removal from Endpoints
Here is how Endpoints (or its successor object, EndpointSlice) changes when readiness fails.
```shell
kubectl get svc web
kubectl get endpoints web
```

```text
NAME   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
web    ClusterIP   10.96.123.45   <none>        80/TCP    1d

NAME   ENDPOINTS                                           AGE
web    10.244.1.10:8080,10.244.1.11:8080,10.244.2.5:8080   1d
```

All three Pod IPs appearing in Endpoints is the normal state. When one Pod’s readiness starts failing, only that Pod’s IP is removed from Endpoints.

```text
NAME   ENDPOINTS                          AGE
web    10.244.1.10:8080,10.244.2.5:8080   1d
```

The Service no longer sends traffic to that Pod. In kubectl get pods, the Pod shows READY 0/1 while its container is still Running.

```text
NAME           READY   STATUS    RESTARTS   AGE
web-7c4d-aa1   1/1     Running   0          1d
web-7c4d-bb2   0/1     Running   0          1d
web-7c4d-cc3   1/1     Running   0          1d
```

The 0/1 in the READY column is the key: the Pod has one container, and zero of them are ready. RESTARTS does not increase in this state.
When a Pod has multiple containers
When a Pod has multiple containers and any one of them reports readiness false, the whole Pod becomes not ready and is removed from Endpoints. Even if the other containers are healthy, a single container whose readiness doesn’t come up keeps traffic away from the entire Pod. This is intended behavior. The Pod is K8s’s routing unit, and if one piece inside isn’t ready, not sending traffic to that Pod is safer.
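The two rules above fit in a few lines of Python. A Pod’s IP is in Endpoints only when the Pod is ready, and a Pod is ready only when every container is ready. This is a hypothetical model, not the EndpointSlice controller. The Pod-to-IP assignment below is made up for illustration, although the output matches the kubectl listing earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Pod:
    name: str
    ip: str
    # container name -> result of that container's latest readiness check
    container_ready: Dict[str, bool] = field(default_factory=dict)

    @property
    def ready(self) -> bool:
        # One not-ready container makes the whole Pod not ready
        return bool(self.container_ready) and all(self.container_ready.values())

    def ready_column(self) -> str:
        """The READY column of kubectl get pods: ready containers / total containers."""
        ready_count = sum(self.container_ready.values())
        return f"{ready_count}/{len(self.container_ready)}"


def endpoints(pods: List[Pod], port: int) -> str:
    """Only ready Pods contribute an address; container restarts play no part here."""
    return ",".join(f"{pod.ip}:{port}" for pod in pods if pod.ready)


pods = [
    Pod("web-7c4d-aa1", "10.244.1.10", {"web": True}),
    Pod("web-7c4d-bb2", "10.244.1.11", {"web": False}),  # readiness failing
    Pod("web-7c4d-cc3", "10.244.2.5", {"web": True}),
]
print(endpoints(pods, 8080))   # 10.244.1.10:8080,10.244.2.5:8080
print(pods[1].ready_column())  # 0/1
```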
startup probe: guardian for slow-starting apps
The third probe, startup, is a relatively recent addition: alpha in 1.16, beta in 1.18, and stable (GA) in 1.20. The problem it solves is clear: slow-starting apps.
Java/Spring Boot and Rails apps, and workloads that load large ML models into memory, often take more than 60 seconds to start. Consider such an app with only a liveness probe (no startup probe). The app needs 60 seconds to start, and liveness has initialDelaySeconds: 10. From the 10-second mark K8s starts calling /healthz. The app can’t respond yet, failures accumulate, and the container is killed, around the 30-second mark with the default periodSeconds and failureThreshold. K8s brings it back up, the same thing repeats, and the container falls into an infinite restart loop.
The workaround of setting initialDelaySeconds to something large like 90 or 120 seconds creates a new problem. The delay is fixed and has to cover the worst-case startup time. So after every start, including every restart, the container runs unguarded for that whole window. A deadlock in the first 90 seconds goes unnoticed even on runs where the app was up in 20. Loosening failureThreshold or periodSeconds on liveness instead would slow deadlock detection for the entire life of the container.
The startup probe separates these concerns cleanly. Until startup succeeds, liveness and readiness checks are disabled. Once startup succeeds, the startup probe doesn’t run again for that container, and liveness/readiness operate on their normal cadence.
```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30
```

The manifest above allows up to 5 minutes (10 seconds × 30 checks) for startup. If /healthz returns a success response even once within those 5 minutes, startup is judged successful and the startup probe doesn’t run again. From then on, liveness/readiness operate on their normal cadence. If 5 minutes pass without a single success, startup is judged failed, the container is terminated, and restartPolicy brings it back up.
The key formula is simple: failureThreshold × periodSeconds is the maximum time allowed for startup. If Spring Boot takes 60 seconds on average and occasionally 90, a typical choice is failureThreshold: 12 × periodSeconds: 10 = 120 seconds. That leaves 30 seconds of headroom over the worst case.
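Here is the sizing rule as a tiny Python helper. The 30-second headroom default is an assumption chosen to reproduce the 90 s → 120 s example above; it is not a K8s rule. The helper also ignores the startup probe’s own initialDelaySeconds and per-check timeouts.

```python
import math


def startup_budget_s(failure_threshold: int, period_seconds: int) -> int:
    """Maximum time a startupProbe allows before the container is killed."""
    return failure_threshold * period_seconds


def startup_failure_threshold(worst_case_startup_s: float, period_seconds: int = 10,
                              headroom_s: float = 30.0) -> int:
    """Smallest failureThreshold whose budget covers the worst observed startup plus headroom."""
    return math.ceil((worst_case_startup_s + headroom_s) / period_seconds)


print(startup_budget_s(30, 10))       # 300: the 5-minute example
print(startup_failure_threshold(90))  # 12: 12 x 10 s = 120 s for the Spring Boot case
```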
Three common operational incident scenarios
With the three-probe model in hand, here are three incidents commonly encountered in operation. Avoiding just these three eliminates a large share of health-check-related incidents.
Incident 1: liveness only, no readiness
This is the most common first incident. When the manifest has liveness but no readiness, K8s treats the Pod as ready as soon as the container starts. The Pod is added to the Service’s Endpoints immediately, and traffic comes in.
The problem is the moment right after the container starts. The process is up and listening, but the DB connection pool isn’t filled yet, or the cache isn’t preloaded. Requests arriving at that moment fail with 502s that users see. If you see brief bursts of 502s on every rolling update, a missing readiness probe should be the first suspect.
The fix is simple. Add a readiness probe, and have its /readyz endpoint return 200 or 503 based on a DB ping and the cache state. Then the Pod doesn’t enter Endpoints until it is truly ready to take traffic.
Incident 2: liveness too aggressive
The second incident lies in the liveness parameters themselves. With the default timeoutSeconds: 1, a longer GC pause, or a brief DB slowdown that ties up the request handlers, makes the health check miss the 1-second window. After 3 consecutive failures the container restarts. The freshly restarted container is slow again while it warms up and collects garbage, so it fails again.
Once this cycle starts, it’s hard to break. The same pattern repeats until the operator raises timeoutSeconds in the manifest. Starting liveness at around timeoutSeconds: 3–5 and failureThreshold: 3–5 is safer.
Incident 3: DB ping put into liveness
The third incident happens when manifests are written without understanding the separation between the two probes. If /healthz also checks the DB ping, then when the DB briefly goes down, every app container’s liveness fails simultaneously and they all restart together.
Even if the DB recovers in 30 seconds, the apps may not be back for a while. If they take 30+ seconds to start, the outage stretches even longer. Worse, if the DB wobbles again just as the apps come back, they die again and fall into a cascading failure cycle.
The rule fits in one line: liveness for the process itself, readiness for external dependencies. DB, cache, and other microservices belong in readiness. When the DB goes down, readiness becomes false and traffic is blocked. When the DB recovers, readiness returns to true and traffic flows again. The container never dies, so the restart cascade never starts.
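A deliberately crude simulation of one replica makes the difference visible. All parameters are hypothetical:
- one tick equals one periodSeconds;
- the DB is down for three ticks;
- failureThreshold is 3;
- a restarted container needs three ticks before it can serve;
- readiness always reflects the DB.

The only difference between the two runs is whether the liveness probe also checks the DB. Since every replica shares the same DB, every replica would follow the same trajectory at once.

```python
from typing import List, Tuple


def simulate(db_up: List[bool], db_in_liveness: bool,
             failure_threshold: int = 3, startup_ticks: int = 3) -> Tuple[int, List[bool]]:
    """Toy model of one replica; each tick is one probe period.

    Returns (number of restarts, whether the Pod was in Endpoints at each tick).
    Readiness always checks the DB; only the liveness probe varies.
    """
    restarts = 0
    liveness_failures = 0
    booting_ticks_left = 0
    in_endpoints: List[bool] = []
    for up in db_up:
        if booting_ticks_left > 0:
            booting_ticks_left -= 1
            in_endpoints.append(False)  # restarted container still starting: not ready
            continue
        liveness_ok = up if db_in_liveness else True  # anti-pattern: liveness pings the DB
        liveness_failures = 0 if liveness_ok else liveness_failures + 1
        if liveness_failures >= failure_threshold:
            restarts += 1  # kubelet kills and restarts the container
            liveness_failures = 0
            booting_ticks_left = startup_ticks
            in_endpoints.append(False)
            continue
        in_endpoints.append(up)  # readiness: in Endpoints only while the DB is reachable
    return restarts, in_endpoints


db_up = [True] * 2 + [False] * 3 + [True] * 6  # DB outage at ticks 2 to 4
for placement in (True, False):
    restarts, in_endpoints = simulate(db_up, db_in_liveness=placement)
    print(f"db_in_liveness={placement}: restarts={restarts}, "
          f"back in Endpoints at tick {in_endpoints.index(True, 2)}")
```

With the DB check in liveness, the replica restarts once and only gets traffic back at tick 8, three ticks after the DB recovered. With the check in readiness only, it never restarts and returns to Endpoints at tick 5, the same tick the DB does.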
Probes and graceful shutdown
Graceful shutdown builds directly on the probe model. To keep in-flight requests from turning into 502s when a Pod terminates, traffic must be cut first and the container stopped after. K8s goes through these steps:
1. The Pod enters the Terminating state.
2. K8s removes the Pod’s IP from Endpoints (the traffic cut begins).
3. At the same time, the kubelet sends SIGTERM to the container.
4. K8s waits up to terminationGracePeriodSeconds (default 30s) for the container to exit.
5. If it still hasn’t exited, it is force-killed with SIGKILL.

The subtle part is that steps 2 and 3 happen almost simultaneously. An Endpoints update takes time to propagate from the control plane to each node’s kube-proxy, but SIGTERM arrives immediately. This opens a window in which the container has already started shutting down while the Endpoints update hasn’t fully propagated, so a few last requests still arrive at the Pod. Those requests hit the terminating container and become 502s.
Filling the window with a PreStop hook
The tool to fill this gap is the lifecycle.preStop hook. K8s runs this command before sending SIGTERM, and a short sleep here buys time for the Endpoints update to propagate.
```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: web
          image: myapp:1.4.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
```

The flow of the manifest above:
- Pod termination starts → K8s removes the Pod from Endpoints.
- K8s runs the preStop hook → it sleeps for 10 seconds.
- During those 10 seconds, the Endpoints update has time to propagate across the cluster, so new traffic stops arriving.
- When preStop ends, K8s sends SIGTERM to the container.
- The container finishes its in-flight requests and exits cleanly.
- If it still hasn’t exited when terminationGracePeriodSeconds (60s) runs out, it receives SIGKILL.
terminationGracePeriodSeconds includes preStop’s time. That is, in the example above, of the 60 seconds, 10 are spent on preStop and the remaining 50 are spent on post-SIGTERM termination. Setting preStop to 20 seconds reduces post-SIGTERM time to 40, so both values must be adjusted together.
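The grace-period arithmetic can be sketched as a timeline in Python. The function names are hypothetical, and the model ignores the propagation delay itself. It only encodes the rule that the terminationGracePeriodSeconds clock starts before preStop and includes it.

```python
from typing import List, Tuple


def shutdown_timeline(grace_period_s: int = 30, prestop_s: int = 0) -> List[Tuple[int, str]]:
    """Seconds after termination starts at which each step happens."""
    if prestop_s >= grace_period_s:
        raise ValueError("preStop would consume the whole grace period, leaving no time to drain")
    return [
        (0, "removed from Endpoints; preStop starts"),
        (prestop_s, "SIGTERM sent; app drains in-flight requests"),
        (grace_period_s, "SIGKILL if the container is still running"),
    ]


def drain_budget_s(grace_period_s: int, prestop_s: int) -> int:
    """Time left for the app between SIGTERM and SIGKILL."""
    timeline = shutdown_timeline(grace_period_s, prestop_s)
    return timeline[2][0] - timeline[1][0]


print(drain_budget_s(60, 10))  # 50
print(drain_budget_s(60, 20))  # 40
```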
Having the app handle SIGTERM directly
Another path to the same effect is to build it into the app. When it receives SIGTERM, the app starts answering 503 on /readyz but keeps serving for a short while instead of exiting at once. That delay plays the same role as the preStop sleep. It gives the Endpoints removal, and any external load balancer that polls the readiness endpoint, time to catch up. After that, the app finishes its in-flight requests and exits cleanly.
This approach achieves clean graceful shutdown without a preStop hook. The precondition is that the app-level SIGTERM handler actually runs, which makes the PID 1 problem and init tools covered in Docker Advanced #6 relevant again. If the container’s PID 1 ignores SIGTERM or doesn’t forward it to the app, the readiness-to-false logic never runs either.
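Here is a minimal Python sketch of the app-side pattern. It assumes the Python process itself receives SIGTERM, that is, it runs as PID 1 or signals are forwarded to it. The class name and the 10-second drain delay are hypothetical. The handler only flips state, and the main flow decides when to stop.

```python
import signal
import threading
import time


class ShutdownAwareReadiness:
    """Hypothetical helper: SIGTERM flips /readyz to 503 and starts a drain phase."""

    def __init__(self) -> None:
        self.ready = True
        self.draining = threading.Event()

    def install(self) -> None:
        # Python only allows installing signal handlers from the main thread.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame) -> None:
        self.ready = False   # /readyz answers 503 from now on
        self.draining.set()  # tell the main flow to begin the drain phase

    def readyz_status(self) -> int:
        return 200 if self.ready else 503


def main(drain_delay_s: float = 10.0) -> None:
    """Sketch of the app's main flow (not called here; it blocks until SIGTERM)."""
    state = ShutdownAwareReadiness()
    state.install()
    # ... start the HTTP server here, wiring /readyz to state.readyz_status() ...
    state.draining.wait()      # block until SIGTERM arrives
    time.sleep(drain_delay_s)  # keep serving while the Endpoints removal propagates
    # ... stop accepting connections, finish in-flight requests, then return ...
```

The drain delay should fit inside terminationGracePeriodSeconds along with the time needed to finish in-flight requests.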
Combined manifest
The following example brings the three probes and graceful shutdown together in one manifest, assuming a Java Spring Boot app.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-api
  template:
    metadata:
      labels:
        app: order-api
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: order-api
          image: myorg/order-api:2.3.0
          ports:
            - name: http
              containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              memory: "1Gi"
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 10
            failureThreshold: 18
            timeoutSeconds: 3
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
            successThreshold: 1
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
```

What each part of the manifest is for:
- startup probe: allow up to 180 seconds for startup (10 × 18), which is Spring Boot’s average startup time plus headroom.
- liveness probe: takes over after startup succeeds. /actuator/health/liveness looks only at the process’s own state (no DB ping).
- readiness probe: /actuator/health/readiness looks at the DB connection pool and external dependencies. When the DB is briefly down, readiness becomes false and traffic is blocked while the container stays alive.
- preStop sleep 10s + terminationGracePeriodSeconds 60s: secure a large enough window for graceful shutdown.
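Since Spring Boot 2.3, Actuator provides separate liveness and readiness endpoints out of the box, so this configuration is relatively easy to apply. By default, however, the readiness group reflects only the application’s own readiness state. To have it check the database as described above, add the relevant health indicator to the group, for example management.endpoint.health.group.readiness.include=readinessState,db. In other frameworks, the standard operational pattern is to carve the same separation (own state vs. external dependencies) into two endpoints in code.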
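To tie the rules of this post together, here is a small hypothetical linter in Python. It checks one container entry for the three incidents above plus the startup budget. It takes the container as a plain dict, the shape yaml.safe_load would produce from the manifest; PyYAML itself is left out so the sketch stays dependency-free. The checks are heuristics chosen for this post, not an existing tool. In particular, identical liveness and readiness paths only hint that dependencies may have leaked into liveness.

```python
from typing import Any, Dict, List, Optional


def lint_probes(container: Dict[str, Any],
                expected_startup_s: Optional[float] = None) -> List[str]:
    """Hypothetical checks for one entry of spec.template.spec.containers."""
    warnings: List[str] = []
    liveness = container.get("livenessProbe")
    readiness = container.get("readinessProbe")
    startup = container.get("startupProbe")

    # Incident 1: no readiness, so the Pod joins Endpoints the moment it starts
    if readiness is None:
        warnings.append("no readinessProbe: traffic arrives before the app is ready")

    # Incident 2: timeoutSeconds defaults to 1, which GC pauses can exceed
    if liveness is not None and liveness.get("timeoutSeconds", 1) < 3:
        warnings.append("livenessProbe.timeoutSeconds < 3: slow responses may trigger restarts")

    # Incident 3 (heuristic): the same path for both probes suggests liveness
    # may be checking external dependencies too
    live_path = (liveness or {}).get("httpGet", {}).get("path")
    ready_path = (readiness or {}).get("httpGet", {}).get("path")
    if live_path is not None and live_path == ready_path:
        warnings.append(f"liveness and readiness share {live_path}: keep dependencies out of liveness")

    # Slow starters: failureThreshold x periodSeconds must cover startup
    if expected_startup_s is not None:
        if startup is None:
            warnings.append("no startupProbe: liveness may kill the app while it is still starting")
        else:
            budget = startup.get("failureThreshold", 3) * startup.get("periodSeconds", 10)
            if budget < expected_startup_s:
                warnings.append(f"startupProbe allows {budget}s but startup may take {expected_startup_s}s")
    return warnings
```

Fed the order-api container from the manifest above with an expected startup of 120 seconds, it should return an empty list. A container that has only a bare livenessProbe collects one warning per rule it breaks.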
Docker HEALTHCHECK and K8s probes
A brief note on how Docker’s HEALTHCHECK instruction (touched on in Docker Advanced #6) relates to K8s probes.
```dockerfile
HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost:8080/healthz || exit 1
```

With this instruction, the Docker daemon runs the check itself when the container is run directly by Docker, for example via docker run. The STATUS column of docker ps shows (healthy) or (unhealthy), and Docker Compose’s depends_on.condition: service_healthy reads this value too.
K8s ignores this HEALTHCHECK. K8s looks only at livenessProbe / readinessProbe / startupProbe in the Pod manifest. When the same image runs on K8s, the Dockerfile’s HEALTHCHECK simply has no effect, and probes must be written separately in the manifest. The two models look similar but operate at different layers: checking a single container is Docker’s job, and checking a K8s workload is K8s’s.
If the image will be used in both environments, writing the same check intent in both places is fine. Just be clear that on K8s, the probe in the manifest is the check that actually runs.
Summary
The flow held in this post:
- Three probes split by role: liveness asks “is it alive?” and on failure restarts the container; readiness asks “is it ready to take traffic?” and on failure removes the Pod from Endpoints; startup guards the startup phase and keeps liveness/readiness inactive until it succeeds.
- Three check methods: httpGet (most common, succeeds on 200–399), tcpSocket (non-HTTP servers), and exec (most flexible, but each check spawns a new process). Prefer httpGet when possible.
- Common parameters: initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold, successThreshold. The defaults (especially timeoutSeconds: 1) are too aggressive for production and should be raised.
- Liveness for the process itself, readiness for external dependencies: putting DB pings into liveness leads to cascading failures and infinite restart loops.
- startup probe: guardian for slow-starting apps (Spring Boot, Rails). failureThreshold × periodSeconds is the startup allowance. Stable since 1.20.
- Graceful shutdown: use terminationGracePeriodSeconds (default 30s) and a sleep in the preStop hook to buy time for the Endpoints update, then finish in-flight requests after SIGTERM. An app that drops readiness to false on SIGTERM and keeps serving briefly achieves the same effect.
- Docker HEALTHCHECK is ignored by K8s: K8s sees only the probes in the manifest. The two models operate at different layers.
Once this model is in hand, you can read at a glance the operational scenarios that the three probe blocks, terminationGracePeriodSeconds, and preStop in a Pod manifest are guarding against.
Next: Autoscaling (HPA / VPA / Cluster Autoscaler)
What we’ve covered so far is one Pod’s resource model (#4) and that Pod’s health judgment (this post). Numbers like replicas: 3 were written directly into the manifest by hand. But traffic in an operational cluster swings significantly by time of day and day of week, and manually adjusting replicas each time is not sustainable.
#6 Autoscaling (HPA / VPA / Cluster Autoscaler) walks through the three objects that fill that gap, in one pass. HPA (Horizontal Pod Autoscaler) is a controller that automatically scales replicas up and down based on CPU, memory, or custom metrics. VPA (Vertical Pod Autoscaler) works on a different axis and automatically adjusts a single Pod’s requests / limits. Cluster Autoscaler operates one level higher, adding nodes when there aren’t enough for Pods to be scheduled. And because HPA takes Pod readiness into account when it calculates CPU-based scaling (Pods that aren’t ready yet are set aside), the readiness model covered in this post reappears as the starting point of the next one.