Certified Kubernetes Application Developer (CKAD) #11 Probes: liveness, readiness, startup (exec/HTTP/TCP)
A container being up doesn’t mean the application inside it is working correctly. The process can be alive but deadlocked and unable to respond, or it can be up but still mid-initialization and not yet ready to take requests. The mechanism Kubernetes uses to tell these two situations apart and handle them is the probe.
This post covers how Kubernetes checks whether a container is alive, and whether it’s ready to take traffic. Probes fall under the Observability and Maintenance (15%) domain of CKAD, and they show up frequently as manifest-writing tasks. The exact YAML format and the meaning of the parameters decide your score far more than conceptual depth, so we’ll learn by typing out the examples until they’re second nature.
What is a probe #
A probe is a diagnostic that the kubelet runs periodically to check the health of a container. At a fixed interval, the kubelet runs the check against the container and, depending on the result (success or failure), responds by restarting the container or removing it from the Service endpoints.
Without probes, Kubernetes only watches whether the container’s main process is alive. If the process dies, it restarts according to the restartPolicy — but a zombie state, where the process is alive yet unable to respond, goes undetected. Probes fill that gap.
Where the K8s practical track #5 post covered the basic behavior of Pods and containers, this post goes one level deeper into how the health of those containers is judged.
The three kinds of probe #
Kubernetes has three kinds of probe, each with a different purpose. Distinguishing among the three precisely is the heart of this post.
| probe | What it asks | Action on failure |
|---|---|---|
| livenessProbe | Is the container alive? | Restart the container |
| readinessProbe | Is it ready to take traffic? | Remove from Service endpoints (no restart) |
| startupProbe | Has a slow application finished initializing? | Liveness and readiness stay disabled until it passes; if it never passes, the container is restarted |
livenessProbe #
The livenessProbe asks whether the container is alive. When the check fails, the kubelet kills the container and restarts it per the restartPolicy. Its purpose is automatic recovery of a container whose process is up but stuck in a deadlock or infinite loop and can’t respond.
The thing to watch for is that overly aggressive liveness settings can actually cause outages. Put a short liveness probe on an application that’s slow to initialize, and you get a restart loop (CrashLoopBackOff) where a healthy-but-still-warming-up container gets killed and restarted over and over.
readinessProbe #
The readinessProbe asks whether the container is ready to take traffic. When the check fails, the kubelet doesn’t kill the container; instead the Pod is marked not Ready and its IP is removed from the Service endpoint list. In other words, traffic stops going to that Pod. Once the check succeeds again, the Pod automatically returns to the endpoints.
Use it for temporary states where the Pod shouldn’t take requests — cache warm-up, establishing a DB connection, waiting on a dependency. Not restarting the container is the decisive difference from liveness.
startupProbe #
The startupProbe exists to protect a slow application’s initialization. For a legacy application that takes a long time to start, the liveness and readiness probes stay disabled until the startupProbe passes. That keeps liveness from killing the container during a long initialization.
Once the startupProbe succeeds, liveness and readiness operate normally from then on. In other words, the startupProbe is a safety device dedicated to the startup window: once it passes it does not run again for that container, although it starts over if the container is restarted.
The three kinds of handler #
Each probe specifies how it performs its check via a handler. There are three core handlers, plus a grpc handler for gRPC servers, and any of them can be attached to any probe.
| Handler | How it checks | Success criterion |
|---|---|---|
| exec | Runs a command inside the container | Exit code 0 |
| httpGet | HTTP GET to the given path and port | Response code 200–399 |
| tcpSocket | Attempts a TCP connection to the given port | Connection established |
exec #
Runs a command inside the container and treats an exit code of 0 as success. It suits workloads without HTTP, such as checking for the existence of a file or running a custom health script.

    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

In the example above, if the /tmp/healthy file exists inside the container, cat exits with code 0 and the probe succeeds; if the file is missing, it fails.
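The exec handler runs the command directly rather than through a shell, so operators such as && or pipes only work if you invoke a shell yourself. The fragment below is a minimal sketch of that pattern; the /tmp/ready file and its "ok" content are hypothetical, and it assumes the image ships /bin/sh.

```yaml
# Container-level fragment: readiness check that needs shell operators.
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    # Hypothetical check: the file must exist, be non-empty, and contain "ok".
    - "test -s /tmp/ready && grep -q ok /tmp/ready"
  initialDelaySeconds: 5
  periodSeconds: 5
```

The probe still succeeds or fails purely on the exit code of the shell command, so the success criterion is the same as for the plain cat example.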
httpGet #
Sends an HTTP GET request to the given path and port, and treats a response code of 200–399 as success. It’s the most common probe form for web applications.

    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Probe
          value: readiness
      initialDelaySeconds: 10
      periodSeconds: 5

port can be specified not only as a number but also by the name of a container port, and you can attach custom headers with httpHeaders. If you need an HTTPS check, add scheme: HTTPS.
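As a sketch of the named-port and HTTPS variants just described, the manifest below references the container port by name and switches the scheme. The Pod name, image, port 8443, and /healthz path are all hypothetical placeholders, not values from this series.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-tls                            # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0    # hypothetical image serving HTTPS on 8443
    ports:
    - name: https                          # the probe refers to this name
      containerPort: 8443
    readinessProbe:
      httpGet:
        path: /healthz                     # assumed health path
        port: https                        # named port instead of a number
        scheme: HTTPS                      # the default is HTTP
      initialDelaySeconds: 10
      periodSeconds: 5
```

With scheme: HTTPS the kubelet skips certificate verification, so a self-signed certificate does not make the probe fail.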
tcpSocket #
Treats a successfully established TCP connection to the given port as success. Use it when all you need to confirm is that a port is open — for example a database or message broker that has no HTTP endpoint.

    livenessProbe:
      tcpSocket:
        port: 6379
      initialDelaySeconds: 15
      periodSeconds: 10

grpc #
For a gRPC server, the grpc handler uses the standard gRPC health checking protocol. It has been enabled by default since Kubernetes 1.24 and stable since 1.27, and the short form is grpc: { port: 50051 }.
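Written out in block form, the grpc handler looks like the sketch below. The Pod name and image are hypothetical, and the server inside is assumed to implement the gRPC health checking service.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: grpc-app                                   # hypothetical name
spec:
  containers:
  - name: server
    image: registry.example.com/grpc-server:1.0    # hypothetical image
    ports:
    - containerPort: 50051
    livenessProbe:
      grpc:
        port: 50051        # must be a number; the grpc handler takes no port name
      initialDelaySeconds: 10
      periodSeconds: 10
```

Unlike httpGet and tcpSocket, the port here is an integer field, so a named container port cannot be used.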
Probe parameters #
Apart from the handler, every probe has a set of common parameters that control the timing of the check. Knowing the meaning and arithmetic of these values precisely is something the exam asks for often.
| Parameter | Default | Meaning |
|---|---|---|
| initialDelaySeconds | 0 | Wait time after the container starts before the first check |
| periodSeconds | 10 | Check interval |
| timeoutSeconds | 1 | Time limit a single check waits for a response |
| successThreshold | 1 | Consecutive successes needed to recover to success after a failure |
| failureThreshold | 3 | Consecutive failures needed to confirm a failure |
- initialDelaySeconds. A container that just came up may not be ready yet, so the first check is delayed by this much. If this value is too small on liveness, it can kill the container mid-initialization.
- periodSeconds. The interval at which the check repeats. Shorter means faster detection but more load.
- timeoutSeconds. If a check doesn’t respond within this time, that check counts as a failure. The default of 1 second can be short for heavy handlers.
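- successThreshold. The number of consecutive successes needed before a previously failing probe counts as passing again. It mainly matters for readiness; for liveness and startup it must be 1.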
- failureThreshold. A single failure doesn’t trigger action immediately — a failure is confirmed only after this many consecutive failures.
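To see the five parameters side by side, here is a minimal sketch of a readiness probe that sets every one of them explicitly. The path, port, and chosen values are illustrative assumptions; only the defaults noted in the comments come from the table above.

```yaml
# Container-level fragment with all five timing parameters spelled out.
readinessProbe:
  httpGet:
    path: /ready             # assumed readiness path
    port: 8080               # assumed application port
  initialDelaySeconds: 5     # default 0: wait 5s before the first check
  periodSeconds: 10          # default 10: check every 10s
  timeoutSeconds: 2          # default 1: each check may take up to 2s
  successThreshold: 2        # default 1: two passes in a row to become ready again
  failureThreshold: 3        # default 3: three failures in a row to become not ready
```

Setting successThreshold above 1 is only valid on a readinessProbe; the same value on a liveness or startup probe is rejected by the API server.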
Calculating the time to failure #
The maximum time before liveness actually restarts a container can be estimated as follows.

    First check begins ≈ initialDelaySeconds
    Failure confirmed ≈ initialDelaySeconds + periodSeconds × failureThreshold

For example, with initialDelaySeconds: 10, periodSeconds: 5, and failureThreshold: 3, a restart can occur around 10 + 5 × 3 = 25 seconds after the container starts. Treat this as an estimate rather than an exact figure, since timeoutSeconds and the kubelet’s own scheduling of checks shift the real timing slightly. For a startupProbe, failureThreshold × periodSeconds is the maximum time the application is allowed to take to start.
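The same arithmetic can be annotated directly on a probe. This is a sketch using the example values from this section; the /healthz path and port 8080 are assumptions, and the timestamps in the comments are approximate.

```yaml
# Container-level fragment; times are measured from container start.
livenessProbe:
  httpGet:
    path: /healthz           # assumed health path
    port: 8080               # assumed application port
  initialDelaySeconds: 10    # checks begin at roughly t = 10s
  periodSeconds: 5           # then repeat about every 5s
  failureThreshold: 3        # three consecutive failures trigger a restart
  # Estimate from the formula: 10 + 5 × 3 = 25s until the restart,
  # if every check fails from the very first one.
```

Changing only failureThreshold to 6 moves the estimate to 10 + 5 × 6 = 40 seconds, which is a quick way to sanity-check an answer during the exam.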
liveness vs readiness: the classic exam mix-up #
The difference between these two is the spot most often confused on the CKAD, so let’s lay it out clearly one more time.
| Aspect | livenessProbe | readinessProbe |
|---|---|---|
| What it asks | Is it alive? | Is it ready to take traffic? |
| On failure | Restart the container | Remove from endpoints |
| Does it kill the container? | Yes | No |
| Recovery path | Re-check after restart | Return to endpoints on a successful check |
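Just remember the core: a liveness failure means restart, a readiness failure means removal from endpoints. If you see “traffic routing” in a question, it’s readiness; if you see “restart” or “recovery,” it’s liveness. Using both together is common. The two run independently, so liveness is typically given the looser timing (a longer initialDelaySeconds or a higher failureThreshold) so that readiness has a chance to pass before liveness can restart the container.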
A combined YAML example #
Here’s all three probes attached to a single container — the most common combination in both practice and the exam.

    apiVersion: v1
    kind: Pod
    metadata:
      name: web
    spec:
      containers:
      - name: app
        # Stand-in image: stock nginx answers 404 on /healthz and /ready,
        # so swap in an application that actually serves these paths.
        image: nginx
        ports:
        - containerPort: 80
        startupProbe:
          httpGet:
            path: /healthz
            port: 80
          failureThreshold: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 15
          periodSeconds: 10
          failureThreshold: 3

In this manifest, the startupProbe waits up to 30 × 10 = 300 seconds for startup. Within that window, as soon as /healthz returns a success code (200–399) once, startup passes and readiness and liveness begin operating from then on.
How to build it fast on the exam #
There’s no probe-specific generator, so the fast flow is to create a Pod skeleton with the dry-run learned in #1, then add just the probe block.

    # $do is the dry-run shorthand from #1
    k run web --image=nginx --port=80 $do > web.yaml
    # Then add the probe block under containers in web.yaml
    # If the field path is unclear, confirm it with explain
    k explain pod.spec.containers.livenessProbe --recursive

Troubleshooting #
A misconfigured probe wrecks a perfectly fine application. On the exam, too, this shows up as a “why is this Pod behaving like this?” type of question.
CrashLoopBackOff: when liveness is too aggressive #
If liveness’s initialDelaySeconds is too short, or the path/port is wrong, it repeatedly kills a healthy container and you get CrashLoopBackOff. First look at the events and status.

    k describe pod web
    # Check for "Liveness probe failed" and the restart history under Events
    k get pod web -o jsonpath='{.status.containerStatuses[0].restartCount}'

If you see a Liveness probe failed event, increase initialDelaySeconds or, if initialization is slow, add a startupProbe to protect the startup window.
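One possible shape of that fix is sketched below: the liveness probe keeps its normal timing and a startupProbe absorbs the slow boot. The image, port 8080, /healthz path, and the 120-second budget are hypothetical values chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: app
    image: registry.example.com/slow-app:1.0   # hypothetical slow-starting image
    ports:
    - containerPort: 8080
    startupProbe:
      httpGet:
        path: /healthz         # assumed health path
        port: 8080
      periodSeconds: 10
      failureThreshold: 12     # startup budget: 12 × 10 = 120 seconds
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3      # only counted after the startupProbe has passed
```

The advantage over simply raising initialDelaySeconds is that a container which starts quickly is not forced to wait out the full delay before liveness protection begins.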
Missing endpoints: when readiness can’t pass #
When traffic isn’t reaching the Service, suspect readiness. If readiness fails, that Pod never makes it onto the endpoint list.

    k get endpoints my-svc
    # If the ENDPOINTS column shows <none>, no Pod is ready
    k describe pod web | grep -A3 Readiness

If the endpoints are empty but the Pod is Running, start by checking whether readiness’s path/port matches the application’s actual health path.
Common mistakes #
- Setting the same path for liveness and readiness, and letting liveness check the state of dependencies too — if a dependency briefly goes down, even a healthy container gets restarted. The principle is liveness checks only itself; readiness checks dependencies too.
- Putting a port in the port field that the container doesn’t actually open. This is a common mistake for both tcpSocket and httpGet.
- Putting only a short liveness probe on a slow application, with no startupProbe.
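The first mistake in the list above is usually avoided by giving the two probes separate endpoints. The fragment below is a sketch of that split; the /livez and /readyz paths are hypothetical and assume the application exposes one handler that checks only itself and another that also checks its dependencies.

```yaml
# Container-level fragment: separate health paths for the two probes.
livenessProbe:
  httpGet:
    path: /livez       # hypothetical: reports only the process's own health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz      # hypothetical: also checks the database, cache, etc.
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
```

With this split, a brief database outage takes the Pod out of the Service endpoints without restarting a container that is otherwise healthy.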
Exam points #
- The failure behavior of the three probes. liveness = restart, readiness = remove from endpoints, startup = hold off the other probes until it passes (and restart the container if it never does). That one line is the core.
- The success criteria of the three handlers. exec = exit code 0, httpGet = 200–399, tcpSocket = connection established.
- The five parameters: their meanings and defaults, plus how to calculate the failure-confirmation time with initialDelaySeconds + periodSeconds × failureThreshold.
- Telling liveness and readiness apart by the wording of the question. “Restart” means liveness, “traffic” means readiness.
- Confirming field paths instantly with k explain pod.spec.containers.livenessProbe --recursive.
Wrap-up #
What this post locked in:
- A probe is a mechanism by which the kubelet periodically checks container health. It fills the zombie-state gap that process liveness alone can’t catch.
- The purposes and failure behaviors of the three kinds — liveness (restart), readiness (remove from endpoints), startup (startup protection).
- The format and success criteria of the three handlers — exec, httpGet, tcpSocket — plus the one-line grpc.
- The meanings of initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, failureThreshold, and the failure-confirmation time calculation.
- Troubleshooting. CrashLoop from aggressive liveness settings, missing endpoints from readiness not passing.
Next: Observability #
You’ve learned how to judge a container’s health with probes. The tools for investigating why a probe failed are the next topic.
In #12 Observability: logging, kubectl debug, port-forward, ephemeral container, we’ll get hands-on with the tools that matter for troubleshooting on the practical exam: the main options of kubectl logs, kubectl debug and ephemeral containers for diagnosing dead containers, and kubectl port-forward for connecting to a Pod directly from your local machine.