Certified Kubernetes Application Developer (CKAD) #11 Probes: liveness, readiness, startup (exec/HTTP/TCP)


Tuesday, May 26, 2026


A container being up doesn’t mean the application inside it is working correctly. The process can be alive but deadlocked and unable to respond, or it can be up but still mid-initialization and not yet ready to take requests. The mechanism Kubernetes uses to tell these two situations apart and handle them is the probe.

This post covers how Kubernetes checks whether a container is alive and whether it’s ready to take traffic. Probes fall under the Application Observability and Maintenance (15%) domain of CKAD, and they show up frequently as manifest-writing tasks. Getting the YAML format and the parameter meanings exactly right decides your score far more than conceptual depth, so we’ll learn by typing out the examples until they’re second nature.

What is a probe #

A probe is a diagnostic that the kubelet runs periodically to check the health of a container. At a fixed interval, the kubelet runs the check against the container and, depending on the result (success or failure), responds by restarting the container or removing it from the Service endpoints.

Without probes, Kubernetes only watches whether the container’s main process is alive. If the process dies, the container is restarted according to the restartPolicy, but a zombie state, where the process is alive yet unable to respond, goes undetected. Probes fill that gap.

Where the K8s practical track #5 post covered the basic behavior of Pods and containers, this post goes one level deeper into how the health of those containers gets judged.

The three kinds of probe #

Kubernetes has three kinds of probe, each with a different purpose. Distinguishing among the three precisely is the heart of this post.

Probe	What it asks	Action on failure
livenessProbe	Is the container alive?	Restart the container
readinessProbe	Is it ready to take traffic?	Remove from Service endpoints (no restart)
startupProbe	Has a slow application finished initializing?	Restart the container (liveness and readiness are held off until it passes)

livenessProbe #

The livenessProbe asks whether the container is alive. When the check fails, the kubelet kills the container and restarts it per the restartPolicy. Its purpose is automatic recovery of a container whose process is up but stuck in a deadlock or infinite loop and can’t respond.

The thing to watch for is that overly aggressive liveness settings can themselves cause outages. Put a short liveness probe on an application that’s slow to initialize, and a healthy-but-still-warming-up container gets killed and restarted over and over, ending up in CrashLoopBackOff.

readinessProbe #

The readinessProbe asks whether the container is ready to take traffic. When the check fails, the kubelet doesn’t kill the container; instead the Pod is marked not Ready and its IP is removed from the endpoints of every Service that selects it. In other words, traffic stops going to that Pod. Once the check succeeds again, the Pod automatically returns to the endpoints.

Use it for temporary states where the Pod shouldn’t take requests — cache warm-up, establishing a DB connection, waiting on a dependency. Not restarting the container is the decisive difference from liveness.

startupProbe #

The startupProbe exists to protect a slow application’s initialization. For a legacy application that takes a long time to start, the liveness and readiness probes stay disabled until the startupProbe passes. That keeps liveness from killing the container during a long initialization.

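Once the startupProbe succeeds, liveness and readiness operate normally from then on. If it never succeeds within its failure budget, the kubelet kills the container, which is then restarted per the restartPolicy, just as with a liveness failure. In other words, the startupProbe is a safety device dedicated to the startup window: once it passes, it does not run again for the life of that container.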
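
A minimal sketch of that arrangement, assuming a hypothetical application that serves /healthz on port 8080 and may need several minutes to boot. The path, port, and numbers are illustrative, not recommendations: the startupProbe gets a generous budget, while the livenessProbe can stay tight because it is only counted once startup has passed.

```yaml
# Fragment of spec.containers[]; path and port are illustrative assumptions.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 60   # startup budget: 60 × 5 s = 300 s
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3    # only takes effect after the startupProbe has succeeded
```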

The three kinds of handler #

Each probe specifies how it performs its check via a handler. The three handlers below are the ones to know for the exam (a fourth, grpc, is covered briefly at the end of this section), and any of them can be attached to any probe.

Handler	How it checks	Success criterion
exec	Runs a command inside the container	Exit code 0
httpGet	HTTP GET to the given path and port	Response code 200-399
tcpSocket	Attempts a TCP connection to the given port	Connection established

exec #

Runs a command inside the container and treats an exit code of 0 as success. It suits workloads without HTTP, such as checking for the existence of a file or running a custom health script.

livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5

In the example above, if the /tmp/healthy file exists inside the container, cat exits with code 0 and the probe succeeds; if the file is missing, it fails.
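
To watch this fail on purpose, the probe can be dropped into a throwaway Pod. The sketch below assumes a busybox image and a hypothetical Pod name; the container creates /tmp/healthy, removes it after 30 seconds, and keeps running, so the probe passes at first and then starts failing.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-exec        # hypothetical name
spec:
  containers:
    - name: app
      image: busybox
      # Create the file, delete it after 30 s, then keep the process alive.
      command: ["sh", "-c", "touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600"]
      livenessProbe:
        exec:
          command:
            - cat
            - /tmp/healthy
        initialDelaySeconds: 5
        periodSeconds: 5
```

With the default failureThreshold of 3, the container should be restarted shortly after the file disappears, and the restart count in k get pod goes up.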

httpGet #

Sends an HTTP GET request to the given path and port, and treats any response code from 200 to 399 as success. It’s the most common probe form for web applications.

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
      - name: X-Probe
        value: readiness
  initialDelaySeconds: 10
  periodSeconds: 5

port can be specified not only as a number but also by the name of a container port, and you can attach custom headers with httpHeaders. If you need an HTTPS check, add scheme: HTTPS.
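
As a sketch of those two options together, assuming a hypothetical container that listens for HTTPS on port 8443 and names that port https (the port name, number, and path are illustrative):

```yaml
# Fragment of spec.containers[]; port name, number, and path are assumptions.
ports:
  - name: https
    containerPort: 8443
readinessProbe:
  httpGet:
    scheme: HTTPS    # the default is HTTP
    path: /healthz
    port: https      # refers to the container port by name, i.e. 8443
  periodSeconds: 5
```

With scheme: HTTPS the kubelet does not verify the certificate, so a self-signed certificate does not by itself fail the probe.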

tcpSocket #

Treats a successfully established TCP connection to the given port as success. Use it when all you need to confirm is that a port is open — for example a database or message broker that has no HTTP endpoint.

livenessProbe:
  tcpSocket:
    port: 6379
  initialDelaySeconds: 15
  periodSeconds: 10

grpc #

For a gRPC server, the grpc handler uses the standard gRPC Health Checking Protocol, which the server has to implement. The handler has been enabled by default since Kubernetes 1.24 and stable since 1.27, and the form is grpc: { port: 50051 }.
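
Written out in block form, that one-liner looks like the sketch below. The timing values are illustrative, and the commented-out service line uses a hypothetical name; the field itself is optional.

```yaml
# Fragment of spec.containers[]; timing values are illustrative.
livenessProbe:
  grpc:
    port: 50051             # must be a port number
    # service: my-service   # optional; hypothetical service name sent in the health check request
  initialDelaySeconds: 10
  periodSeconds: 10
```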

Probe parameters #

Apart from the handler, every probe has a set of common parameters that control the timing of the check. Knowing the meaning and arithmetic of these values precisely is something the exam asks for often.

Parameter	Default	Meaning
initialDelaySeconds	0	Wait time after the container starts before the first check
periodSeconds	10	Check interval
timeoutSeconds	1	Time limit a single check waits for a response
successThreshold	1	Consecutive successes needed to recover to success after a failure
failureThreshold	3	Consecutive failures needed to confirm a failure

initialDelaySeconds. A container that just came up may not be ready yet, so the first check is delayed by this much. If this value is too small on liveness, it can kill the container mid-initialization.
periodSeconds. The interval at which the check repeats. Shorter means faster detection but more load.
timeoutSeconds. If a check doesn’t respond within this time, that check counts as a failure. The default of 1 second can be short for heavy handlers.
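successThreshold. The number of consecutive successes needed before a probe that has failed counts as passing again. Only readiness can set it above 1; for liveness and startup it must be 1.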
failureThreshold. A single failure doesn’t trigger action immediately — a failure is confirmed only after this many consecutive failures.
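
Here are all five parameters on one probe, as a sketch with illustrative values (the path and port are assumptions). Readiness is used so that successThreshold can be set above 1.

```yaml
# Fragment of spec.containers[]; path, port, and values are illustrative.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5   # no checks during the first 5 s after the container starts
  periodSeconds: 10        # then check every 10 s
  timeoutSeconds: 2        # no response within 2 s counts as a failed check
  successThreshold: 2      # 2 consecutive successes to become Ready again
  failureThreshold: 3      # 3 consecutive failures before the Pod is marked not Ready
```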

Calculating the time to failure #

The maximum time before liveness actually restarts a container can be estimated as follows.

First check begins = initialDelaySeconds
Failure confirmed  = initialDelaySeconds + periodSeconds × failureThreshold

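For example, with initialDelaySeconds: 10, periodSeconds: 5, and failureThreshold: 3, a restart can occur around 10 + 5 × 3 = 25 seconds after the container starts. Treat this as a working estimate rather than an exact figure: if the first check happens to run right at the 10-second mark, the third consecutive failure already lands at 10 + 5 × 2 = 20 seconds, and each failing check can additionally take up to timeoutSeconds. The same arithmetic gives the startup budget: the startupProbe’s failureThreshold × periodSeconds is roughly the maximum time the application is allowed to take to start.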
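
The same arithmetic, written as comments on the fields it comes from. This is a sketch with an illustrative path and port; the startupProbe numbers are a second, made-up example of the budget calculation.

```yaml
# Fragment of spec.containers[]; path and port are illustrative.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # no checks before t = 10 s
  periodSeconds: 5          # checks are about 5 s apart
  failureThreshold: 3       # the third consecutive failure triggers the restart
  # Estimate: 10 + 5 × 3 = 25 s after the container starts.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 12      # startup budget: 12 × 10 = 120 s
```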

liveness vs readiness: the classic exam mix-up #

The difference between these two is the spot most often confused on the CKAD, so let’s lay it out clearly one more time.

Aspect	livenessProbe	readinessProbe
What it asks	Is it alive?	Is it ready to take traffic?
On failure	Restart the container	Remove from endpoints
Does it kill the container?	Yes	No
Recovery path	Re-check after restart	Return to endpoints on a successful check

Just remember the core: a liveness failure means restart, a readiness failure means removal from endpoints. If you see “traffic routing” in a question, it’s readiness; if you see “restart” or “recovery,” it’s liveness. Using both together is common. The two run independently of each other: liveness does not wait for readiness to pass, and only a startupProbe holds either of them back.

A combined YAML example #

Here are all three probes attached to a single container, the most common combination in both practice and the exam.

apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
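      # Placeholder image: stock nginx answers 404 on /healthz and /ready,
      # so substitute an image that actually serves these paths.
      image: nginx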
      ports:
        - containerPort: 80
      startupProbe:
        httpGet:
          path: /healthz
          port: 80
        failureThreshold: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 80
        initialDelaySeconds: 5
        periodSeconds: 5
      livenessProbe:
        httpGet:
          path: /healthz
          port: 80
        initialDelaySeconds: 15
        periodSeconds: 10
        failureThreshold: 3

In this manifest, the startupProbe waits up to 30 × 10 = 300 seconds for startup. Within that window, as soon as /healthz returns a single successful response (200-399), startup passes and readiness and liveness begin operating from then on.

How to build it fast on the exam #

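There’s no probe-specific generator, so the fast flow is to create a Pod skeleton with the dry-run technique from #1, then add just the probe block. In the commands below, k is assumed to be the usual kubectl alias and $do the usual --dry-run=client -o yaml shorthand.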

k run web --image=nginx --port=80 $do > web.yaml
# Then add the probe block under containers in web.yaml
# If the field path is unclear, confirm it with explain
k explain pod.spec.containers.livenessProbe --recursive
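
For orientation, and assuming $do expands to --dry-run=client -o yaml, the generated web.yaml looks roughly like the sketch below; the exact fields vary a little between kubectl versions. The livenessProbe block is the hand-added part, with illustrative values.

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: web
  name: web
spec:
  containers:
  - image: nginx
    name: web
    ports:
    - containerPort: 80
    # Added by hand: the probe block (values are illustrative).
    livenessProbe:
      httpGet:
        path: /          # stock nginx answers 200 here
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
```

The probe is a sibling of image, name, and ports, so its indentation has to match theirs exactly; a block that is one level off is the usual reason k apply rejects the file.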

Troubleshooting #

A misconfigured probe wrecks a perfectly fine application. On the exam, too, this shows up as a “why is this Pod behaving like this?” type of question.

CrashLoopBackOff: when liveness is too aggressive #

If liveness’s initialDelaySeconds is too short, or the path/port is wrong, it repeatedly kills a healthy container and you get CrashLoopBackOff. First look at the events and status.

k describe pod web
# Check for "Liveness probe failed" and the restart history under Events
k get pod web -o jsonpath='{.status.containerStatuses[0].restartCount}'

If you see a Liveness probe failed event, increase initialDelaySeconds, or — if initialization is slow — add a startupProbe to protect the startup window.

Missing endpoints: when readiness can’t pass #

When traffic isn’t reaching the Service, suspect readiness. If readiness fails, that Pod never makes it onto the endpoint list.

k get endpoints my-svc
# If the ENDPOINTS column is empty (<none>), no ready Pod is backing the Service
k describe pod web | grep -A3 Readiness

If the endpoints are empty but the Pod is Running, start by checking whether readiness’s path/port matches the application’s actual health path.

Common mistakes #

Pointing liveness and readiness at the same path, so that liveness ends up checking the state of dependencies too. If a dependency briefly goes down, even a healthy container gets restarted. The principle: liveness checks only the container itself; readiness may check dependencies as well.
Putting a port in port that the container doesn’t actually open. This is a common mistake for both tcpSocket and httpGet.
Putting only a short liveness probe on a slow application, with no startupProbe.
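
A sketch of the first principle, keeping the two checks on separate endpoints. The /livez and /readyz paths and the port are hypothetical; the application would have to implement them with the behavior described in the comments.

```yaml
# Fragment of spec.containers[]; /livez, /readyz, and the port are assumptions.
livenessProbe:
  httpGet:
    path: /livez     # answers from the process alone, with no dependency calls
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz    # may also check the database, cache, or upstream services
    port: 8080
  periodSeconds: 5
```

With this split, a brief database outage takes the Pod out of the endpoints without restarting a container that is otherwise healthy.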

Exam points #

The failure behavior of the three probes. liveness = restart, readiness = remove from endpoints, startup = hold off the other probes until it passes, and restart the container if it never does. That one line is the core.
The success criteria of the three handlers. exec = exit code 0, httpGet = 200-399, tcpSocket = connection established.
The five parameters — their meanings and defaults, plus how to calculate the failure-confirmation time with initialDelaySeconds + periodSeconds × failureThreshold.
Telling liveness and readiness apart by the wording of the question. “Restart” means liveness, “traffic” means readiness.
Confirming field paths instantly with k explain pod.spec.containers.livenessProbe --recursive.

Wrap-up #

What this post locked in:

A probe is a mechanism by which the kubelet periodically checks container health. It fills the zombie-state gap that process liveness alone can’t catch.
The purposes and failure behaviors of the three kinds — liveness (restart), readiness (remove from endpoints), startup (startup protection).
The format and success criteria of the three handlers — exec, httpGet, tcpSocket — plus the one-line grpc.
The meanings of initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, failureThreshold, and the failure-confirmation time calculation.
Troubleshooting. CrashLoop from aggressive liveness settings, missing endpoints from readiness not passing.

Next: Observability #

You’ve learned how to judge a container’s health with probes. The tools for investigating why a probe failed are the next topic.

In #12 Observability: logging, kubectl debug, port-forward, ephemeral container, we’ll cover the main options of kubectl logs, kubectl debug and ephemeral containers for diagnosing dead containers, and kubectl port-forward for attaching to a Pod directly from your local machine. These are the tools that earn troubleshooting points on the practical exam.