Certified Kubernetes Administrator (CKA) #22 Troubleshooting 1: Pods and Apps (Pending, CrashLoop, ImagePull, OOM)

Monday, June 1, 2026

With #21 Helm and Kustomize, we wrapped up every domain about building and deploying manifests. From here, the next four posts are about troubleshooting — fixing things that are already broken. On the CKA exam, Troubleshooting is the single largest domain at 30%. Among the five domains, the task of tracking down and fixing a cluster someone has broken carries more points than the task of building something new. With a passing score of 66%, missing this 30% almost certainly means failing.

The core rule of troubleshooting is simple: don’t guess. Instead of looking at a symptom and inventing a cause in your head, read through the facts the cluster has already recorded (the Events in describe, the container logs, the exit code) in order, and the cause usually reveals itself. In this post we’ll diagnose four Pod-level failures in exactly that order.

Why Troubleshooting is 30% #

Of CKA’s five domains, Troubleshooting is the single largest at 30%. Add the second-place Cluster Architecture (25%) and the two together cross half the exam. This weight is no accident. A cluster administrator’s real job is closer to diagnosing and recovering when something that was running stops than it is to building new resources.

So on the exam too, troubleshooting questions hand you an already-broken state, like “this Pod won’t come up — fix it.” Unlike questions where you write a manifest from scratch, what divides your score here is the diagnostic speed with which you pin down what broke. This post covers only the most common of these, the Pod-level failures; nodes (#23), the control plane (#24), and networking (#25) continue in the following posts.

Diagnostic tools: the reading order is everything #

In troubleshooting, the commands themselves are few. What matters is the order in which you read them. Drill the following sequence until it’s second nature and the cause of most Pod failures will reveal itself within 1–2 minutes.

# 1) Look at the overall state first — STATUS, RESTARTS, AGE
k get pod -o wide

# 2) Read the Events in describe first (90% of diagnosis is here)
k describe pod <name>

# 3) If a container came up and then died, look at the previous container's logs
k logs <name>
k logs <name> --previous

# 4) If there are multiple containers, specify with -c
k logs <name> -c <container>

# 5) Events in chronological order (current namespace; add -A for every namespace)
k get events --sort-by=.metadata.creationTimestamp

Each tool answers a different question.

Tool	The question it answers
`k get pod -o wide`	What is the STATUS right now, which node did it land on, and how many RESTARTS
`k describe pod`	What did the scheduler and kubelet do to this Pod (Events)
`k logs`	What message did the app leave just before dying
`k logs --previous`	What did the container that actually died say, just before the restart
`k get events`	What recently happened in the current namespace (cluster-wide with `-A`)

Exam point: read describe’s Events first #

The most common mistake is starting with k logs. If the state is Pending, the container never came up, so there are no logs. An ImagePull failure is also before container start, so there are no logs. In these cases the answer is all written in the Events section of describe. So the order is always describe (events) first, logs second. Only when a container came up and then died, like CrashLoop, does logs --previous become the decisive clue.

Causes and first-line diagnosis by symptom #

Let’s pin the four symptoms in one table first, then drill each down to reproduce and fix.

Symptom (STATUS)	Did the container come up	Where to look first	Typical cause
Pending	No	`describe` Events	Insufficient resources, nodeSelector mismatch, taint, unbound PVC
CrashLoopBackOff	Came up, then died	`logs --previous`	App error, wrong command, probe failure
ImagePullBackOff / ErrImagePull	No	`describe` Events	Image tag typo, registry auth failure
OOMKilled	Came up, then died	`describe`’s Last State	Memory limit exceeded (exit code 137)

The second column of this table is the branch point of diagnosis. If the container hasn’t come up yet, read describe Events (a scheduler/kubelet story); if it came up and died, read logs --previous (an app story).

1) Pending: it won’t get scheduled #

Pending is the state where kube-scheduler couldn’t find a node to place this Pod on. The container hasn’t even started yet, so there are no logs. The answer is in describe’s Events.

Reproduce #

Demand excessive resources so the Pod fits on no node.

k run hungry --image=nginx \
  --overrides='{"spec":{"containers":[{"name":"hungry","image":"nginx","resources":{"requests":{"cpu":"100"}}}]}}'

Diagnose #

k get pod hungry
# NAME     READY   STATUS    RESTARTS   AGE
# hungry   0/1     Pending   0          20s

k describe pod hungry

In the Events section, lines like the following are the key.

Events:
  Warning  FailedScheduling  ... 0/3 nodes are available:
  3 Insufficient cpu. preemption: 0/3 nodes are available ...

The FailedScheduling message tells you the reason verbatim. The cause of Pending almost always splits on this one line.

Phrase seen in Events	Cause	Fix
`Insufficient cpu` / `Insufficient memory`	requests larger than the node’s free capacity	Lower requests, or add nodes / free up capacity
`node(s) didn't match node selector` (newer releases: `node(s) didn't match Pod's node affinity/selector`)	No node carries the label the nodeSelector requires	Add the label to a node or fix the selector
`node(s) had untolerated taint`	The Pod has no toleration for the node’s taint	Add a toleration to the Pod
`pod has unbound immediate PersistentVolumeClaims`	The PVC isn’t bound to a PV	Check PV/StorageClass, resolve the PVC binding
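
The table above is effectively a lookup from a phrase to a cause, so it can be sketched as a small shell function. This is only a minimal sketch: `classify_pending` is a hypothetical helper name, and the match patterns assume the message wording shown in the table, which can differ between Kubernetes versions.

```shell
# classify_pending MESSAGE: map a FailedScheduling message to a likely cause.
# The patterns are assumptions taken from the table; wording varies by version.
classify_pending() {
  case $1 in
    *"Insufficient cpu"*|*"Insufficient memory"*) echo "resources" ;;
    *"node selector"*|*"node affinity/selector"*) echo "nodeSelector" ;;
    *"untolerated taint"*)                        echo "taint" ;;
    *"unbound immediate PersistentVolumeClaims"*) echo "pvc" ;;
    *)                                            echo "unknown" ;;
  esac
}

# The message from the hungry Pod's Events:
classify_pending "0/3 nodes are available: 3 Insufficient cpu."
```

In practice you read the phrase by eye in describe; the function only makes explicit that one line of Events is enough to pick the branch.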

Fix #

If it’s insufficient resources, lower requests to a realistic value. Here, that means deleting the Pod and recreating it without the 100-CPU request.

k delete pod hungry
k run hungry --image=nginx

For a nodeSelector mismatch, check the node labels and match them.

# What label it requires
k get pod <name> -o jsonpath='{.spec.nodeSelector}'

# Whether the node has that label
k get nodes --show-labels

# When fixing by attaching the label to the node
k label node node01 disktype=ssd

If a taint is the cause, add a toleration (below), or remove the taint from the node if it wasn’t intended.

tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoSchedule"

For an unbound PVC, check the state with k get pvc and k get pv. With StorageClass dynamic provisioning, the claim should bind automatically; with static provisioning, a matching PV must exist. Deeper diagnosis here applies the storage posts (#16, #17) directly.

2) CrashLoopBackOff: it comes up and keeps dying #

CrashLoopBackOff is the state where the container starts but quickly terminates, and the kubelet keeps restarting it with a steadily increasing backoff interval. The RESTARTS number keeps climbing. Here, the logs of the container that died — that is, logs --previous — are decisive.
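
To get a feel for the "steadily increasing backoff interval", the sketch below prints the wait before each restart. It assumes the kubelet's documented default behavior (a 10-second initial delay that doubles each time and is capped at 300 seconds); `backoff_schedule` is a hypothetical helper, and clusters with non-default settings may behave differently.

```shell
# backoff_schedule N: print the assumed delay in seconds before each of the
# first N restarts (10s initial delay, doubling, capped at 300s).
backoff_schedule() {
  delay=10
  out=""
  i=0
  while [ "$i" -lt "$1" ]; do
    out="${out:+$out }$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then delay=300; fi
    i=$((i + 1))
  done
  echo "$out"
}

backoff_schedule 7   # prints: 10 20 40 80 160 300 300
```

This is why a Pod that has been crashing for a while can sit in CrashLoopBackOff for minutes between attempts: after a fix, deleting the Pod is usually faster than waiting out the current delay.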

Reproduce #

Run a command that exits immediately with a non-zero status.

k run crasher --image=busybox --restart=Always -- /bin/sh -c "exit 1"

Diagnose #

k get pod crasher
# NAME      READY   STATUS             RESTARTS      AGE
# crasher   0/1     CrashLoopBackOff   3 (20s ago)   60s

# The current container's logs may be empty while it waits in backoff
k logs crasher

# The last log of the container that died
k logs crasher --previous

Without --previous, you’ll be looking at the empty container waiting in backoff and miss the clue. CrashLoop diagnosis almost always uses --previous.

Causes and fixes #

Cause	Clue in describe / logs	Fix
App terminated on its own error	Stack trace / error message in logs	Fix the app config (env vars / ConfigMap)
Wrong command/args	`exec: "..." : not found`, exit 127	Fix command/args to match the image
Required config missing	`missing env`, connection failure in logs	Check ConfigMap/Secret mount and keys
liveness probe failure	`Liveness probe failed` in describe Events	Adjust probe path/port/initialDelaySeconds

A probe failure as the cause is especially confusing. The app is fine, but the liveness probe checks too soon, or against the wrong path, so the kubelet keeps killing a healthy container. If you see Liveness probe failed: ... in describe Events, suspect the probe config.

# Check the probe config
k get pod crasher -o jsonpath='{.spec.containers[0].livenessProbe}'

If it terminated due to a command typo, fix the command/args in the manifest. On the exam, you either edit the Deployment directly or fix the manifest and re-apply it.

k edit deploy <name>
# Or after fixing the manifest
k apply -f deploy.yaml

3) ImagePullBackOff / ErrImagePull: it can’t pull the image #

Both statuses mean the kubelet couldn’t pull the container image. ErrImagePull shows up first, and once the retries enter backoff it becomes ImagePullBackOff. The container never even started, so there are no logs; the cause is in describe’s Events.

Reproduce #

Specify a nonexistent tag.

k run badimg --image=nginx:doesnotexist

Diagnose #

k get pod badimg
# NAME     READY   STATUS             RESTARTS   AGE
# badimg   0/1     ImagePullBackOff   0          30s

k describe pod badimg

Look at the following lines in Events.

Events:
  Warning  Failed  ... Failed to pull image "nginx:doesnotexist":
  ... manifest for nginx:doesnotexist not found

Causes and fixes #

Events clue	Cause	Fix
`manifest for ... not found`	Image name/tag typo	Fix the image/tag to the correct value
`repository does not exist`	Registry path typo, private repository	Check the full path (registry/repo:tag)
`pull access denied` / `unauthorized`	Registry authentication failure	Set up / attach imagePullSecrets
`no such host` / timeout	The registry is unreachable from the node	Check the node’s network and DNS
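
When checking an image reference against the table above, it can help to split it into its repository path and tag, since a missing tag silently means latest. The helper below is a rough sketch under stated assumptions: `split_image` is a hypothetical name, `registry.example.com:5000` is a made-up registry, and digest references (`@sha256:...`) are not handled.

```shell
# split_image REF: print "<repository path> <tag>" for an image reference.
# Only the last path component is searched for a tag, so a registry port
# such as registry.example.com:5000 is not mistaken for one.
split_image() {
  ref=$1
  last=${ref##*/}              # final path component, e.g. "nginx:1.27"
  case $last in
    *:*) tag=${last##*:}; name=${ref%:*} ;;
    *)   tag=latest;      name=$ref ;;
  esac
  printf '%s %s\n' "$name" "$tag"
}

split_image "nginx:doesnotexist"
split_image "registry.example.com:5000/team/app"
```

The first call shows the tag the kubelet was actually asked for, which is the part to compare against the registry when Events say the manifest was not found.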

A tag typo is the most common. Correct it to the right tag.

k set image pod/badimg badimg=nginx:1.27
# For a Deployment
k set image deploy/<name> <container>=nginx:1.27

For a private registry auth failure, create an imagePullSecret and attach it to the ServiceAccount or Pod spec.

k create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<pass>

spec:
  imagePullSecrets:
  - name: regcred

4) OOMKilled: it exceeded the memory limit #

OOMKilled is the state where the container exceeded its own memory limit and was force-terminated by the kernel OOM killer. The characteristic signal is exit code 137 (128 + SIGKILL 9). The container came up and died, but since the one that killed it was the kernel rather than the app, the logs may carry no trace. The clue is in describe’s Last State.

Reproduce #

Set a small limit and have it use more memory than that.

k run oom --image=polinux/stress \
  --overrides='{"spec":{"containers":[{"name":"oom","image":"polinux/stress","resources":{"limits":{"memory":"20Mi"}},"command":["stress"],"args":["--vm","1","--vm-bytes","250M"]}]}}'

Diagnose #

k get pod oom
# NAME   READY   STATUS      RESTARTS      AGE
# oom    0/1     OOMKilled   2 (10s ago)   40s   # or repeats as CrashLoopBackOff

k describe pod oom

In describe, the following part is decisive.

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137

When you see Reason: OOMKilled and Exit Code: 137 together, an out-of-memory kill is confirmed. If RESTARTS climbs alongside it, the STATUS may show as CrashLoopBackOff, so use Reason: OOMKilled and exit code 137 to separate out the memory problem.
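
The 128 + signal arithmetic behind exit code 137 can be sketched as a tiny decoder. `decode_exit_code` is a hypothetical helper, and it assumes the usual convention that codes above 128 mean "terminated by a signal"; an app can still choose to exit with such a code on its own.

```shell
# decode_exit_code CODE: interpret a container exit code, assuming the
# 128 + signal-number convention for codes above 128.
decode_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited normally"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "app exited with error $code"
  fi
}

decode_exit_code 137   # prints: killed by signal 9
decode_exit_code 1     # prints: app exited with error 1
```

Signal 9 is SIGKILL, which only says the process was killed, not why. That is the reason the text pairs the exit code with Reason: OOMKilled. The code itself can be read with k get pod oom -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'.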

Fix #

The cause is one of two things: either the limit is set unrealistically low relative to the app’s actual usage (a configuration problem), or the app genuinely uses too much memory (an application problem).

# Look at the usual usage (requires metrics-server)
k top pod oom

If too low a limit is the cause, raise it to a realistic value.

resources:
  requests:
    memory: "128Mi"
  limits:
    memory: "256Mi"

The relationship between requests and limits, and the way the QoS class (BestEffort/Burstable/Guaranteed) affects which Pod dies first under OOM, carry over directly from the resource management post (#15). Observing memory and CPU and setting alerts in production is covered under metrics in the observability post.
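
As a reminder of how the three QoS classes fall out of requests and limits, here is a simplified sketch for a single-container Pod. `qos_class` is a hypothetical helper; it compares values as literal strings (so 500m and 0.5 would not match) and ignores multi-container Pods, where every container has to qualify.

```shell
# qos_class REQ_CPU REQ_MEM LIM_CPU LIM_MEM: approximate QoS class of a
# single-container Pod. Pass "" for a value that is not set.
qos_class() {
  req_cpu=$1
  req_mem=$2
  lim_cpu=$3
  lim_mem=$4
  if [ -z "$req_cpu$req_mem$lim_cpu$lim_mem" ]; then
    echo "BestEffort"
    return
  fi
  # An unset request defaults to the limit when a limit is set.
  : "${req_cpu:=$lim_cpu}" "${req_mem:=$lim_mem}"
  if [ -n "$lim_cpu" ] && [ -n "$lim_mem" ] &&
     [ "$req_cpu" = "$lim_cpu" ] && [ "$req_mem" = "$lim_mem" ]; then
    echo "Guaranteed"
  else
    echo "Burstable"
  fi
}

# The resources block above: memory request 128Mi, memory limit 256Mi, no CPU values.
qos_class "" "128Mi" "" "256Mi"   # prints: Burstable
```

So the fix shown above leaves the Pod Burstable; setting CPU and memory requests equal to their limits would be needed for Guaranteed.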

The diagnostic flow on one page #

When you hit a Pod failure in the exam room, go down this order.

1) Look at STATUS and RESTARTS with k get pod -o wide
2) Always read the Events in k describe pod first
3) If STATUS is Pending → branch on the FailedScheduling phrase in Events into resources/selector/taint/PVC
4) If it’s an ImagePull family → branch on the Failed to pull image phrase in Events into tag/auth/network
5) If the container came up and died → check the app message with k logs --previous
6) If describe’s Last State shows OOMKilled / Exit Code: 137 → handle it as a memory limit problem
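
The flow above can also be written down as a decision function, which is a handy way to rehearse it. This is only an illustrative sketch: `next_step` is a hypothetical helper, the STATUS strings are the ones used in this post, and the optional second argument is the exit code from describe's Last State.

```shell
# next_step STATUS [EXIT_CODE]: suggest what to read first for a Pod STATUS.
next_step() {
  status=$1
  exit_code=${2:-}
  case $status in
    Pending)
      echo "describe: read FailedScheduling in Events" ;;
    ErrImagePull|ImagePullBackOff)
      echo "describe: read 'Failed to pull image' in Events" ;;
    OOMKilled)
      echo "describe: check Last State and the memory limit" ;;
    CrashLoopBackOff|Error)
      if [ "$exit_code" = "137" ]; then
        echo "describe: check Last State for OOMKilled"
      else
        echo "logs --previous: read the app's last message"
      fi ;;
    *)
      echo "describe: read Events" ;;
  esac
}

next_step Pending
next_step CrashLoopBackOff 137
next_step CrashLoopBackOff 1
```

Every branch except an ordinary crash starts from describe, which matches the habit the flow is meant to build.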

The starting point of this flow is always the same. Don’t guess — read describe’s Events first. This one habit takes half of Troubleshooting’s 30%.

Wrap-up #

What this post locked in:

Troubleshooting is CKA’s largest domain (30%). The diagnostic speed of fixing what’s broken quickly is what divides your score
The diagnostic tools are k describe (Events), k logs --previous, k get events, k get pod -o wide. Always read describe’s Events first
Pending. The scheduler couldn’t find a node. Branch on the FailedScheduling phrase into insufficient resources / nodeSelector / taint / unbound PVC
CrashLoopBackOff. Came up, then died. Check the app error / wrong command / probe failure with logs --previous
ImagePullBackOff / ErrImagePull. Couldn’t pull the image. Branch on Events into tag typo / registry auth / network
OOMKilled. Memory limit exceeded. Reason: OOMKilled and exit code 137 in describe’s Last State

Next: Troubleshooting 2 #

We’ve got the Pod level down. But sometimes a Pod won’t come up even with a perfectly fine manifest, and worse, the whole node falls into NotReady. Then you have to go one level lower, down to the node and kubelet.

In #23 Troubleshooting 2: Nodes and kubelet, we’ll track down the causes of a node going NotReady. We’ll diagnose and recover cases where the kubelet service died, the certificate or kubeconfig is out of sync, or the node is under disk pressure or memory pressure, going down through them with systemctl status kubelet and journalctl -u kubelet.