Certified Kubernetes Administrator (CKA) #22 Troubleshooting 1: Pods and Apps (Pending, CrashLoop, ImagePull, OOM)
With #21 Helm and Kustomize, we wrapped up every domain about building and deploying manifests. The next four posts, starting with this one, are about troubleshooting: fixing things that are already broken. On the CKA exam, Troubleshooting is the single largest of the five domains at 30%, so tracking down and fixing a cluster someone has broken carries more points than any single build-oriented domain. With a passing score of 66%, dropping this 30% leaves at most 70% on the table, which means you would need a near-perfect score on everything else to pass.
The heart of troubleshooting is not guessing. Instead of looking at a symptom and guessing the cause in your head, read down through the facts the cluster has already recorded — the Events in describe, the container logs, the exit code — in order, and the cause almost reveals itself. In this post we’ll diagnose four Pod-level failures in exactly that order.
Why Troubleshooting is 30% #
Of CKA’s five domains, Troubleshooting is the single largest at 30%. Add the second-place Cluster Architecture (25%) and the two together cross half the exam. This weight is no accident. A cluster administrator’s real job is closer to diagnosing and recovering when something that was running stops than it is to building new resources.
So on the exam too, troubleshooting questions hand you an already-broken state, like “this Pod won’t come up — fix it.” Unlike questions where you write a manifest from scratch, what divides your score here is the diagnostic speed with which you pin down what broke. This post covers only the most common of these, the Pod-level failures; nodes (#23), the control plane (#24), and networking (#25) continue in the following posts.
Diagnostic tools: the reading order is everything #
In troubleshooting, the commands themselves are few. What matters is the order in which you read them. Drill the following sequence until it’s second nature and the cause of most Pod failures will reveal itself within 1–2 minutes.
# 1) Look at the overall state first — STATUS, RESTARTS, AGE
k get pod -o wide
# 2) Read the Events in describe first (90% of diagnosis is here)
k describe pod <name>
# 3) If a container came up and then died, look at the previous container's logs
k logs <name>
k logs <name> --previous
# 4) If there are multiple containers, specify with -c
k logs <name> -c <container>
# 5) Cluster-wide events in chronological order
k get events --sort-by=.metadata.creationTimestamp
Each tool answers a different question.
| Tool | The question it answers |
|---|---|
| k get pod -o wide | What is the STATUS right now, which node did it land on, and how many RESTARTS |
| k describe pod | What did the scheduler and kubelet do to this Pod (Events) |
| k logs | What message did the app leave just before dying |
| k logs --previous | What did the container that actually died say, just before the restart |
| k get events | What recently happened at the cluster level |
Exam point: read describe’s Events first #
The most common mistake is starting with k logs. If the state is Pending, the container never came up, so there are no logs. An ImagePull failure is also before container start, so there are no logs. In these cases the answer is all written in the Events section of describe. So the order is always describe (events) first, logs second. Only when a container came up and then died, like CrashLoop, does logs --previous become the decisive clue.
Causes and first-line diagnosis by symptom #
Let’s pin the four symptoms in one table first, then drill each down to reproduce and fix.
| Symptom (STATUS) | Did the container come up | Where to look first | Typical cause |
|---|---|---|---|
| Pending | No | describe Events | Insufficient resources, nodeSelector mismatch, taint, unbound PVC |
| CrashLoopBackOff | Came up, then died | logs --previous | App error, wrong command, probe failure |
| ImagePullBackOff / ErrImagePull | No | describe Events | Image tag typo, registry auth failure |
| OOMKilled | Came up, then died | describe’s Last State | Memory limit exceeded (exit code 137) |
The second column of this table is the branch point of diagnosis. If the container hasn’t come up yet, read describe Events (a scheduler/kubelet story); if it came up and died, read logs --previous (an app story).
1) Pending: it won’t get scheduled #
Pending is the state where kube-scheduler couldn’t find a node to place this Pod on. The container hasn’t even started yet, so there are no logs. The answer is in describe’s Events.
Reproduce #
Demand excessive resources so the Pod fits on no node.
k run hungry --image=nginx \
  --overrides='{"spec":{"containers":[{"name":"hungry","image":"nginx","resources":{"requests":{"cpu":"100"}}}]}}'
Diagnose #
k get pod hungry
# NAME READY STATUS RESTARTS AGE
# hungry 0/1 Pending 0 20s
k describe pod hungry
In the Events section, lines like the following are the key.
Events:
Warning FailedScheduling ... 0/3 nodes are available:
3 Insufficient cpu. preemption: 0/3 nodes are available ...
The FailedScheduling message tells you the reason verbatim. The cause of Pending can almost always be read off this one line.
| Phrase seen in Events | Cause | Fix |
|---|---|---|
| Insufficient cpu / Insufficient memory | requests larger than the node’s free capacity | Lower requests, or add nodes / free up capacity |
| node(s) didn't match node selector | The nodeSelector label exists on no node | Add the label to a node or fix the selector |
| node(s) had untolerated taint | The Pod has no toleration for the node’s taint | Add a toleration to the Pod |
| pod has unbound immediate PersistentVolumeClaims | The PVC isn’t bound to a PV | Check PV/StorageClass, resolve the PVC binding |
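The branching in the table above can be sketched as a small shell function. This is a hypothetical helper, not part of kubectl: the name classify_pending and its output labels are made up for illustration, and the match phrases are taken from the table, so other Kubernetes versions may word them differently (the selector mismatch, for example, may be reported as a node affinity/selector mismatch).

```shell
# classify_pending: map a FailedScheduling message to a likely cause.
# Hypothetical helper for illustration; phrases follow the table above.
classify_pending() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*) echo "resources" ;;
    *"node selector"*|*"node affinity"*)          echo "nodeSelector" ;;
    *"untolerated taint"*)                        echo "taint" ;;
    *"unbound immediate PersistentVolumeClaims"*) echo "pvc" ;;
    *)                                            echo "unknown" ;;
  esac
}

# Example: the message from the hungry Pod above
classify_pending "0/3 nodes are available: 3 Insufficient cpu."
```

In practice you would paste in the message you copied from the Events section of k describe pod; anything that falls through to "unknown" still needs to be read by hand.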
Fix #
If it’s insufficient resources, lower requests to a realistic value.
k delete pod hungry
k run hungry --image=nginx
For a nodeSelector mismatch, check the node labels and match them.
# What label it requires
k get pod <name> -o jsonpath='{.spec.nodeSelector}'
# Whether the node has that label
k get nodes --show-labels
# When fixing by attaching the label to the node
k label node node01 disktype=ssd
If a taint is the cause, add a toleration (below), or remove the taint from the node if it wasn’t intended.
tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoSchedule"
For an unbound PVC, check the state with k get pvc and k get pv. If StorageClass dynamic provisioning is in place, it should bind automatically; if it’s static, a matching PV must exist. Deeper diagnosis here builds directly on the storage posts (#16, #17).
2) CrashLoopBackOff: it comes up and keeps dying #
CrashLoopBackOff is the state where the container starts but quickly terminates, and the kubelet keeps restarting it with a steadily increasing backoff interval. The RESTARTS number keeps climbing. Here, the logs of the container that died — that is, logs --previous — are decisive.
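To make the "steadily increasing backoff interval" concrete: the Kubernetes documentation describes the kubelet's restart delay as starting at 10 seconds, doubling after each crash, and capped at five minutes. The sketch below only prints that schedule. backoff_schedule is a hypothetical helper, and the 10s/300s values are assumed defaults that may differ by version or configuration.

```shell
# backoff_schedule N: print the first N restart delays in seconds.
# Assumed defaults: start at 10s, double each time, cap at 300s.
backoff_schedule() {
  n=$1
  delay=10
  out=""
  while [ "$n" -gt 0 ]; do
    out="$out${out:+ }$delay"          # append with a single space separator
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then delay=300; fi
    n=$((n - 1))
  done
  echo "$out"
}

backoff_schedule 7   # prints: 10 20 40 80 160 300 300
```

This is why a Pod that has been crashing for a while can sit in CrashLoopBackOff for minutes between attempts: after a fix, deleting and recreating the Pod is usually faster than waiting out the current delay.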
Reproduce #
Run a command that exits with a non-zero code right away.
k run crasher --image=busybox --restart=Always -- /bin/sh -c "exit 1"
Diagnose #
k get pod crasher
# NAME READY STATUS RESTARTS AGE
# crasher 0/1 CrashLoopBackOff 3 (20s ago) 60s
# Without --previous you get the current container instance, which may have no useful output yet
k logs crasher
# The last log of the container that died
k logs crasher --previous
Without --previous, you may catch a freshly restarted container that hasn’t logged the failure yet and miss the clue. CrashLoop diagnosis almost always uses --previous.
Causes and fixes #
| Cause | Clue in describe / logs | Fix |
|---|---|---|
| App terminated on its own error | Stack trace / error message in logs | Fix the app config (env vars / ConfigMap) |
| Wrong command/args | exec: "..." : not found, exit 127 | Fix command/args to match the image |
| Required config missing | missing env, connection failure in logs | Check ConfigMap/Secret mount and keys |
| liveness probe failure | Liveness probe failed in describe Events | Adjust probe path/port/initialDelaySeconds |
A probe failure as the cause is especially confusing. The app is fine, but the liveness probe checks too soon, or against the wrong path, so the kubelet keeps killing a healthy container. If you see Liveness probe failed: ... in describe Events, suspect the probe config.
# Check the probe config
k get pod crasher -o jsonpath='{.spec.containers[0].livenessProbe}'
If it terminated due to a command typo, fix the command/args in the manifest. On the exam, you either edit the Deployment directly or fix the manifest and re-apply it.
k edit deploy <name>
# Or after fixing the manifest
k apply -f deploy.yaml
3) ImagePullBackOff / ErrImagePull: it can’t pull the image #
These two are the state where the kubelet couldn’t pull the container image. ErrImagePull shows up first, and once the retries enter backoff it becomes ImagePullBackOff. The container never even started, so there are no logs; the cause is in describe’s Events.
Reproduce #
Specify a nonexistent tag.
k run badimg --image=nginx:doesnotexist
Diagnose #
k get pod badimg
# NAME READY STATUS RESTARTS AGE
# badimg 0/1 ImagePullBackOff 0 30s
k describe pod badimg
Look at the following lines in Events.
Events:
Warning Failed ... Failed to pull image "nginx:doesnotexist":
... manifest for nginx:doesnotexist not found
Causes and fixes #
| Events clue | Cause | Fix |
|---|---|---|
| manifest for ... not found | Image name/tag typo | Fix the image/tag to the correct value |
| repository does not exist | Registry path typo, private repository | Check the full path (registry/repo:tag) |
| pull access denied / unauthorized | Registry authentication failure | Set up / attach imagePullSecrets |
| no such host / timeout | The registry is unreachable from the node | Check the node’s network and DNS |
A tag typo is the most common. Correct it to the right tag.
k set image pod/badimg badimg=nginx:1.27
# For a Deployment
k set image deploy/<name> <container>=nginx:1.27
For a private registry auth failure, create an imagePullSecret and attach it to the ServiceAccount or Pod spec.
k create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<pass>
spec:
  imagePullSecrets:
  - name: regcred
4) OOMKilled: it exceeded the memory limit #
OOMKilled is the state where the container exceeded its own memory limit and was force-terminated by the kernel OOM killer. The characteristic signal is exit code 137 (128 + SIGKILL 9). The container came up and died, but since the one that killed it was the kernel rather than the app, the logs may carry no trace. The clue is in describe’s Last State.
Reproduce #
Set a small limit and have it use more memory than that.
k run oom --image=polinux/stress \
  --overrides='{"spec":{"containers":[{"name":"oom","image":"polinux/stress","resources":{"limits":{"memory":"20Mi"}},"command":["stress"],"args":["--vm","1","--vm-bytes","250M"]}]}}'
Diagnose #
k get pod oom
# NAME READY STATUS RESTARTS AGE
# oom 0/1 OOMKilled 2 (10s ago) 40s # or repeats as CrashLoopBackOff
k describe pod oom
In describe, the following part is decisive.
Last State: Terminated
  Reason: OOMKilled
  Exit Code: 137
When you see Reason: OOMKilled and Exit Code: 137 together, an out-of-memory kill is confirmed. If RESTARTS climbs alongside it, the STATUS may show as CrashLoopBackOff, so use exit code 137 as the clue that separates a memory problem from an ordinary app crash.
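The "128 + signal number" arithmetic behind exit code 137 can be worked through in a few lines of shell. This is a rough sketch, not an exhaustive decoder: describe_exit_code is a hypothetical helper, and it only distinguishes the cases this post mentions (a signal kill, the shell's 127 for a command that was not found, and everything else).

```shell
# describe_exit_code CODE: explain a container exit code.
# Hypothetical helper; codes above 128 conventionally mean 128 + signal number.
describe_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 127 ]; then
    echo "command not found"
  elif [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  else
    echo "process exited with an error"
  fi
}

describe_exit_code 137   # prints: killed by signal 9

# On a live cluster the code can be read straight from the Pod status, e.g.:
#   kubectl get pod oom -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

Signal 9 is SIGKILL, which is what the kernel OOM killer sends. An exit code of 137 alone only proves a SIGKILL, so it is the combination with Reason: OOMKilled that pins the cause on memory.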
Fix #
The cause is one of two things: either the limit is set unrealistically low relative to the app’s actual usage (a configuration problem), or the app genuinely uses too much memory (an application problem).
# Look at the usual usage (requires metrics-server)
k top pod oom
If too low a limit is the cause, raise it to a realistic value.
resources:
  requests:
    memory: "128Mi"
  limits:
    memory: "256Mi"
The relationship between requests and limits, and the way the QoS class (BestEffort/Burstable/Guaranteed) affects which Pod dies first under OOM, carry over directly from the resource management post (#15). Observing memory and CPU and setting alerts in a production environment is covered, from the metrics angle, in the observability post.
The diagnostic flow on one page #
When you hit a Pod failure in the exam room, go down this order.
- Look at STATUS and RESTARTS with k get pod -o wide
- Always read the Events in k describe pod first
- If STATUS is Pending → branch on the FailedScheduling phrase in Events into resources/selector/taint/PVC
- If it’s an ImagePull family → branch on the Failed to pull image phrase in Events into tag/auth/network
- If the container came up and died → check the app message with k logs --previous
- If describe’s Last State shows OOMKilled / Exit Code: 137 → handle it as a memory limit problem
The starting point of this flow is always the same. Don’t guess — read describe’s Events first. This one habit takes half of Troubleshooting’s 30%.
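The flow above can be condensed into a lookup from the STATUS column to the first place to look. This is only a memory aid written as shell: next_step is a hypothetical function name, and the strings it prints are shorthand for the steps in this post, not kubectl output.

```shell
# next_step STATUS: print where to look first for a given Pod STATUS.
# Hypothetical helper that mirrors the diagnostic flow in this post.
next_step() {
  case "$1" in
    Pending)                       echo "describe: FailedScheduling in Events" ;;
    ErrImagePull|ImagePullBackOff) echo "describe: Failed to pull image in Events" ;;
    CrashLoopBackOff)              echo "logs --previous, then Last State in describe" ;;
    OOMKilled)                     echo "describe: Last State, Exit Code 137" ;;
    *)                             echo "describe: read Events" ;;
  esac
}

next_step Pending
next_step CrashLoopBackOff
```

CrashLoopBackOff points at both logs --previous and Last State on purpose, since a repeatedly OOM-killed container can show up under that STATUS as well.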
Wrap-up #
What this post locked in:
- Troubleshooting is CKA’s largest domain (30%). How quickly you can diagnose and fix what’s broken is what divides your score
- The diagnostic tools are k describe (Events), k logs --previous, k get events, k get pod -o wide. Always read describe’s Events first
- Pending. The scheduler couldn’t find a node. Branch on the FailedScheduling phrase into insufficient resources / nodeSelector / taint / unbound PVC
- CrashLoopBackOff. Came up, then died. Check the app error / wrong command / probe failure with logs --previous
- ImagePullBackOff / ErrImagePull. Couldn’t pull the image. Branch on Events into tag typo / registry auth / network
- OOMKilled. Memory limit exceeded. Reason: OOMKilled and exit code 137 in describe’s Last State
Next: Troubleshooting 2 #
We’ve got the Pod level down. But sometimes a Pod won’t come up even with a perfectly fine manifest, and worse, the whole node falls into NotReady. Then you have to go one level lower, down to the node and kubelet.
In #23 Troubleshooting 2: Nodes and kubelet, we’ll track down the causes of a node going NotReady. We’ll diagnose and recover cases where the kubelet service died, the certificate or kubeconfig is out of sync, or the node is under disk pressure or memory pressure, going down through them with systemctl status kubelet and journalctl -u kubelet.