Chapter 27

kubectl Debugging Patterns

This first chapter of Part 5 (Operations · Debugging · Cost) collects the diagnostic trees for the incidents you meet most often on a production cluster (CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending, and a Service that can't be reached). Starting from the three commands describe · events · logs, it ties together the ephemeral containers of kubectl debug, network diagnostic patterns, and the Chapter 19 observability stack into a manual meant to be a junior SRE's first reference.

This is the first chapter of Part 5 (Operations · Debugging · Cost). Having come through Part 4, we’ve arrived at a state where myshop-api runs on an EKS cluster, but real operations are never free of incidents. How each incident appears, where to look first, and which tool shows which signal are things automation can’t solve. This chapter organizes that into a single debugging manual.

More than half of the incidents one person meets in a year on a production cluster boil down to five patterns — CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending, a Service that won’t reach. With the diagnostic trees for these five in your head, even when PagerDuty goes off at dawn you can narrow the first-pass cause candidates within 5 minutes. The goal of this chapter is a state where those five trees, plus the tools layered on top — kubectl debug · ephemeral container · network diagnostics — all live in one person’s head.

The starting line of debugging: three commands

The starting point of almost all debugging is the following three commands. Their order is fixed, and each command shows a different aspect.

the debugging trio

kubectl describe pod <name> -n <ns>
kubectl get events -n <ns> --sort-by='.lastTimestamp'
kubectl logs <pod> -n <ns> [-c <container>] [--previous]

The responsibility of each command in one line:

describe: the Pod's current spec, status, and recent events on one screen. The answer is often visible in the first five lines of an incident.
events: the namespace's events in chronological order (add -A to cover every namespace). What the scheduler, kubelet, and controllers each reported is captured as one flow.
logs: the container's own stdout / stderr. The application's real error is here.

Only incidents that these three commands together don't resolve move on to kubectl debug or the observability stack. The first rule of operational debugging: when you see an incident, reach for these three first.
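As a memory aid, the trio can be wrapped in one shell function so the order stays fixed under pressure. This is a minimal sketch: the function name triage_pod and the 100-line tail are assumptions made for illustration, not part of kubectl.

```shell
# triage_pod <pod> <namespace> [container]
# Runs the three first-pass commands in the fixed order: describe, events, logs.
# triage_pod is a hypothetical helper name chosen for this sketch.
triage_pod() {
  local pod="$1" ns="$2" container="${3:-}"

  echo "== describe =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== events (oldest first) =="
  kubectl get events -n "$ns" --sort-by='.lastTimestamp'

  # Reuse the positional parameters as the shared argument list for logs.
  if [ -n "$container" ]; then
    set -- "$pod" -n "$ns" -c "$container"
  else
    set -- "$pod" -n "$ns"
  fi

  echo "== logs (current container) =="
  kubectl logs "$@" --tail=100

  echo "== logs (previous container, if any) =="
  # --previous fails when the container has never restarted; that is expected.
  kubectl logs "$@" --previous --tail=100 || true
}
```

Called as triage_pod <pod> <ns>, it prints the four sections in order, so the output can be pasted into an incident channel as a single block.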

The meaning of Pod states: nine of them and the next action

The states that appear most often in the STATUS column of kubectl get pods are organized in one table.

State	Meaning	First action
`Pending`	waiting to be scheduled	scheduler message in `describe`’s Events
`ContainerCreating`	kubelet is preparing the container	`describe`’s Events (image pull, volume mount)
`Running`	the container is running (but not the same as ready)	check the READY column with `kubectl get pod -o wide`
`CrashLoopBackOff`	the container crashes repeatedly	`logs --previous` and the exit code
`OOMKilled`	terminated for exceeding the memory limit	reason in events + the Chapter 11 limit
`ImagePullBackOff`	couldn’t pull the image	reason in events (permission / registry / tag)
`Error`	the container exited with a non-zero exit code	restartPolicy and the exit code
`Completed`	terminated successfully (the normal state of a Job)	normal — check whether it’s a Job
`Init:Error`, etc.	failure at the initContainer stage	`logs -c <init-container-name>`

If the READY column reads 0/1 while STATUS is Running, that’s a signal the readinessProbe of Chapter 12, Health Checks is failing. STATUS being Running does not mean healthy. It’s the one cell most often missed in operational debugging.
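One quick way to surface exactly this case is to filter the table output for Pods that are Running but not fully ready. A minimal sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS, ...); running_not_ready is a hypothetical helper name.

```shell
# running_not_ready <namespace>
# Prints "name ready" for Pods whose STATUS is Running but whose READY
# column is not n/n. Relies on the default table columns:
#   NAME  READY  STATUS  RESTARTS  AGE
running_not_ready() {
  local ns="$1"
  kubectl get pods -n "$ns" --no-headers |
    awk '$3 == "Running" {
           split($2, r, "/")            # "0/1" -> r[1]=0, r[2]=1
           if (r[1] != r[2]) print $1, $2
         }'
}
```

An empty result means every Running Pod in the namespace is also passing its readiness checks.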

Reading the events section of describe pod

The Events section at the end of the kubectl describe pod output explains the large majority of incidents. How to read an entry depends on which component emitted it.

meaning of the Events section by source

Source                   | what it shows
default-scheduler        | the scheduling stage: the cause of Pending (taint, affinity, resources, PVC)
kubelet                  | the on-node stage: image pull, mount, probe, OOM
controller-manager       | actions of the upper controller, such as a ReplicaSet creating / deleting Pods
attachdetach-controller  | the attach / detach stage of EBS volumes (the mount itself is reported by kubelet)

The sources you see most often in operational debugging are kubelet and default-scheduler. kubelet’s Failed to pull image and default-scheduler’s 0/3 nodes are available: 3 Insufficient memory directly tell you the real cause of ImagePullBackOff and Pending, respectively.

Patterns of kubectl logs

the commonly used options of logs

# live stream
kubectl logs -f <pod>

# the last N lines
kubectl logs --tail=200 <pod>

# multi-container Pod
kubectl logs <pod> -c <container>

# the container that just died (the key to CrashLoopBackOff)
kubectl logs <pod> --previous

# by label selector (every Pod of the Deployment carries this label)
kubectl logs -l app.kubernetes.io/name=myshop-api --tail=100

Most CrashLoopBackOff debugging comes down to the one option --previous. While the Pod sits in back-off, the current container is either not running or has only just restarted, so its log is empty or cut short; the real error message is in the log of the container that just died. Not knowing this one option drops you into the trap of “the log is empty and I don’t know what to look at.”

The pattern of viewing the logs of several Pods at once with a label selector runs naturally thanks to the standard app.kubernetes.io/name label of Chapter 22, The App Deployment Skeleton.

The limits of kubectl exec, and kubectl debug

exec — into a live container

kubectl exec -it <pod> -n <ns> -- /bin/sh

The limits of exec are clear.

If the container is dead, you can’t exec — you can’t exec into a CrashLoopBackOff Pod.
distroless / scratch images have no shell — exec is meaningless on a container with no /bin/sh.
Debugging tools aren’t in the container — tools like curl, dig, and tcpdump are usually absent.

The tool that addresses these three limits is kubectl debug. It adds an ephemeral container to the running Pod, where it shares the Pod's network namespace, and you run the diagnostics from there.

kubectl debug — ephemeral container

# inject a busybox container into the Pod
kubectl debug -it <pod> -n <ns> \
  --image=busybox:1.36 \
  --target=<container-name>

# view the filesystem of a distroless Pod (--target shares the same PID namespace)
kubectl debug -it <pod> -n <ns> \
  --image=nicolaka/netshoot \
  --target=<container-name>

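The --target option is the key: the ephemeral container joins the target container's process namespace, so you can see the original container's processes and reach its filesystem through /proc/<pid>/root. Adding the --profile=netadmin option grants the network capabilities (NET_ADMIN, NET_RAW) that tools such as tcpdump need.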
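Two variations are worth having at hand: copying a crash-looping Pod so that it stays up, and attaching to a target with extra privileges. The sketch below is illustrative only. The helper names are hypothetical, the -debug suffix is an arbitrary choice, and the --profile flag requires a reasonably recent kubectl.

```shell
# debug_crashing_copy <pod> <namespace> <container>
# A CrashLoopBackOff container is rarely alive long enough to attach to.
# This creates a copy of the Pod named <pod>-debug in which the container's
# command is replaced by a shell, so the copy stays up for inspection.
# Assumes the image ships /bin/sh (not true for distroless / scratch images).
debug_crashing_copy() {
  kubectl debug "$1" -n "$2" -it \
    --copy-to="$1-debug" --container="$3" -- /bin/sh
}

# debug_target <pod> <namespace> <container>
# Ephemeral container that joins the target's process namespace.
# Inside it, ps shows the target's processes, and /proc/1/root is the
# target's root filesystem when its main process is PID 1.
# --profile=general is an assumption of this sketch: reading /proc/<pid>/root
# of a process that runs as another user may need the privileges it grants.
debug_target() {
  kubectl debug "$1" -n "$2" -it \
    --image=busybox:1.36 --target="$3" --profile=general
}
```

The copy made by debug_crashing_copy is an ordinary Pod, so remove it afterwards with kubectl delete pod <pod>-debug -n <ns>.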

kubectl debug — diagnosing the node itself

kubectl debug node/<node-name> -it --image=busybox

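A container with the node’s host filesystem mounted at /host comes up, letting you read node logs under /host/var/log/. On systemd-based nodes the kubelet usually logs to the journal, so running chroot /host and then journalctl -u kubelet is the typical way to read its log. It’s a decisive tool for node-level incidents like a mount error of the EBS CSI Driver.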

The CrashLoopBackOff diagnostic tree

The most common incident. The Pod dies right after starting, starts again, and dies again in a cycle.

CrashLoopBackOff diagnostic order

1. kubectl logs <pod> --previous
   -> the stderr of the previous container is the real cause. 95% of the time the answer is here.

2. kubectl describe pod <pod>
   -> check Last State / Reason / Exit Code in the Containers section.
   -> Exit Code 137 = SIGKILL (OOM or forced termination)
   -> Exit Code 139 = SIGSEGV (segfault)
   -> Exit Code 1 = a general error

3. suspect a probe failure
   -> if the liveness in [Chapter 12, Health Checks](./health-checks/) fails too fast,
      the kubelet kills the container even though the application is alive.
   -> check whether initialDelaySeconds is enough.

4. suspect the initContainer stage
   -> if STATUS is Init:Error, the init stage failed.
   -> kubectl logs <pod> -c <init-container-name>

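5. missing ConfigMap / Secret
   -> "MountVolume.SetUp failed" in describe Events (the Pod then stays in ContainerCreating);
      a missing env reference shows up as CreateContainerConfigError instead.
   -> check the name match in [Chapter 6, ConfigMap · Secret](./configmap-and-secret/).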

It also helps to know that the back-off grows exponentially: 10s → 20s → 40s → … up to a maximum of 5 minutes. A Pod that has looked stuck for an hour may not be a new failure at all, but simply the normal state where the kubelet retries once every 5 minutes.
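The exit-code rule from step 2 and the back-off growth are both plain arithmetic, and the following sketch spells them out. The helper names explain_exit_code and backoff_schedule are hypothetical; the numbers (128 + signal, 10s doubling to a 300s cap) come from the tree and the paragraph above.

```shell
# explain_exit_code <code>
# Exit codes above 128 mean "terminated by signal (code - 128)".
explain_exit_code() {
  local code="$1" sig
  case "$code" in
    ''|*[!0-9]*) echo "no exit code recorded"; return 1 ;;
  esac
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      6)  echo "$code: signal $sig (SIGABRT)" ;;
      9)  echo "$code: signal $sig (SIGKILL), OOM kill or forced termination" ;;
      11) echo "$code: signal $sig (SIGSEGV), segmentation fault" ;;
      15) echo "$code: signal $sig (SIGTERM)" ;;
      *)  echo "$code: signal $sig" ;;
    esac
  elif [ "$code" -eq 0 ]; then
    echo "0: clean exit"
  else
    echo "$code: error exit from the process itself, read logs --previous"
  fi
}

# backoff_schedule <restarts>
# Prints the wait before each restart: start at 10s, double, cap at 300s.
backoff_schedule() {
  local n="$1" delay=10 i=1
  while [ "$i" -le "$n" ]; do
    echo "${delay}s"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then delay=300; fi
    i=$((i + 1))
  done
}
```

The code to feed into explain_exit_code can be read with kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'. backoff_schedule 7 shows the cap being reached at the sixth restart.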

The OOMKilled diagnostic tree

OOMKilled diagnostic order

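1. kubectl get events -A --field-selector reason=OOMKilling
   -> a list of which process was OOM-killed on which node, and when.
   -> these events are recorded against the Node (typically by node-problem-detector),
      so a namespace filter can hide them; if nothing comes back, go to step 2.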

2. kubectl describe pod <pod>
   -> Last State -> Terminated -> Reason: OOMKilled, Exit Code: 137.

3. check limits.memory in [Chapter 11, Resource Requests and Limits](./resources-and-limits/)
   -> terminated at the point the container's actual usage exceeded limits.
   -> check the pattern with the container_memory_working_set_bytes metric
      of [Chapter 25, Monitoring · Alerts](./monitoring-and-alerts/).

4. suspect a memory leak
   -> an upward-sloping straight line on the time graph means a code-level leak.
   -> additional diagnosis with a heap profile (Java jmap, Go pprof, Python memray).

5. suspect wrong limits
   -> the JVM's -Xmx may be set larger than the container limits.
   -> a native library's memory use is separate, outside the heap.

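The behavior of OOMKilled also depends on the cgroup version. Under cgroup v1 (and under cgroup v2 before Kubernetes 1.28), the kernel OOM killer picks a single victim process inside the container's cgroup, usually the largest memory consumer, which is not necessarily PID 1; in a multi-process container a worker can die while the container itself keeps running. Under cgroup v2 with Kubernetes 1.28 or later, the kubelet enables memory.oom.group, so all processes of the container are killed together. In a multi-process container you need to check which process died. Whether a node runs cgroup v2 depends on its OS image rather than on the EKS version alone (for example, Amazon Linux 2023 uses v2 while Amazon Linux 2 uses v1).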
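Events expire (the API server keeps them for one hour by default), so it is useful to read the same fact from Pod status, which persists until the next restart. A minimal sketch; oomkilled_containers is a hypothetical helper name.

```shell
# oomkilled_containers <namespace>
# Prints "pod container" for every container whose last termination
# reason is OOMKilled, read from .status rather than from events.
oomkilled_containers() {
  local ns="$1"
  # One line per Pod: "<pod> <container>=<reason> <container>=<reason> ..."
  kubectl get pods -n "$ns" -o jsonpath='{range .items[*]}{.metadata.name}{range .status.containerStatuses[*]}{" "}{.name}{"="}{.lastState.terminated.reason}{end}{"\n"}{end}' |
    awk '{
           for (i = 2; i <= NF; i++) {
             split($i, kv, "=")
             if (kv[2] == "OOMKilled") print $1, kv[1]
           }
         }'
}
```

To check which cgroup version a node runs, stat -fc %T /sys/fs/cgroup/ on the node prints cgroup2fs for v2 and tmpfs for v1.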

The ImagePullBackOff diagnostic tree

ImagePullBackOff diagnostic order

1. kubectl describe pod <pod>
   -> narrow to 6 causes by the reason in Events:

   - "manifest unknown" / "not found"
     -> a typo in the image tag, or a failed push to ECR.
        check the ECR push stage in [Chapter 24, CI/CD](./cicd-pipeline/).

   - "unauthorized" / "denied"
     -> insufficient ECR permission. check the Node IAM Role's ECR read permission,
        or check the imagePullSecret credentials.

   - "no basic auth credentials"
     -> the imagePullSecret itself is missing or its name has a typo.

   - "x509: certificate signed by unknown authority"
     -> a certificate trust problem with a private registry.

   - "context deadline exceeded"
     -> network latency. check the route through the NAT Gateway / VPC Endpoint.

   - "ECR repository ... does not exist"
     -> pointing at the ECR of a different region / different account.

In an EKS environment, having the AmazonEC2ContainerRegistryReadOnly policy attached to the node IAM Role in Chapter 21, EKS Cluster Setup is the default permission path for ECR pull. If you use the ECR of a different account, you have to explicitly open cross-account permission in the repository policy.
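The six message patterns in the tree map naturally onto a shell case statement. The sketch below is one hedged way to encode them; the helper names and the cause labels (tag-or-push, permission, and so on) are invented for this example, and real runtime messages vary in wording, so treat an unclassified result as a prompt to read the message itself.

```shell
# classify_pull_error <event message>
# Maps a kubelet pull-failure message to one of the six causes in the tree.
# Order matters: the more specific patterns are tested first.
classify_pull_error() {
  case "$1" in
    *"manifest unknown"*|*"not found"*)                echo "tag-or-push" ;;
    *"no basic auth credentials"*)                     echo "missing-pull-secret" ;;
    *"unauthorized"*|*"denied"*)                       echo "permission" ;;
    *"x509: certificate signed by unknown authority"*) echo "registry-tls-trust" ;;
    *"context deadline exceeded"*)                     echo "network-path" ;;
    *"does not exist"*)                                echo "wrong-region-or-account" ;;
    *)                                                 echo "unclassified" ;;
  esac
}

# pull_failure_causes <pod> <namespace>
# Classifies every "Failed" event message recorded for the Pod.
pull_failure_causes() {
  local pod="$1" ns="$2"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,reason=Failed" \
    -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' |
    while IFS= read -r msg; do
      if [ -n "$msg" ]; then
        echo "$(classify_pull_error "$msg") <- $msg"
      fi
    done
}
```

Generic messages such as "Error: ErrImagePull" fall through to unclassified; the line next to them usually carries the specific cause.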

The Pending diagnostic tree

Pending diagnostic order

1. kubectl describe pod <pod>
   -> the default-scheduler message in Events tells you the answer immediately.

2. "0/N nodes are available: ... Insufficient cpu/memory"
   -> the requests in [Chapter 11](./resources-and-limits/) exceed the node's available capacity.
   -> check whether the Karpenter / Cluster Autoscaler of [Chapter 13, Autoscaling](./autoscaling/)
      is bringing up a new node.

3. "node(s) didn't match Pod's node affinity/selector"
   -> a typo in the nodeSelector's label key, or no node has that label.
   -> check labels with kubectl get nodes --show-labels.

4. "had taints that the pod didn't tolerate"
   -> a mismatch between the node's taint and the Pod's toleration.
   -> e.g., the Karpenter NodePool taints its nodes (say, a dedicated spot pool) but the Pod has no matching toleration.

5. "pod has unbound immediate PersistentVolumeClaims"
   -> a failure of dynamic provisioning in [Chapter 9, PV/PVC/StorageClass](./pv-pvc-storageclass/).
   -> check whether the StorageClass's EBS CSI Driver is working normally.

6. suspect Karpenter's response time
   -> 30 seconds ~ 2 minutes until a new node is provisioned.
   -> Pending during that time is normal.

Diagnosing Pending narrows to the one question “why did the scheduler reject it?” The one line in describe’s Events always holds the answer — a different angle from OOMKilled / CrashLoop.
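To pull out just that one line, and the node facts it usually refers to, something like the following can help. It is a sketch under assumptions: the helper names are hypothetical, and events are sorted by creation time because scheduler events do not always carry lastTimestamp.

```shell
# last_scheduling_message <pod> <namespace>
# Prints the most recent FailedScheduling message recorded for the Pod.
last_scheduling_message() {
  local pod="$1" ns="$2"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod,reason=FailedScheduling" \
    --sort-by='.metadata.creationTimestamp' \
    -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' | tail -n 1
}

# node_capacity_and_taints
# Allocatable CPU / memory and taint keys for each node.
# Allocatable is the node's schedulable total, not what is still free;
# "Allocated resources" in kubectl describe node shows the requests in use.
node_capacity_and_taints() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory,TAINTS:.spec.taints[*].key'
}
```

Reading the two outputs side by side is usually enough to tell a capacity problem from a taint or selector mismatch.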

When a Service / Ingress won’t reach

diagnostic tree for when a Service won't reach

1. is the Pod itself ready?
   -> kubectl get pods -l <selector> -o wide
   -> whether READY is 1/1 and STATUS is Running.
   -> if the readinessProbe is failing, it is automatically excluded from Endpoints.

2. does the Service's selector match the Pod labels?
   -> kubectl describe service <svc>
   -> kubectl get endpoints <svc> -- if Endpoints is empty, the selector doesn't match (or no matching Pod is Ready).

3. does the Service's port match the Pod's containerPort?
   -> targetPort is the port inside the Pod (or the port name).
   -> port is the Service's own virtual port.

4. is a NetworkPolicy blocking it?
   -> check the ingress rule in [Chapter 14, RBAC/NetworkPolicy/ResourceQuota](./rbac-networkpolicy-quota/).

5. did the Ingress's ALB actually come up?
   -> kubectl describe ingress -- the Address field.
   -> kubectl logs -n kube-system deployment/aws-load-balancer-controller.
   -> the healthy / unhealthy state of the ALB target group (AWS console).

6. check DNS resolution itself
   -> kubectl run -it --rm dns-test --image=busybox:1.36 \
        -- nslookup <svc>.<ns>.svc.cluster.local

The 3-stage chain of selector → endpoints → port from Chapter 5, Service is the basic thought model of operational debugging. Narrowing down which of the three is broken is the core of Service debugging.
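The chain can be walked mechanically. The following is a minimal sketch, with service_chain as a hypothetical helper name: it reads the Service's own selector, lists the Pods that selector actually matches, then shows the Endpoints and the port mapping.

```shell
# service_chain <service> <namespace>
# Walks selector -> endpoints -> port for one Service.
service_chain() {
  local svc="$1" ns="$2" selector

  # Render .spec.selector as "k1=v1,k2=v2" so it can be passed to -l.
  selector=$(kubectl get service "$svc" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')
  selector=${selector%,}   # drop the trailing comma

  if [ -z "$selector" ]; then
    echo "service $svc has no selector; its Endpoints are managed manually"
    return 1
  fi

  echo "== 1. Pods matched by selector: $selector =="
  kubectl get pods -n "$ns" -l "$selector" -o wide

  echo "== 2. endpoints =="
  kubectl get endpoints "$svc" -n "$ns"

  echo "== 3. port -> targetPort =="
  kubectl get service "$svc" -n "$ns" \
    -o jsonpath='{range .spec.ports[*]}{.port}{" -> "}{.targetPort}{"\n"}{end}'
}
```

If step 1 lists no Pods, the selector is wrong; if it lists Pods but step 2 is empty, readiness is the problem; if both look fine, compare step 3 with the containerPort.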

Network diagnostic tools

bringing up a temporary diagnostic Pod

# the lightest temporary container
kubectl run -it --rm net-test --image=busybox:1.36 \
  -n <ns> -- /bin/sh

# the full package (curl, dig, traceroute, tcpdump, nslookup)
kubectl run -it --rm net-test --image=nicolaka/netshoot \
  -n <ns> -- /bin/bash

nicolaka/netshoot is effectively the standard diagnostic image. You can solve all of the following scenarios inside one container.

diagnostic commands inside netshoot

# DNS
dig myshop-api.myshop.svc.cluster.local

# HTTP connection
curl -v http://myshop-api.myshop:80/health/ready

# path tracing
traceroute api.myshop.example.com

# routing of node IP and Pod IP
ip route

Thanks to the VPC CNI model of Chapter 15, CNI in Depth (a structure where the Pod receives the IP of an ENI directly), in EKS the Pod’s IP enters the node’s routing table naturally. So tcpdump can catch Pod traffic even on the node’s host network.

tcpdump on a node via kubectl debug

kubectl debug node/<node-name> -it --image=nicolaka/netshoot
# inside it:
tcpdump -i any -n 'host <pod-ip>'

With this tool, you can trace in one go where on the ALB → Pod path of Chapter 10, Ingress the packet disappears.

Combining with observability: when do you go where

Whether you should go to kubectl describe or a Grafana dashboard first depends on the shape of the incident.

single Pod vs distributed pattern

[A single Pod's problem] — describe / logs / events is faster
  - CrashLoopBackOff, ImagePullBackOff, OOMKilled
  - "only one specific Pod is failing"

[A distributed / statistical problem] — the observability stack is faster
  - "the error rate exceeds 5%"
  - "P95 latency exceeds 1 second"
  - "traffic is down to 30% of the last hour's level"
  - correlation with other workloads at the time of the incident

The stacks of Chapter 19, Observability and Chapter 25, Monitoring · Alerts combine with this chapter’s single-Pod debugging to form the two axes of incident response. Once Grafana’s distributed trace narrows it down to “which handler of which service is slow,” the next kubectl logs then shows you that Pod’s real stderr.

The ArgoCD UI of Chapter 24, The CI / CD Pipeline is also one of the debugging tools. Clicking an Application in the OutOfSync state visually shows which resource differs from the desired state — often faster than describe’s text output.

The standard flow of one incident cycle

We organize the standard flow of how to weave this chapter’s tools into a single incident.

the standard 5 minutes after receiving an alert

1. check the alert body in PagerDuty / Slack (alertname, severity, runbook_url)
2. follow the "first-pass check" section of the runbook
3. confirm the incident's scope with a Grafana dashboard (one Pod / one Service / the whole)
4. if the scope is one Pod, kubectl describe + logs --previous
   if the scope is one Service, endpoints + NetworkPolicy
   if the scope is the whole, node / CNI / cluster components
5. first-pass response (scale up, restart, block traffic)
6. share status in the Slack incident channel + prepare a post-incident RCA

With this flow in one person’s head, first-pass diagnosis is possible within 5 minutes even in a dawn incident. Having the runbook_url of Chapter 25 connect directly to this chapter’s diagnostic trees is the goal of a production cluster.

Exercises

Deliberately apply a wrong image tag (myshop-api:does-not-exist) to the myshop-api Deployment in the dev cluster to create a state where the Pod goes into ImagePullBackOff. Record which reason appears in the Events section of kubectl describe pod, and organize which of the 6 causes in this chapter’s §“The ImagePullBackOff diagnostic tree” it corresponds to. Reproduce the same incident deliberately as an insufficient-permission case too (an IAM Role with no ECR access), and compare the difference in reason between the two.
Deploy a container that gradually leaks memory (e.g., Python’s while True: data.append(...)) with a small limits.memory to create the shape of an OOMKilled. Catch the time with kubectl get events --field-selector reason=OOMKilling, and tune the threshold so the MyshopApiPodMemoryHigh alert of Chapter 25, Monitoring · Alerts fires before the OOMKilled. In one paragraph, organize the operational-value difference between an alert that predicts the incident and an alert that fires after the incident has occurred.
Deliberately change the Service’s selector to a one-letter typo (“app.kubernetes.io/name: myshop-apia”) to create a state where kubectl get endpoints is empty. Follow this chapter’s §“When a Service / Ingress won’t reach” 6-step tree from the top down, record at which step the incident is found, and compare it with the path of finding the same incident via the observability stack (Grafana’s traffic panel + Loki logs). In one paragraph, organize which is faster and why.

In one line: The starting line of production cluster debugging is the three commands describe + events + logs, and kubectl debug’s ephemeral container fills the gaps left by distroless images, dead containers, and node diagnosis. CrashLoopBackOff is the case for logs --previous, OOMKilled is the Chapter 11 limits and metrics, ImagePullBackOff is the 6 reasons in events, Pending is the one line from default-scheduler, and Service is the 3-stage chain of selector → endpoints → port — these are the core diagnostic trees. A single Pod’s problem is faster with kubectl, and a distributed · statistical problem is faster with the observability stack — choosing the right tool for the incident is the operational standard.

Next chapter

In this chapter we organized where to look first when an incident occurs into a single manual. In the next chapter we deal not with incidents but with the bill.

Chapter 28, Cost Optimization covers the cost items we pointed at through five sources in Chapter 26, The Operations Checklist. It covers the two axes of compute and the add-ons, the cost meaning of requests, the right-sizing of VPA / Goldilocks, the decision tree of Karpenter and Cluster Autoscaler, and cost allocation by namespace · label. It’s the stage where the decisions of Chapter 13, Autoscaling and Chapter 11, Resource Requests and Limits carry over into real operational cost.