Certified Kubernetes Administrator (CKA) #23 Troubleshooting 2: Nodes and kubelet (NotReady, disk/memory pressure)

In #22 Troubleshooting 1 we dealt with Pod- and app-level failures. They were problems like Pending, CrashLoopBackOff, ImagePullBackOff, and OOMKilled — cases where the state of a single Pod is enough to narrow down the cause. This post goes one layer down: the situation where the node itself has dropped to NotReady.

A node failure has a different character from a Pod failure. When a single node goes NotReady, all the Pods that were running on it are affected at once, and the cause often isn’t visible from kubectl alone. You have to SSH into the node and inspect the kubelet and the container runtime at the system level. This is a classic area in CKA where you need a feel for Linux operations, and within the Troubleshooting domain it’s a question type that tends to carry significant weight.

What does node NotReady mean #

Each node’s kubelet reports its status to the control plane at a regular interval. While those reports arrive and are healthy, the node is Ready. If the reports stay silent for longer than a set grace period, or the kubelet itself signals a problem, the node is shown as NotReady. The first screen you look at is the node list.

k get nodes
NAME     STATUS     ROLES           AGE   VERSION
master   Ready      control-plane   40d   v1.31.0
node01   NotReady   <none>          40d   v1.31.0
node02   Ready      <none>          40d   v1.31.0

node01 is NotReady. You might want to SSH into the node right away, but the proper order is to first read the information the control plane has received. Even without connecting to the node, describe often reveals half the picture.

Step 1: read conditions with k describe node #

The starting point of node diagnosis is always describe. A node has several conditions, and each condition comes with its status, the time of its last transition, and a human-readable message.

k describe node node01

The first place to look in the output is the Conditions block.

Conditions:
  Type             Status    Reason                       Message
  ----             ------    ------                       -------
  MemoryPressure   False     KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False     KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False     KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            Unknown   NodeStatusUnknown            Kubelet stopped posting node status.

What the node conditions mean #

| Condition | Meaning | When True |
|---|---|---|
| Ready | The node is healthy and can accept Pods | Healthy. False/Unknown means unhealthy |
| MemoryPressure | The node’s available memory is below the threshold | Memory shortage. Eviction may occur |
| DiskPressure | The node’s disk capacity or inodes are below the threshold | Disk shortage. Image and Pod eviction occur |
| PIDPressure | The node’s available PIDs are below the threshold | Process count saturated |

Only Ready runs in the opposite direction from the other conditions. Ready=True is healthy, while the pressure-family conditions are healthy at False. In the example above, Ready is Unknown and the message is Kubelet stopped posting node status., which means the kubelet stopped reporting status. The pressure-family conditions are all False, so it isn’t a resource shortage. In this case the cause is almost certainly in the kubelet itself.
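The reading rule above can be sketched as a small filter. This is a minimal sketch under stated assumptions: triage_conditions is a hypothetical helper name, and it expects input lines of the form "TYPE STATUS", which the jsonpath query in the usage comment should produce. It spells out kubectl because shell aliases such as k are normally not expanded inside scripts.

```shell
# Hypothetical helper: reads "TYPE STATUS" lines on stdin and prints a hint
# for every condition that is not in its healthy state.
triage_conditions() {
  awk '
    $1 == "Ready" && $2 == "Unknown" { print "Ready=Unknown: kubelet reporting is silent, go to the node" }
    $1 == "Ready" && $2 == "False"   { print "Ready=False: kubelet is up but unhealthy, check Reason" }
    $1 != "Ready" && $2 == "True"    { print $1 "=True: resource pressure" }
  '
}

# Usage (run where kubectl is configured):
#   kubectl get node node01 \
#     -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{"\n"}{end}' \
#     | triage_conditions
```

A fully healthy node produces no output, which keeps the filter usable in a loop over several nodes.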

Interpreting each Ready status #

| Ready Status | Interpretation | Next action |
|---|---|---|
| True | Healthy | Not a node problem. Move to the Pod level |
| False | The kubelet is alive but the node reports unhealthy | Check the Reason in the message. Suspect runtime, network, or pressure |
| Unknown | The kubelet’s reporting has gone silent | SSH into the node. Suspect a stopped kubelet, a downed node, or a network break |

Unknown is a signal that the control plane has lost contact with the node. There are three branches: the kubelet died, the node was powered off, or the network between the node and the apiserver is blocked. From here you have to go directly onto the node.

Step 2: SSH into the node and check kubelet status #

In the exam, the hostname of the node to connect to is given in the question. After connecting via SSH, escalate with sudo if you need permissions.

ssh node01
sudo -i

The first target to check on the node is the kubelet service. The kubelet is the agent that starts and manages all Pods on the node, so if it stops, the entire node goes NotReady.

systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; ...)
   Active: inactive (dead) since Sat 2026-06-06 09:12:41 UTC; 3min ago

If it shows Active: inactive (dead), the kubelet is stopped. If it’s active (running) but the node is still NotReady, the kubelet is up but unable to function properly, so you need to look at the logs.

Reading kubelet logs with journalctl #

Because the kubelet runs as a systemd unit, you read its logs with journalctl. Skimming backward from the most recent logs is the fast way.

# Full kubelet log (pager)
journalctl -u kubelet

# Last 100 lines only
journalctl -u kubelet -n 100 --no-pager

# Live tail (good to watch while restarting kubelet)
journalctl -u kubelet -f

Searching the logs for keywords like error, failed, or unable to almost always surfaces the one line that holds the cause. The classics are a config file path error, certificate expiry, and a failure to connect to the runtime socket.
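The keyword search can be wrapped in a one-line filter. This is only a sketch: kubelet_error_lines is a hypothetical name, and the keyword list is the one suggested above, not an exhaustive set.

```shell
# Hypothetical helper: keep only log lines that match the usual failure keywords
# (case-insensitive).
kubelet_error_lines() {
  grep -iE 'error|failed|unable to'
}

# Usage on the node:
#   journalctl -u kubelet -n 200 --no-pager | kubelet_error_lines | tail -n 20
```

Piping through tail keeps the most recent matches on screen, which is usually where the line that caused the current failure sits.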

Common causes and fixes #

Node NotReady has five broad categories of cause. Let’s lay out the symptom and fix for each one.

Cause 1: the kubelet stopped #

This is the simplest case and one that shows up often in the exam. If systemctl status kubelet is inactive (dead), bring the kubelet back up.

systemctl start kubelet

# Also enable auto-start on boot
systemctl enable kubelet

# Re-check status
systemctl status kubelet

If it dies again right after you start it, the problem isn’t a simple stop but a configuration error. Check why it dies with journalctl -u kubelet -n 50 --no-pager and move on to the next item.
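A kubelet that crashes on startup can briefly look active between automatic restarts, so a single status check may mislead. The sketch below samples the unit a few times. unit_stays_active is a hypothetical helper, and the check count and interval are arbitrary example defaults.

```shell
# Hypothetical helper: succeed only if the unit is active on every one of
# several consecutive checks.
#   $1 = unit name, $2 = number of checks (default 5), $3 = seconds between checks (default 2)
unit_stays_active() {
  unit=$1
  checks=${2:-5}
  interval=${3:-2}
  i=0
  while [ "$i" -lt "$checks" ]; do
    # is-active --quiet prints nothing and exits non-zero when the unit is not active
    systemctl is-active --quiet "$unit" || return 1
    sleep "$interval"
    i=$((i + 1))
  done
  return 0
}

# Usage on the node:
#   systemctl start kubelet
#   unit_stays_active kubelet || journalctl -u kubelet -n 50 --no-pager
```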

Cause 2: bad kubelet configuration #

The kubelet reads several config files when it starts. In the exam, one of these paths is sometimes deliberately broken.

| Path | Role |
|---|---|
| /var/lib/kubelet/config.yaml | The kubelet’s main config. cgroup driver, eviction thresholds, etc. |
| /etc/kubernetes/kubelet.conf | The kubeconfig for connecting to the apiserver |
| /etc/systemd/system/kubelet.service.d/10-kubeadm.conf | The arguments systemd uses to start the kubelet |
| /var/lib/kubelet/kubeadm-flags.env | Runtime arguments kubeadm adds (e.g. the runtime socket path) |

If you see a line in the logs like failed to load Kubelet config file or unable to load client CA file, inspect the paths and contents of the files above. One common trap is a wrong apiserver port or server address in the kubeconfig.

# Check the server the kubelet kubeconfig points to
cat /etc/kubernetes/kubelet.conf | grep server
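To make a wrong port stand out, the server line can be split into URL and port. This is a sketch with hypothetical helper names (kubeconfig_server, url_port). It assumes the URL carries an explicit port and no trailing path. kubeadm clusters use 6443 for the apiserver by default, so any other value deserves a second look.

```shell
# Hypothetical helper: print the first "server:" URL found in a kubeconfig file.
kubeconfig_server() {
  awk '$1 == "server:" { print $2; exit }' "$1"
}

# Hypothetical helper: print the text after the last ":" of a URL.
# Assumes the form https://host:port with an explicit port.
url_port() {
  printf '%s\n' "${1##*:}"
}

# Usage on the node:
#   server=$(kubeconfig_server /etc/kubernetes/kubelet.conf)
#   echo "kubelet talks to $server (port $(url_port "$server"))"
```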

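After fixing a config file, restart the kubelet. If what you changed was the systemd unit or one of its drop-ins (such as 10-kubeadm.conf), systemd has to re-read it with daemon-reload first. Running daemon-reload when it wasn’t strictly needed is harmless, so doing both every time is a safe habit.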

systemctl daemon-reload
systemctl restart kubelet

Cause 3: certificate problems #

The kubelet uses a client certificate when it talks to the apiserver. If that certificate expires, the kubelet — even while up — can’t report its status to the apiserver, so the node goes NotReady. If you see a line in the logs like x509: certificate has expired or is not yet valid, it’s a certificate expiry.

journalctl -u kubelet -n 50 --no-pager | grep -i x509
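The expiry can also be confirmed on the certificate file itself rather than inferred from the logs. This is a minimal sketch: cert_still_valid is a hypothetical helper, and the path in the usage comment is the location a default kubeadm install is assumed to use for the kubelet client certificate, so verify it on your node.

```shell
# Hypothetical helper: succeed if the certificate in $1 has not expired yet.
# A non-zero result means "expired" or "file could not be read or parsed".
cert_still_valid() {
  # -checkend 0 asks: will the certificate have expired 0 seconds from now?
  openssl x509 -checkend 0 -noout -in "$1" >/dev/null 2>&1
}

# Usage on the node (path is an assumption, see the note above):
#   cert=/var/lib/kubelet/pki/kubelet-client-current.pem
#   openssl x509 -in "$cert" -noout -enddate
#   cert_still_valid "$cert" || echo "kubelet client certificate is expired or unreadable"
```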

Certificate expiry has enough to cover on its own, so in #25 Troubleshooting 4 we’ll separately lay out automatic and manual kubelet certificate renewal. For this post, it’s enough to remember that certificate expiry is one of the candidate causes of NotReady.

Cause 4: the container runtime stopped #

The kubelet doesn’t start containers itself — it delegates through the CRI to the container runtime (usually containerd). If the runtime dies, the kubelet can’t start Pods and the node goes NotReady. If you see failed to get container runtime or connection refused in the logs, check the runtime.

systemctl status containerd

# If it's stopped, bring it back up
systemctl start containerd
systemctl enable containerd

# Check that the runtime responds
crictl info
crictl ps

After bringing the runtime back up, restart the kubelet as well. If it’s a case of a wrong runtime socket path, check the --container-runtime-endpoint value in /var/lib/kubelet/kubeadm-flags.env.

systemctl restart kubelet
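The socket-path check can be sketched as two small helpers. The names runtime_endpoint and endpoint_socket_exists are hypothetical. The sketch assumes the flags file holds a single line of the form KUBELET_KUBEADM_ARGS="--container-runtime-endpoint=unix://... ...".

```shell
# Hypothetical helper: print the --container-runtime-endpoint value from a flags file.
runtime_endpoint() {
  grep -o -- '--container-runtime-endpoint=[^" ]*' "$1" | cut -d= -f2-
}

# Hypothetical helper: succeed if a unix:// endpoint points at an existing socket.
endpoint_socket_exists() {
  [ -S "${1#unix://}" ]
}

# Usage on the node:
#   ep=$(runtime_endpoint /var/lib/kubelet/kubeadm-flags.env)
#   endpoint_socket_exists "$ep" || echo "no socket at $ep"
```

If the socket is missing while containerd is running, the path in the flags file is the likely culprit. If the socket exists but the kubelet still cannot connect, look at the runtime itself.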

Cause 5: full disk and memory pressure #

Resource exhaustion shows up directly in the conditions, so you already get a clue from describe node. DiskPressure=True means the node’s available disk space or free inodes have dropped below the eviction threshold, and the kubelet evicts Pods to protect the node.

# Check disk usage
df -h

# Also check inode exhaustion (capacity left but inodes full)
df -i

If the disk is full, the safest reclaim target is unused container images.

# Prune unused images
crictl rmi --prune

# Clean up exited containers
crictl rm $(crictl ps -a -q --state Exited)

Bloated log files are also common, so check large files under /var/log at the same time. Once you free up disk, the kubelet reports DiskPressure=False again and the node returns to Ready.
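The two df checks can be reduced to "which mounts are nearly full". This is a sketch: mounts_at_or_above is a hypothetical helper, and 90 is an example cutoff. The kubelet’s documented default hard eviction thresholds on Linux are nodefs.available<10% and nodefs.inodesFree<5%, but they can be overridden in /var/lib/kubelet/config.yaml, so treat the number as a starting point.

```shell
# Hypothetical helper: read `df -P` (or `df -Pi`) output on stdin and print
# "mountpoint use%" for every mount at or above the given percentage.
# Assumes device and mount names contain no spaces.
mounts_at_or_above() {
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit + 0) print $6, $5
    }
  '
}

# Usage on the node:
#   df -P  | mounts_at_or_above 90                          # space
#   df -Pi | mounts_at_or_above 90                          # inodes
#   du -xh /var/log 2>/dev/null | sort -rh | head -n 10     # largest log directories
```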

MemoryPressure=True means the node’s available memory has dropped below the eviction threshold.

# Memory usage
free -h

# Processes using the most memory
top -o %MEM

If a particular Pod is using node memory excessively, the fundamental fix is to adjust that workload’s requests/limits (see #15 Resource Management) or spread it across other nodes. In the exam the cause is usually designed to be a single clear one, so focus on reclaiming the resource the condition points to.
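For a quick number to compare against the eviction threshold, MemAvailable can be read straight from /proc/meminfo. This is only an approximation: the kubelet derives memory.available from cgroup statistics rather than from this field, and its documented default hard threshold is memory.available<100Mi unless overridden. mem_available_mib is a hypothetical helper name.

```shell
# Hypothetical helper: print MemAvailable in MiB from a meminfo-format file
# (defaults to /proc/meminfo).
mem_available_mib() {
  awk '$1 == "MemAvailable:" { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Usage on the node:
#   echo "available: $(mem_available_mib) MiB"
#   ps -eo pid,rss,comm --sort=-rss | head -n 10   # largest resident processes
```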

Diagnosis table by symptom #

Let’s pull the whole flow so far into one view. When you hit NotReady, narrow it down by following this table.

| Symptom / clue | Suspected cause | Check command | Fix |
|---|---|---|---|
| Ready=Unknown, “Kubelet stopped posting” | kubelet stopped | systemctl status kubelet | systemctl start/restart kubelet |
| kubelet starts then dies right away | config file error | journalctl -u kubelet | Fix /var/lib/kubelet, kubelet.conf then daemon-reload |
| x509 ... expired in the logs | certificate expiry | journalctl -u kubelet \| grep x509 | Renew the certificate (#25) |
| runtime ... connection refused in the logs | runtime stopped | systemctl status containerd | systemctl start containerd then restart kubelet |
| DiskPressure=True | full disk | df -h, df -i | crictl rmi --prune, clean logs |
| MemoryPressure=True | memory pressure | free -h, top | Spread workloads, adjust requests/limits |
| Ready=False, network-family Reason | CNI plugin problem | NetworkPluginNotReady in kubelet logs | Check CNI Pod status (#20) |

Isolating a problem node: cordon and drain #

While you’re fixing a node, there are times when you need to stop new Pods from being scheduled onto it, or safely empty the Pods running on it. The commands for this are cordon and drain.

# Only block new Pod scheduling (existing Pods stay)
k cordon node01

# Also move existing Pods to other nodes (includes cordon)
k drain node01 --ignore-daemonsets --delete-emptydir-data

# Allow scheduling again once the work is done
k uncordon node01

drain evicts the Pods on a node so that their controllers recreate them on other nodes, which lets you inspect or upgrade the node safely. DaemonSet Pods are managed per node and can’t be moved elsewhere, so tell drain to skip them with --ignore-daemonsets. If any Pods use emptyDir volumes, drain refuses to proceed until you add --delete-emptydir-data, which acknowledges that their data will be lost. Running drain automatically puts the node into a cordoned state.

A cordoned node shows as SchedulingDisabled in k get nodes.

NAME     STATUS                     ROLES    AGE   VERSION
node01   Ready,SchedulingDisabled   <none>   40d   v1.31.0

Once recovery is done, you must reopen scheduling with uncordon. Forget this step and the node will be Ready but receive no new Pods — which is easy to mistake later for yet another problem.
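A final check for forgotten cordons can be sketched as a filter over the node list. cordoned_nodes is a hypothetical helper name, and it relies on the STATUS column format shown above.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the names of nodes whose STATUS includes SchedulingDisabled.
cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage:
#   kubectl get nodes --no-headers | cordoned_nodes
```

Empty output means every node is accepting new Pods again.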

Taints on a NotReady node #

When a node goes NotReady, the control plane automatically attaches a taint to block Pod scheduling. Looking at which taint is attached lets you read the node state quickly.

k describe node node01 | grep -i taint
Taints:  node.kubernetes.io/not-ready:NoSchedule

The main node taints are as follows. Kubernetes assigns these automatically based on node state.

| Taint | Assignment condition |
|---|---|
| node.kubernetes.io/not-ready | The node’s Ready=False |
| node.kubernetes.io/unreachable | The node’s Ready=Unknown (contact lost) |
| node.kubernetes.io/disk-pressure | DiskPressure=True |
| node.kubernetes.io/memory-pressure | MemoryPressure=True |
| node.kubernetes.io/unschedulable | A cordoned node |

These taints are removed automatically when the node returns to healthy. That is, once you fix the kubelet and the node becomes Ready again, the not-ready taint disappears too and scheduling normalizes. The right answer is to fix the root cause, not to peel the taint off by hand. The behavior of taints and tolerations was covered in detail in #14 Scheduling 2.
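To confirm that the taints really clear on their own, you can wait for the Ready condition and then list whatever taints remain. This is a sketch with hypothetical helper names (wait_node_ready, node_taints). The 120s default timeout is an arbitrary example value.

```shell
# Hypothetical helper: block until the node reports Ready or the timeout passes.
#   $1 = node name, $2 = timeout (default 120s)
wait_node_ready() {
  kubectl wait --for=condition=Ready "node/$1" --timeout="${2:-120s}"
}

# Hypothetical helper: print each taint on a node as key:effect, one per line.
node_taints() {
  kubectl get node "$1" \
    -o jsonpath='{range .spec.taints[*]}{.key}{":"}{.effect}{"\n"}{end}'
}

# Usage after fixing the kubelet:
#   wait_node_ready node01 && node_taints node01
```

If the cause is fixed and the node is uncordoned, node_taints should print nothing for an ordinary worker node.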

The key is direction: start at the kubectl level (describe) and work down to the node level (systemctl/journalctl). Find the target with k get nodes, read the conditions with k describe node to set your direction, SSH into the node and fix the kubelet and runtime, and if you changed a config file, finish with systemctl daemon-reload then restart kubelet. This order applies to nearly every node failure.

Exam points #

  • Read the conditions from describe node first. Ready=Unknown means kubelet reporting has gone silent; a pressure-family True means resource exhaustion. Setting your direction before SSHing into the node saves time.
  • The kubelet is a systemd unit. systemctl status kubelet and journalctl -u kubelet have to be second nature. A stopped kubelet can be brought up in one shot with systemctl enable --now kubelet.
  • Don’t forget daemon-reload after fixing config. If you change a systemd unit file or its arguments and restart without daemon-reload, the change won’t take effect.
  • The runtime is a candidate too. Quickly rule out a stopped runtime with systemctl status containerd and crictl info.
  • Check disk with both df -h and df -i. It’s easy to miss the case where capacity is left but inodes fill up and trigger DiskPressure.
  • Don’t peel taints off by hand. Fix the cause and they disappear automatically. If you force-remove a taint while the root cause remains, the node stays NotReady and the control plane simply re-applies the taint.

Wrap-up #

What this post locked in:

  • Diagnosing node NotReady starts at describe and goes down to the node system level. The conditions (Ready/MemoryPressure/DiskPressure/PIDPressure) set your direction
  • The kubelet is the node agent. Use systemctl status kubelet and journalctl -u kubelet to single out a stop, config error, certificate, or runtime problem
  • The five common causes. A stopped kubelet, config error (files under /var/lib/kubelet/ and /etc/kubernetes/kubelet.conf), certificate expiry, a stopped runtime (containerd), and full disk and memory pressure
  • Isolate with cordon and drain. Safely empty workloads before inspection, and reopen scheduling with uncordon after recovery
  • The NotReady taint is automatic. node.kubernetes.io/not-ready is removed automatically once you fix the cause

Next — Troubleshooting 3 #

The node is fixed. Now we climb to a more dangerous layer: the situation where the control plane itself is down.

In #24 Troubleshooting 3: Control plane (apiserver/etcd/scheduler down), etcd recovery, we’ll cover how to directly inspect the static Pod manifests (/etc/kubernetes/manifests) and the container runtime when kube-apiserver doesn’t respond, what happens to the cluster when the scheduler and controller-manager die, and the procedure for recovering etcd from a snapshot when it breaks.
