Certified Kubernetes Administrator (CKA) #23 Troubleshooting 2: Nodes and kubelet (NotReady, disk/memory pressure)
In #22 Troubleshooting 1 we dealt with Pod- and app-level failures. They were problems like Pending, CrashLoopBackOff, ImagePullBackOff, and OOMKilled — cases where the state of a single Pod is enough to narrow down the cause. This post goes one layer down: the situation where the node itself has dropped to NotReady.
A node failure has a different character from a Pod failure. When a single node goes NotReady, all the Pods that were running on it are affected at once, and the cause often isn’t visible from kubectl alone. You have to SSH into the node and inspect the kubelet and the container runtime at the system level. This is a classic area of the CKA where you need a feel for Linux operations, and within the Troubleshooting domain it’s a question type that tends to carry significant weight.
What does node NotReady mean #
Each node’s kubelet reports its status to the control plane on a regular interval. If that report is healthy the node is Ready; if the reports stay silent for longer than a set grace period or the kubelet signals a problem, the node is shown as NotReady. The first screen you look at is the node list.
k get nodes
NAME STATUS ROLES AGE VERSION
master Ready control-plane 40d v1.31.0
node01 NotReady <none> 40d v1.31.0
node02 Ready <none> 40d v1.31.0
node01 is NotReady. You might want to SSH into the node right away, but the proper order is to first read the information the control plane has received. Even without connecting to the node, describe often reveals half the picture.
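On a cluster with more than a handful of nodes, it can help to filter the list down to the problem nodes. The following is a minimal sketch, assuming the default column layout of the node list (NAME first, STATUS second). The name not_ready_nodes is a hypothetical helper, not a kubectl feature.

```shell
# Hypothetical helper: print the names of nodes whose STATUS does not start with "Ready".
# Reads `kubectl get nodes` output on stdin; assumes NAME is column 1 and STATUS is column 2.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage on a live cluster:
#   kubectl get nodes | not_ready_nodes
```

A cordoned but healthy node (Ready,SchedulingDisabled) is deliberately not reported, because its STATUS still starts with Ready.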
Step 1: read conditions with k describe node #
The starting point of node diagnosis is always describe. A node has several conditions, and each condition comes with its status, the time of its last transition, and a human-readable message.
k describe node node01
The first place to look in the output is the Conditions block.
Conditions:
Type Status Reason Message
---- ------ ------ -------
MemoryPressure False KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False KubeletHasSufficientPID kubelet has sufficient PID available
Ready Unknown NodeStatusUnknown Kubelet stopped posting node status.
What the node conditions mean #
| Condition | Meaning | When True |
|---|---|---|
| Ready | The node is healthy and can accept Pods | Healthy. False/Unknown means unhealthy |
| MemoryPressure | The node’s available memory is below the threshold | Memory shortage. Eviction may occur |
| DiskPressure | The node’s disk capacity or inodes are below the threshold | Disk shortage. Image and Pod eviction occur |
| PIDPressure | The node’s available PIDs are below the threshold | Process count saturated |
Only Ready runs in the opposite direction from the other conditions. Ready=True is healthy, while the pressure-family conditions are healthy at False. In the example above, Ready is Unknown and the message is Kubelet stopped posting node status., which means the kubelet stopped reporting status. The pressure-family conditions are all False, so it isn’t a resource shortage. In this case the cause is almost certainly in the kubelet itself.
Interpreting each Ready status #
| Ready status | Interpretation | Next action |
|---|---|---|
| True | Healthy | Not a node problem. Move to the Pod level |
| False | The kubelet is alive but the node reports unhealthy | Check the Reason in the message. Suspect runtime, network, or pressure |
| Unknown | The kubelet’s reporting has gone silent | SSH into the node. Suspect a stopped kubelet, a downed node, or a network break |
Unknown is a signal that the control plane has lost contact with the node. There are three branches: the kubelet died, the node was powered off, or the network between the node and the apiserver is blocked. From here you have to go directly onto the node.
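The reading rules for the conditions can be sketched as a small filter over the describe output. This is only an illustration: triage_conditions is a hypothetical name, and it assumes the condition Type and Status are the first two fields of each row, as in the Conditions block shown earlier.

```shell
# Hypothetical helper: turn the Conditions rows of `kubectl describe node` (stdin)
# into a suggested direction. Assumes Type is field 1 and Status is field 2.
triage_conditions() {
  awk '
    $1 ~ /^(MemoryPressure|DiskPressure|PIDPressure)$/ && $2 == "True" {
      print $1 ": resource exhaustion, reclaim it on the node"
    }
    $1 == "Ready" && $2 == "True"    { print "Ready=True: not a node problem, move to the Pod level" }
    $1 == "Ready" && $2 == "False"   { print "Ready=False: kubelet is alive, read the Reason and Message" }
    $1 == "Ready" && $2 == "Unknown" { print "Ready=Unknown: reporting went silent, SSH in and check the kubelet" }
  '
}

# Usage on a live cluster:
#   kubectl describe node node01 | triage_conditions
```

Pressure conditions that are False print nothing, so the output contains only the rows that call for action.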
Step 2: SSH into the node and check kubelet status #
In the exam, the hostname of the node to connect to is given in the question. After connecting via SSH, escalate with sudo if you need permissions.
ssh node01
sudo -i
The first target to check on the node is the kubelet service. The kubelet is the agent that starts and manages all Pods on the node, so if it stops, the entire node goes NotReady.
systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; ...)
Active: inactive (dead) since Sat 2026-06-06 09:12:41 UTC; 3min ago
If it shows Active: inactive (dead), the kubelet is stopped. If it’s active (running) but the node is still NotReady, the kubelet is up but unable to function properly, so you need to look at the logs.
Reading kubelet logs with journalctl #
Because the kubelet runs as a systemd unit, you read its logs with journalctl. Skimming backward from the most recent logs is the fast way.
# Full kubelet log (pager)
journalctl -u kubelet
# Last 100 lines only
journalctl -u kubelet -n 100 --no-pager
# Live tail (good to watch while restarting kubelet)
journalctl -u kubelet -f
Searching the logs for keywords like error, failed, or unable to almost always surfaces the one line that holds the cause. The classics are a config file path error, certificate expiry, and a failure to connect to the runtime socket.
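The keyword search can be wrapped in a one-line filter. This is a rough sketch: kubelet_log_suspects is a hypothetical name, and the default of 20 lines is an arbitrary choice.

```shell
# Hypothetical helper: keep only log lines that mention error, failed, or "unable to"
# (case-insensitive) and show the most recent N of them (default 20, an arbitrary choice).
kubelet_log_suspects() {
  grep -iE 'error|failed|unable to' | tail -n "${1:-20}"
}

# Usage on the node:
#   journalctl -u kubelet -n 200 --no-pager | kubelet_log_suspects 10
```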
Common causes and fixes #
Node NotReady has five broad branches of cause. Let’s lay out the symptom and fix for each one.
Cause 1: the kubelet stopped #
This is the simplest case and one that shows up often in the exam. If systemctl status kubelet is inactive (dead), bring the kubelet back up.
systemctl start kubelet
# Also enable auto-start on boot
systemctl enable kubelet
# Re-check status
systemctl status kubelet
If it dies again right after you start it, the problem isn’t a simple stop but a configuration error. Check why it dies with journalctl -u kubelet -n 50 --no-pager and move on to the next item.
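The "start it, then see whether it stays up" check can be sketched as a single function. The name start_and_verify_kubelet and the 5-second default wait are assumptions made for illustration; a kubelet with a broken config may take a little longer to exit, so treat the wait as tunable.

```shell
# Hypothetical helper: start the kubelet, wait, and check whether it stayed up.
# $1 = seconds to wait before re-checking (5 is an arbitrary default).
start_and_verify_kubelet() {
  systemctl start kubelet
  sleep "${1:-5}"
  if systemctl is-active --quiet kubelet; then
    echo "kubelet is running"
  else
    # Not a simple stop: print the recent log lines to find the config error.
    echo "kubelet exited again; last log lines:"
    journalctl -u kubelet -n 50 --no-pager
    return 1
  fi
}

# Usage on the node (as root):
#   start_and_verify_kubelet 5
```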
Cause 2: bad kubelet configuration #
The kubelet reads several config files when it starts. In the exam, one of these paths is sometimes deliberately broken.
| Path | Role |
|---|---|
| /var/lib/kubelet/config.yaml | The kubelet’s main config. cgroup driver, eviction thresholds, etc. |
| /etc/kubernetes/kubelet.conf | The kubeconfig for connecting to the apiserver |
| /etc/systemd/system/kubelet.service.d/10-kubeadm.conf | The arguments systemd uses to start the kubelet |
| /var/lib/kubelet/kubeadm-flags.env | Runtime arguments kubeadm adds (e.g. the runtime socket path) |
If you see a line in the logs like failed to load Kubelet config file or unable to load client CA file, inspect the paths and contents of the files above. One common trap is a wrong apiserver port or server address in the kubeconfig.
# Check the server the kubelet kubeconfig points to
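grep server /etc/kubernetes/kubelet.conf
After fixing a config file, restart the kubelet. If you changed a systemd unit file or drop-in such as 10-kubeadm.conf, have systemd re-read it with daemon-reload first; running it when nothing changed is harmless.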
systemctl daemon-reload
systemctl restart kubelet
Cause 3: certificate problems #
The kubelet uses a client certificate when it talks to the apiserver. If that certificate expires, the kubelet — even while up — can’t report its status to the apiserver, so the node goes NotReady. If you see a line in the logs like x509: certificate has expired or is not yet valid, it’s a certificate expiry.
journalctl -u kubelet -n 50 --no-pager | grep -i x509
Certificate expiry has enough to cover on its own, so in #25 Troubleshooting 4 we’ll separately lay out automatic and manual kubelet certificate renewal. For this post, it’s enough to remember that certificate expiry is one of the candidate causes of NotReady.
Cause 4: the container runtime stopped #
The kubelet doesn’t start containers itself — it delegates through the CRI to the container runtime (usually containerd). If the runtime dies, the kubelet can’t start Pods and the node goes NotReady. If you see failed to get container runtime or connection refused in the logs, check the runtime.
systemctl status containerd
# If it's stopped, bring it back up
systemctl start containerd
systemctl enable containerd
# Check that the runtime responds
crictl info
crictl ps
After bringing the runtime back up, restart the kubelet as well. If it’s a case of a wrong runtime socket path, check the --container-runtime-endpoint value in /var/lib/kubelet/kubeadm-flags.env.
systemctl restart kubelet
Cause 5: full disk and memory pressure #
Resource exhaustion shows up directly in the conditions, so you already get a clue from describe node. If DiskPressure=True, available disk space or inodes have dropped below the eviction threshold, and to protect the node the kubelet evicts Pods.
# Check disk usage
df -h
# Also check inode exhaustion (capacity left but inodes full)
df -i
If the disk is full, the safest reclaim target is unused container images.
# Prune unused images
crictl rmi --prune
# Clean up exited containers
crictl rm $(crictl ps -a -q --state Exited)
Bloated log files are also common, so check large files under /var/log at the same time. Once you free up disk, the kubelet reports DiskPressure=False again and the node returns to Ready.
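To confirm which filesystem is actually the full one, the df output can be filtered by usage. This is a sketch under assumptions: fs_over_threshold is a hypothetical name, and 85 percent is an arbitrary warning level. The kubelet's real eviction thresholds are whatever /var/lib/kubelet/config.yaml (or its defaults) specify.

```shell
# Hypothetical helper: list mount points whose usage is at or above a percentage.
# Reads `df -P` (space) or `df -Pi` (inodes) output on stdin; $1 = threshold in percent.
# 85 is an arbitrary warning level, not the kubelet's eviction threshold.
fs_over_threshold() {
  awk -v limit="${1:-85}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)              # "95%" -> "95"
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Usage on the node:
#   df -P  | fs_over_threshold 85    # disk space
#   df -Pi | fs_over_threshold 85    # inodes
```

Running it against both forms of df covers the case where space is left but inodes are exhausted.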
MemoryPressure=True means the node’s available memory has dropped below the eviction threshold.
# Memory usage
free -h
# Processes using the most memory
top -o %MEM
If a particular Pod is using node memory excessively, the fundamental fix is to adjust that workload’s requests/limits (see #15 Resource Management) or spread it across other nodes. In the exam the cause is usually designed to be a single clear one, so focus on reclaiming the resource the condition points to.
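For a single number to compare before and after a cleanup, the kernel's own estimate of available memory can be read from /proc/meminfo. This is a sketch only: mem_available_mib is a hypothetical name, and the value is the kernel's MemAvailable estimate, which is not necessarily the same figure the kubelet computes for its eviction decision.

```shell
# Hypothetical helper: print MemAvailable in MiB from /proc/meminfo-style input on stdin.
# This is the kernel's estimate, not necessarily the figure the kubelet uses for eviction.
mem_available_mib() {
  awk '$1 == "MemAvailable:" { printf "%d\n", $2 / 1024 }'
}

# Usage on the node:
#   mem_available_mib < /proc/meminfo
```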
Diagnosis table by symptom #
Let’s pull the whole flow so far into one view. When you hit NotReady, narrow it down by following this table.
| Symptom / clue | Suspected cause | Check command | Fix |
|---|---|---|---|
| Ready=Unknown, “Kubelet stopped posting” | kubelet stopped | systemctl status kubelet | systemctl start/restart kubelet |
| kubelet starts then dies right away | config file error | journalctl -u kubelet | Fix /var/lib/kubelet/config.yaml or /etc/kubernetes/kubelet.conf, then daemon-reload and restart kubelet |
| x509 ... expired in the logs | certificate expiry | journalctl -u kubelet \| grep x509 | Renew the certificate (#25) |
| runtime ... connection refused in the logs | runtime stopped | systemctl status containerd | systemctl start containerd then restart kubelet |
| DiskPressure=True | full disk | df -h, df -i | crictl rmi --prune, clean logs |
| MemoryPressure=True | memory pressure | free -h, top | Spread workloads, adjust requests/limits |
| Ready=False, network-family Reason | CNI plugin problem | NetworkPluginNotReady in kubelet logs | Check CNI Pod status (#20) |
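The log-based rows of the table can be sketched as a small classifier. The patterns are only the example messages quoted in this post, so real kubelet wording may differ by version, and classify_kubelet_log is a hypothetical name.

```shell
# Hypothetical helper: map kubelet log lines (stdin) to the suspected causes in the table.
# The patterns are the example messages quoted in this post; real wording varies by version.
classify_kubelet_log() {
  awk '
    # Print each cause once, in order of first appearance.
    function note(cause) { if (!(cause in seen)) { seen[cause] = 1; print cause } }
    /failed to load Kubelet config file|unable to load client CA file/ { note("config file error") }
    /x509: certificate has expired/                                     { note("certificate expiry") }
    /failed to get container runtime/                                   { note("runtime stopped") }
    /NetworkPluginNotReady/                                             { note("CNI plugin problem") }
  '
}

# Usage on the node:
#   journalctl -u kubelet -n 200 --no-pager | classify_kubelet_log
```

An empty result means none of the known messages matched, in which case the conditions from describe node (pressure or Unknown) are the better lead.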
Isolating a problem node: cordon and drain #
While you’re fixing a node, there are times when you need to stop new Pods from being scheduled onto it, or safely empty the Pods running on it. The commands for this are cordon and drain.
# Only block new Pod scheduling (existing Pods stay)
k cordon node01
# Also move existing Pods to other nodes (includes cordon)
k drain node01 --ignore-daemonsets --delete-emptydir-data
# Allow scheduling again once the work is done
k uncordon node01
drain safely relocates the workloads on a node to other nodes before you inspect or upgrade it. DaemonSet Pods must run on every node, so skip them with --ignore-daemonsets. If there are Pods using emptyDir volumes, you must add --delete-emptydir-data, acknowledging the data loss, for the drain to proceed. Running drain automatically puts the node into a cordoned state.
A cordoned node shows as SchedulingDisabled in k get nodes.
NAME STATUS ROLES AGE VERSION
node01 Ready,SchedulingDisabled <none> 40d v1.31.0
Once recovery is done, you must reopen scheduling with uncordon. Forget this step and the node will be Ready but receive no new Pods, which is easy to mistake later for yet another problem.
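One way to make the uncordon step hard to forget is to wrap the maintenance work so that uncordon always runs afterwards. This is a sketch, not a standard tool: with_node_drained is a hypothetical wrapper, and it uses the same drain flags as the commands in this section.

```shell
# Hypothetical wrapper: drain a node, run a maintenance command, then always uncordon.
# $1 = node name; remaining arguments = the command to run while the node is drained.
with_node_drained() {
  wnd_node=$1
  shift
  kubectl drain "$wnd_node" --ignore-daemonsets --delete-emptydir-data || return 1
  "$@"
  wnd_status=$?                 # remember the maintenance result
  kubectl uncordon "$wnd_node"  # reopen scheduling even if the command failed
  return "$wnd_status"
}

# Usage from the machine where kubectl is configured:
#   with_node_drained node01 ssh node01 sudo systemctl restart kubelet
```

The function returns the maintenance command's exit status, so a failed fix is still visible even though scheduling was reopened.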
Taints on a NotReady node #
When a node goes NotReady, the control plane automatically attaches a taint to block Pod scheduling. Looking at which taint is attached lets you read the node state quickly.
k describe node node01 | grep -i taint
Taints: node.kubernetes.io/not-ready:NoSchedule
The main node taints are as follows. Kubernetes assigns these automatically based on node state.
| Taint | Assignment condition |
|---|---|
| node.kubernetes.io/not-ready | The node’s Ready=False |
| node.kubernetes.io/unreachable | The node’s Ready=Unknown (contact lost) |
| node.kubernetes.io/disk-pressure | DiskPressure=True |
| node.kubernetes.io/memory-pressure | MemoryPressure=True |
| node.kubernetes.io/unschedulable | A cordoned node |
These taints are removed automatically when the node returns to healthy. That is, once you fix the kubelet and the node becomes Ready again, the not-ready taint disappears too and scheduling normalizes. The right answer is to fix the root cause, not to peel the taint off by hand. The behavior of taints and tolerations was covered in detail in #14 Scheduling 2.
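The taint table can be read as a simple lookup from taint key to node state. The sketch below encodes only the five taints listed in this post; explain_taint is a hypothetical helper name.

```shell
# Hypothetical helper: translate a node taint into the node state that causes it.
# Accepts "key", "key:Effect", or "key=value:Effect"; covers only the taints in the table.
explain_taint() {
  et_key=${1%%:*}       # drop the ":Effect" suffix
  et_key=${et_key%%=*}  # drop an optional "=value"
  case "$et_key" in
    node.kubernetes.io/not-ready)       echo "Ready=False" ;;
    node.kubernetes.io/unreachable)     echo "Ready=Unknown (contact lost)" ;;
    node.kubernetes.io/disk-pressure)   echo "DiskPressure=True" ;;
    node.kubernetes.io/memory-pressure) echo "MemoryPressure=True" ;;
    node.kubernetes.io/unschedulable)   echo "cordoned node" ;;
    *)                                  echo "not one of the taints listed here" ;;
  esac
}

# Usage on a live cluster:
#   for t in $(kubectl get node node01 -o jsonpath='{.spec.taints[*].key}'); do
#     explain_taint "$t"
#   done
```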
The key is direction: start at the kubectl level (describe) and work down to the node level (systemctl/journalctl). Find the target with k get nodes, read the conditions with k describe node to set your direction, SSH into the node and fix the kubelet and runtime, and if you changed a config file, finish with systemctl daemon-reload then restart kubelet. This order applies to nearly every node failure.
Exam points #
- Read the conditions from describe node first. Ready=Unknown means kubelet reporting has gone silent; a pressure-family True means resource exhaustion. Setting your direction before SSHing into the node saves time.
- The kubelet is a systemd unit. systemctl status kubelet and journalctl -u kubelet have to be second nature. A stopped kubelet can be brought up in one shot with systemctl enable --now kubelet.
- Don’t forget daemon-reload after fixing config. If you change a systemd unit file or its arguments and restart without daemon-reload, the change won’t take effect.
- The runtime is a candidate too. Quickly rule out a stopped runtime with systemctl status containerd and crictl info.
- Check disk with both df -h and df -i. It’s easy to miss the case where capacity is left but inodes fill up and trigger DiskPressure.
- Don’t peel taints off by hand. Fix the cause and they disappear automatically. Force-removing a taint while the root cause remains doesn’t fix the node; the control plane just attaches it again.
Wrap-up #
What this post locked in:
- Diagnosing node NotReady starts at describe and goes down to the node system level. The conditions (Ready/MemoryPressure/DiskPressure/PIDPressure) set your direction
- The kubelet is the node agent. Use systemctl status kubelet and journalctl -u kubelet to single out a stop, config error, certificate, or runtime problem
- The five common causes. A stopped kubelet, config error (/var/lib/kubelet and /etc/kubernetes/kubelet.conf), certificate expiry, a stopped runtime (containerd), and full disk and memory pressure
- Isolate with cordon and drain. Safely empty workloads before inspection, and reopen scheduling with uncordon after recovery
- The NotReady taint is automatic. node.kubernetes.io/not-ready is removed automatically once you fix the cause
Next — Troubleshooting 3 #
The node is fixed. Now we climb to a more dangerous layer: the situation where the control plane itself is down.
In #24 Troubleshooting 3: Control plane (apiserver/etcd/scheduler down), etcd recovery, we’ll cover how to directly inspect the static Pod manifests (/etc/kubernetes/manifests) and the container runtime when kube-apiserver doesn’t respond, what happens to the cluster when the scheduler and controller-manager die, and the procedure for recovering etcd from a snapshot when it breaks.