Certified Kubernetes Administrator (CKA) #25 Troubleshooting 4: Networking, DNS, RBAC, Certificate Expiry
In #24 Troubleshooting 3 we covered diagnosing dead control plane components and recovering etcd. What remains are the cases where the nodes are alive and the control plane is healthy, yet communication doesn’t go through or permissions are blocked. In this #25, the last troubleshooting post, we diagnose the networking and permissions territory in order: service communication failures, DNS resolution failures, RBAC denials, and certificate expiry.
What these scenarios have in common is that the error message doesn’t point directly at the cause. The Pod is Running but curl times out; the cluster is perfectly healthy but kubectl is suddenly denied. The key, therefore, is to make the diagnostic order — narrowing from symptom to cause — second nature.
When service communication fails #
This is the most common symptom. A client Pod connects to a service by name or ClusterIP and gets no response. Service communication flows through the stages name resolution → service → Endpoints → target Pod → network policy, so following that flow and cutting it one stage at a time narrows the cause.
Step 1: Separate whether it’s actually a network problem #
First, make the call directly from inside the client Pod. If it fails by service name, it might be a DNS problem, so also call it by ClusterIP to split the two cases apart.
# By service name from inside the client Pod
k exec -it client -- curl -sS http://my-svc:80
# The same call by ClusterIP (skips name resolution)
k get svc my-svc # check CLUSTER-IP
k exec -it client -- curl -sS http://10.96.0.42:80
If the service name fails but the ClusterIP works, it’s a DNS problem, so move on to the DNS section below. If the ClusterIP also fails, look at the service, the Endpoints, or the network policy.
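The Step 1 branch can be written down as a tiny decision helper. This is only a sketch: classify_step1 is a hypothetical name, and it assumes you capture the exit codes of the two curl calls yourself (adding --max-time so a dead service fails fast instead of hanging).

```shell
# classify_step1 NAME_RC IP_RC
# NAME_RC / IP_RC: exit codes of curl by service name and by ClusterIP (0 = success)
classify_step1() {
  name_rc=$1
  ip_rc=$2
  if [ "$name_rc" -eq 0 ] && [ "$ip_rc" -eq 0 ]; then
    echo "ok: both paths work"
  elif [ "$ip_rc" -eq 0 ]; then
    echo "dns: name fails, ClusterIP works"
  else
    echo "service: check Endpoints, kube-proxy, NetworkPolicy"
  fi
}

# Usage inside the cluster (not executed here):
#   k exec client -- curl -sS --max-time 5 -o /dev/null http://my-svc:80;     name_rc=$?
#   k exec client -- curl -sS --max-time 5 -o /dev/null http://10.96.0.42:80; ip_rc=$?
#   classify_step1 "$name_rc" "$ip_rc"
```

Dropping -it is deliberate in the usage lines: no terminal is needed when you only want the exit code.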
Step 2: Check whether Endpoints is empty #
Half of service troubleshooting ends right here. The list of target Pods a service sends traffic to is the Endpoints (or EndpointSlice), and if it’s empty the service has only a ClusterIP with nothing to receive the traffic.
# Check Endpoints (<none> if empty)
k get endpoints my-svc
k get endpointslices -l kubernetes.io/service-name=my-svc
If Endpoints is empty, the cause is one of three.
| Cause | Check | Fix |
|---|---|---|
| selector mismatch | Compare the service selector with the Pod labels | Align the labels or the selector |
| target Pod not Ready | The READY column of k get pod -l app=... | Fix the readinessProbe or the app itself |
| targetPort mismatch | The service targetPort vs the container containerPort | Align the ports |
A selector mismatch is the most common. Compare the service’s selector with the Pod’s labels directly.
# The service's selector
k get svc my-svc -o jsonpath='{.spec.selector}'
# Whether that selector actually matches any Pod
k get pod --selector app=web
If the selector matches no Pod at all, the labels are off. Another common trap is the case where the Pod is Running but not Ready. Only Ready Pods make it into Endpoints, so if the readinessProbe fails, the Pod drops out of Endpoints even while Running.
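To make the matching rule concrete, here is a minimal sketch of the comparison you are doing by eye: every key=value pair in the service selector has to appear among the Pod’s labels. selector_matches is a hypothetical helper, and the label values in the two sample calls are made up.

```shell
# selector_matches SELECTOR LABELS
# Both arguments are comma-separated key=value lists, the form kubectl prints in the
# SELECTOR column of `k get svc -o wide` and the LABELS column of `k get pod --show-labels`.
# Returns 0 when every selector pair appears among the Pod's labels.
selector_matches() {
  selector=$1
  labels=",$2,"
  # A service without a selector gets no Endpoints created for it automatically
  [ -n "$selector" ] || return 1
  for pair in $(printf '%s' "$selector" | tr ',' ' '); do
    case "$labels" in
      *",$pair,"*) ;;          # this key=value is present on the Pod
      *) return 1 ;;           # one missing pair is enough to break the match
    esac
  done
  return 0
}

# Extra labels on the Pod are fine; a missing selector pair is not
selector_matches "app=web" "app=web,pod-template-hash=5d9c7" && echo "match"
selector_matches "app=web,tier=frontend" "app=web" || echo "mismatch"
```

The second call shows the typical exam trap: the Pod carries app=web, but the service also demands a label the Pod never had.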
Step 3: Check kube-proxy #
If Endpoints is populated but you still can’t reach the ClusterIP, suspect kube-proxy, which forwards the ClusterIP to the actual Pods. kube-proxy usually runs as a DaemonSet on each node, installing iptables or IPVS rules on the node.
# kube-proxy DaemonSet status
k -n kube-system get pods -l k8s-app=kube-proxy -o wide
# kube-proxy logs (ds/kube-proxy picks one Pod; for a specific node, pass the Pod name from the -o wide output instead)
k -n kube-system logs ds/kube-proxy
If the kube-proxy Pod is dead on some nodes, you get a location-dependent symptom where only clients scheduled on those nodes can’t communicate. Check the status of the CNI plugin (Calico, Cilium, etc.) Pods the same way.
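Because the symptom depends on the node, it helps to reduce the Pod list to just the nodes whose kube-proxy is unhealthy. The following is a rough sketch: unready_nodes is a hypothetical helper, the node names in the sample are invented, and a node that has no kube-proxy Pod at all will not show up, so compare against k get nodes as well.

```shell
# unready_nodes: reads "NODE READY" lines on stdin and prints every node whose
# kube-proxy container is not reporting ready=true
unready_nodes() {
  awk '$2 != "true" { print $1 }'
}

# Usage against a cluster (not executed here):
#   k -n kube-system get pods -l k8s-app=kube-proxy --no-headers \
#     -o 'custom-columns=NODE:.spec.nodeName,READY:.status.containerStatuses[*].ready' \
#     | unready_nodes

# Hypothetical sample input
printf 'node-a true\nnode-b false\nnode-c <none>\n' | unready_nodes
```

custom-columns is used instead of -o wide because the RESTARTS column can contain spaces, which makes position-based parsing of the wide output unreliable.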
Step 4: Check whether a NetworkPolicy is blocking it #
If everything above is fine yet traffic is still blocked, there’s a good chance a NetworkPolicy is dropping it. The moment even one NetworkPolicy selects a Pod, that Pod becomes default-deny for each direction (ingress, egress, or both) listed in the policy’s policyTypes: only traffic that some policy explicitly allows gets through.
# Policy list in the target namespace
k -n prod get networkpolicy
# Details of a specific policy (which ingress/egress it allows)
k -n prod describe networkpolicy default-deny
There are two things to look at in describe: which Pods the podSelector catches, and which sources and destinations the Ingress/Egress rules allow. If the only policy is a default-deny that selects every Pod and there are no allow rules, all communication in the denied directions is blocked in that namespace. Another commonly missed trap is that, when policies apply on both sides, the client’s egress and the server’s ingress must both be explicitly allowed. Even if the client’s egress is permitted, communication fails if the server’s ingress is blocked. Accidentally blocking DNS is also frequent: when you apply an egress policy and leave out UDP/TCP port 53 toward CoreDNS in kube-system, name resolution breaks first. The policy syntax and behavior are covered in detail in #20 Networking 3.
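As an illustration of the DNS trap, here is a sketch of an egress allow rule for port 53. Several things are assumptions rather than facts from this post: the policy name allow-dns-egress and the helper write_dns_egress_policy are made up, the namespace is taken to be prod, and the namespaceSelector relies on the kubernetes.io/metadata.name label that recent Kubernetes versions set on every namespace automatically.

```shell
# write_dns_egress_policy FILE: writes a NetworkPolicy manifest that lets every Pod
# in the prod namespace send DNS queries (UDP and TCP 53) to CoreDNS in kube-system
write_dns_egress_policy() {
  cat > "$1" <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: prod
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
}

# Usage (not executed here):
#   write_dns_egress_policy allow-dns-egress.yaml
#   k apply -f allow-dns-egress.yaml
```

Policies are additive, so this can sit next to an existing default-deny: it only opens the DNS path and leaves everything else blocked.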
When DNS resolution fails #
This is the case where the service name fails but the ClusterIP succeeds, or where external domain resolution doesn’t work. Cluster DNS is handled by CoreDNS, so narrow down around it.
Start with CoreDNS Pod status #
# CoreDNS Pods and service
k -n kube-system get pods -l k8s-app=kube-dns -o wide
k -n kube-system get svc kube-dns
# CoreDNS logs (whether errors are logged)
k -n kube-system logs -l k8s-app=kube-dns
If the CoreDNS Pod is in CrashLoopBackOff or there are zero of them running, name resolution breaks for the whole cluster. If the logs show messages like plugin/errors, check the Corefile setting in the ConfigMap.
k -n kube-system get configmap coredns -o yaml
Test directly with nslookup #
Spin up a diagnostic Pod and resolve names directly from inside the cluster. If external images are blocked, you can also use nslookup or getent hosts from inside an existing Pod.
# Disposable diagnostic Pod
k run dnsutils --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
--rm -it --restart=Never -- bash
# Resolve a cluster-internal service
nslookup my-svc.default.svc.cluster.local
# The default kubernetes service (if this fails, DNS itself is down)
nslookup kubernetes.default
# An external domain (if only this fails, it's an upstream/forward problem)
nslookup example.com
The resolution result splits the cause. If even kubernetes.default fails, the problem is CoreDNS or the kube-dns service itself; if internal resolution works but only external fails, it’s CoreDNS’s forward setting or the node’s upstream DNS.
Look at /etc/resolv.conf #
A Pod’s DNS settings go into /etc/resolv.conf. The nameserver should point at the ClusterIP of the kube-dns service, and the search domains should include the cluster suffix for short-name resolution to work.
# resolv.conf inside the Pod
k exec -it client -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
If the nameserver IP differs from the ClusterIP in k -n kube-system get svc kube-dns, DNS is pointing at the wrong place. If the search line is empty, short names like my-svc won’t resolve and only FQDNs will. If you suspect the node’s own DNS, check the node’s /etc/resolv.conf and the kubelet’s --resolv-conf setting as well.
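The two checks on resolv.conf can be scripted roughly as follows. Treat it as a sketch under assumptions: check_resolv is a hypothetical helper, and it takes the cluster domain to be the default cluster.local.

```shell
# check_resolv EXPECTED_NAMESERVER  (resolv.conf content on stdin)
# Verifies the nameserver matches the kube-dns ClusterIP and that the search list
# contains the svc.cluster.local suffix needed for short service names.
check_resolv() {
  expected_ns=$1
  conf=$(cat)
  ns=$(printf '%s\n' "$conf" | awk '$1 == "nameserver" { print $2; exit }')
  search=$(printf '%s\n' "$conf" | awk '$1 == "search" { $1 = ""; print; exit }')
  if [ "$ns" != "$expected_ns" ]; then
    echo "nameserver is ${ns:-missing}, expected $expected_ns"
    return 1
  fi
  case " $search " in
    *" svc.cluster.local "*) ;;
    *) echo "search list lacks svc.cluster.local"; return 1 ;;
  esac
  echo "resolv.conf looks consistent"
}

# Usage against a cluster (not executed here):
#   k exec client -- cat /etc/resolv.conf \
#     | check_resolv "$(k -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}')"

# The sample file from above
check_resolv 10.96.0.10 <<'EOF'
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
EOF
```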
DNS diagnostic order, summarized #
| Symptom | Suspect | Check command |
|---|---|---|
| All name resolution fails | CoreDNS Pod down | k -n kube-system get pods -l k8s-app=kube-dns |
| Internal works but external fails | CoreDNS forward / upstream | logs, configmap coredns |
| Only short names fail | resolv.conf search | k exec ... cat /etc/resolv.conf |
| Only one Pod fails | That Pod’s dnsPolicy/resolv.conf | The Pod spec’s dnsPolicy |
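The “only short names fail” row comes down to how the resolver combines the search list with ndots. The sketch below prints the candidate names in the order a glibc-style resolver would try them; dns_candidates is a hypothetical helper, and other resolver implementations (for example musl in Alpine-based images) may differ in the details.

```shell
# dns_candidates NAME NDOTS SEARCH_DOMAIN...
# Prints the fully qualified names tried for NAME, in order.
dns_candidates() {
  name=$1
  ndots=$2
  shift 2
  case "$name" in
    *.) echo "$name"; return 0 ;;   # trailing dot: already absolute, no expansion
  esac
  # Number of dots in the name
  dots=$(printf '%s\n' "$name" | awk -F. '{ print NF - 1 }')
  # Enough dots: the name is tried as-is before the search list
  if [ "$dots" -ge "$ndots" ]; then echo "$name."; fi
  for domain in "$@"; do
    echo "$name.$domain."
  done
  # Too few dots: the name as-is is only the last resort
  if [ "$dots" -lt "$ndots" ]; then echo "$name."; fi
}

dns_candidates my-svc 5 default.svc.cluster.local svc.cluster.local cluster.local
```

With an empty search list the only candidate left for my-svc is the bare name, which no cluster DNS record matches. That is why an FQDN keeps working while the short name fails.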
When RBAC denies you #
This is the case where kubectl or a ServiceAccount is blocked by a Forbidden error. The cluster is perfectly fine, so it’s a permissions problem rather than a network one.
Read the Forbidden error #
RBAC error messages have a fixed format, so the message itself is the starting point of the diagnosis.
Error from server (Forbidden): pods is forbidden: User "dev" cannot list resource "pods" in API group "" in the namespace "prod"
This one line contains all the information you need. The subject is User "dev", the verb is list, the resource is pods, the API group is "" (core), and the namespace is prod. In other words, the user dev doesn’t have permission to list pods in the prod namespace, so you just need to check whether there’s a Role and Binding that satisfies these five.
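Since the format is fixed, the five elements can be pulled out mechanically. This is a sketch only: parse_forbidden is a hypothetical helper, and it handles just the namespaced form of the message shown above.

```shell
# parse_forbidden MESSAGE
# Extracts subject, verb, resource, API group and namespace from a namespaced
# RBAC Forbidden message; prints nothing if the message has a different shape.
parse_forbidden() {
  printf '%s\n' "$1" | sed -n -E \
    -e 's/.*User "([^"]*)" cannot ([a-z]+) resource "([^"]*)" in API group "([^"]*)" in the namespace "([^"]*)".*/subject=\1 verb=\2 resource=\3 apiGroup=\4 namespace=\5/' \
    -e 's/apiGroup= /apiGroup=core /' \
    -e '/^subject=/p'
}

parse_forbidden 'Error from server (Forbidden): pods is forbidden: User "dev" cannot list resource "pods" in API group "" in the namespace "prod"'
```

The empty API group is printed as core so that it is not mistaken for a missing field.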
Check permissions with auth can-i #
Don’t guess about permissions — ask directly with auth can-i.
# My permissions
k auth can-i list pods -n prod
# A specific user/ServiceAccount's permissions (checked by an admin on their behalf)
k auth can-i list pods -n prod --as=dev
k auth can-i create deployments -n prod \
--as=system:serviceaccount:prod:builder
# List all permissions held
k auth can-i --list -n prod --as=dev
Impersonating another subject with --as to ask the question is the heart of the diagnosis. --list shows every permission that subject holds as a table, so you can see at a glance what’s missing.
Check whether a Role/Binding is missing #
When you get a no, pinpoint where the permission chain is broken. Permissions flow as subject → Binding → Role → rules, so there are three cases: no Binding, no Role, or the Role’s rules are insufficient.
# Roles and RoleBindings in the namespace
k -n prod get role,rolebinding
k -n prod describe rolebinding dev-binding
# If it's cluster-scoped
k get clusterrole,clusterrolebinding | grep -i dev
In describe rolebinding, line up two things: whether the user or ServiceAccount in question is listed in Subjects with the exact name and kind, and whether the target the Role points at actually exists and contains the verbs and resources you need. Here are the common traps.
- The RoleBinding exists but the Role it points to doesn’t. A typo in roleRef or pointing at a deleted Role makes the permissions zero
- Confusing the kind of subject. User, Group, and ServiceAccount are different. A ServiceAccount must match kind: ServiceAccount down to the namespace
- Scope mismatch. You need namespace permissions but only have a ClusterRole, or the reverse. Cluster-scoped permissions must be bound with a ClusterRoleBinding rather than a RoleBinding to apply across all namespaces
The RBAC object model and ServiceAccount tokens are covered in #9 RBAC, so here we’ll focus on the flow of narrowing down when you’re blocked.
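Once can-i has told you which permission is missing, the fix is a Role plus a RoleBinding with the right subject kind. The sketch below only prints the commands instead of running them; binding_flag and print_fix are hypothetical helpers, the generated Role and RoleBinding names are made up, and Group subjects (which would use --group) are left out.

```shell
# binding_flag SUBJECT
# Maps an --as style subject to the matching `create rolebinding` flag.
binding_flag() {
  case "$1" in
    system:serviceaccount:*:*) echo "--serviceaccount=${1#system:serviceaccount:}" ;;
    *) echo "--user=$1" ;;
  esac
}

# print_fix SUBJECT VERB RESOURCE NAMESPACE
# Prints the two commands that would grant exactly the missing permission.
print_fix() {
  subject=$1 verb=$2 resource=$3 ns=$4
  role="$verb-$resource"
  echo "k -n $ns create role $role --verb=$verb --resource=$resource"
  echo "k -n $ns create rolebinding $role-binding --role=$role $(binding_flag "$subject")"
}

print_fix dev list pods prod
print_fix system:serviceaccount:prod:builder create deployments prod
```

After applying the printed commands, rerun k auth can-i with the same --as subject to confirm the answer has flipped to yes.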
When a certificate has expired #
The most bewildering symptom is kubectl being denied across the board one day. If the cluster is running but authentication fails, suspect certificate expiry first. The control plane certificates of a kubeadm cluster have a default validity of one year, so this often blows up on clusters left untouched for a long time without an upgrade.
Symptom: kubectl authentication failure #
Unable to connect to the server: x509: certificate has expired or is not yet valid
If you see this x509 ... expired message, certificate expiry is almost certainly the cause. It’s a different message from a network error (connection refused), so the two are easy to tell apart.
Confirm with check-expiration #
On the control plane node, view the expiry dates of all certificates at once with kubeadm.
# From inside the control plane node
kubeadm certs check-expiration
CERTIFICATE             EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
admin.conf              Jun 01, 2026 09:00 UTC   <invalid>       no
apiserver               Jun 01, 2026 09:00 UTC   <invalid>       no
apiserver-etcd-client   Jun 01, 2026 09:00 UTC   <invalid>       no
...
If RESIDUAL TIME is <invalid> or negative, it has already expired. To check an individual certificate file directly, you can also read the expiry date with openssl.
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -enddate
Renew and re-fetch kubeconfig #
Renew the expired certificates with kubeadm. After renewal you have to bring the control plane components (static Pods) back up with the new certificates, and copy the admin kubeconfig anew as well.
# Renew all certificates
kubeadm certs renew all
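# Restart the static Pods so they load the new certificates: move the manifests out,
# wait for the kubelet to stop the Pods, then move them back.
# /tmp/manifests-off is an arbitrary scratch directory. Restarting the kubelet alone
# generally does not recreate static Pod containers that are already running.
mkdir -p /tmp/manifests-off
mv /etc/kubernetes/manifests/*.yaml /tmp/manifests-off/
sleep 20
mv /tmp/manifests-off/*.yaml /etc/kubernetes/manifests/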
# Use the renewed admin.conf again
cp /etc/kubernetes/admin.conf ~/.kube/config
# Check that it works again
kubeadm certs check-expiration
k get nodes
If kubectl throws the same error even right after renewal, the usual cause is that your user kubeconfig still holds the old certificate. Renewal rewrites /etc/kubernetes/admin.conf, so you need to copy that file to the user kubeconfig location for the change to take effect. The background on the PKI structure and the renewal procedure is covered in #8 Certificate Management.
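To confirm a renewal landed, or to watch for the next expiry, the notAfter line that openssl x509 -enddate prints can be turned into days remaining. A minimal sketch, assuming GNU date (for the -d option); cert_status is a hypothetical helper, and the optional second argument exists only so the current time can be pinned.

```shell
# cert_status "notAfter=<date>" [NOW_EPOCH]
# Prints "expired" or "<N> days left" for the given openssl -enddate line.
cert_status() {
  end=${1#notAfter=}                       # strip the "notAfter=" prefix
  now=${2:-$(date +%s)}                    # default to the current time
  end_epoch=$(date -u -d "$end" +%s) || return 1
  if [ "$end_epoch" -le "$now" ]; then
    echo "expired"
  else
    echo "$(( (end_epoch - now) / 86400 )) days left"
  fi
}

# Usage on the control plane node (not executed here):
#   cert_status "$(openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -enddate)"
```

If you only need a yes/no answer, openssl x509 -checkend 0 -noout -in with the certificate path reports through its exit status whether the certificate has already expired.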
Diagnostic order by symptom, at a glance #
Arranged by symptom, the networking and permissions troubleshooting in this post comes down to the table below. In the exam you’re only given the first column (the symptom), so being able to recall the next two columns from memory is what saves time.
| Symptom | Where to look first | Key command |
|---|---|---|
| No response by either service name or IP | Is Endpoints empty → selector | k get endpoints, k get svc -o jsonpath |
| Only the name fails, ClusterIP works | CoreDNS → resolv.conf | nslookup, k -n kube-system logs -l k8s-app=kube-dns |
| Endpoints is fine but it’s blocked | kube-proxy → NetworkPolicy | k -n kube-system get pods -l k8s-app=kube-proxy, k get netpol |
| Forbidden error | The error’s 5 elements → Role/Binding | k auth can-i --as=, k get role,rolebinding |
| x509 certificate has expired | Certificate expiry | kubeadm certs check-expiration |
| kubectl connection refused | apiserver down (#24) | crictl ps, static Pod logs |
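The kubectl-side rows of the table can be sketched as a lookup on the error text. triage is a hypothetical helper and the matched phrases are only the ones quoted in this post, plus the “was refused” wording kubectl commonly uses when the apiserver is unreachable.

```shell
# triage MESSAGE: maps a kubectl error message to the first thing to check
triage() {
  case "$1" in
    *"certificate has expired"*)
      echo "certificate expiry: kubeadm certs check-expiration" ;;
    *"(Forbidden)"*|*"is forbidden"*)
      echo "RBAC: k auth can-i --as=, k get role,rolebinding" ;;
    *"connection refused"*|*"was refused"*)
      echo "apiserver down: crictl ps, static Pod logs" ;;
    *)
      echo "not a kubectl-side error: start from Endpoints and DNS" ;;
  esac
}

triage 'Unable to connect to the server: x509: certificate has expired or is not yet valid'
```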
Exam points: troubleshooting wrap-up #
Let me tie together the four troubleshooting posts running since #22 from an exam perspective.
- Start from the symptom and cut one stage at a time. The first branch point is Endpoints for service communication, CoreDNS and resolv.conf for DNS, the five elements of the Forbidden message for permissions, and the x509 message for authentication
- Check whether Endpoints is empty first. It’s the fastest-scoring check in service troubleshooting, and the cause is almost always a selector mismatch or a Pod that isn’t Ready
- Eliminate guesswork with auth can-i --as. For permission problems, don’t guess: impersonate the subject and ask directly to settle it. Use --list to see the missing permission at a glance
- For certificates, check-expiration is one line. Don’t forget to re-copy ~/.kube/config after renewal
- Always reproduce and confirm after a fix. You have to verify by result that curl goes through again and that k get nodes runs again, because that’s what gets graded
- Set the context first. Troubleshooting tasks are only graded if solved on the designated cluster/node, so run use-context first
Troubleshooting is the single largest domain at 30%, and it draws on knowledge from all the other domains. If you’ve followed along this far, your understanding of architecture, workloads, networking, and storage should have coalesced into a coherent diagnostic flow.
Next: exam tips #
We’ve gone a full loop through every domain, all the way through troubleshooting. What remains is how you operate your 2 hours.
In #26 Exam Tips, Time Management, and Patterns People Miss, we’ll gather how to decide the order of tasks, the strategy for going after partial credit, the criteria for when to skip a task you’re stuck on, and the recurring mistakes that eat away at the passing line — like selector mismatches, a missing context, and an un-copied kubeconfig.