Certified Kubernetes Administrator (CKA) #25 Troubleshooting 4: Networking, DNS, RBAC, Certificate Expiry

In #24 Troubleshooting 3 we covered diagnosing dead control plane components and recovering etcd. What remains are the cases where the nodes are alive and the control plane is healthy, yet communication doesn’t go through or permissions are blocked. In this #25, the last troubleshooting post, we diagnose the networking and permissions territory in order: service communication failures, DNS resolution failures, RBAC denials, and certificate expiry.

What these scenarios have in common is that the error message doesn’t point directly at the cause. The Pod is Running but curl times out; the cluster is perfectly healthy but kubectl is suddenly denied. The key, therefore, is to make the diagnostic order — narrowing from symptom to cause — second nature.

When service communication fails #

This is the most common symptom. A client Pod connects to a service by name or ClusterIP and gets no response. Service communication flows through the stages name resolution → service → Endpoints → target Pod → network policy, so following that flow and cutting it one stage at a time narrows the cause.

Step 1: Separate whether it’s actually a network problem #

First, make the call directly from inside the client Pod. If it fails by service name, it might be a DNS problem, so also call it by ClusterIP to split the two cases apart.

# By service name from inside the client Pod
k exec -it client -- curl -sS http://my-svc:80

# The same call by ClusterIP (skips name resolution)
k get svc my-svc          # check CLUSTER-IP
k exec -it client -- curl -sS http://10.96.0.42:80

If the service name fails but the ClusterIP works, it’s a DNS problem — move on to the DNS section below. If the ClusterIP also fails, look at the service, the Endpoints, or the network policy.
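
The split above can be sketched as a small shell function. This is only an illustration: probe_service is a hypothetical helper name, it assumes the client image ships curl and that the Service lives in the current namespace, and it spells out kubectl because aliases like k are not expanded inside scripts.

```shell
# Hypothetical helper: probe a Service by name and by ClusterIP from a client Pod.
# Usage: probe_service <client-pod> <service-name> <port>
probe_service() {
  pod=$1; svc=$2; port=$3
  # Look up the ClusterIP so the second call skips name resolution.
  ip=$(kubectl get svc "$svc" -o jsonpath='{.spec.clusterIP}')

  by_name=ok; by_ip=ok
  kubectl exec "$pod" -- curl -sS -o /dev/null --max-time 5 "http://$svc:$port" || by_name=fail
  kubectl exec "$pod" -- curl -sS -o /dev/null --max-time 5 "http://$ip:$port" || by_ip=fail

  case "$by_name/$by_ip" in
    ok/ok)     echo "reachable by name and by ClusterIP" ;;
    fail/ok)   echo "DNS problem: ClusterIP works, the name does not" ;;
    fail/fail) echo "check the Service, Endpoints, or NetworkPolicy" ;;
    ok/fail)   echo "inconsistent result: rerun and compare the resolved IP with the ClusterIP" ;;
  esac
}
```

Calling probe_service client my-svc 80 prints which branch to follow next. The --max-time 5 keeps a dropped connection from hanging the check.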

Step 2: Check whether Endpoints is empty #

Half of service troubleshooting ends right here. The list of target Pods a service sends traffic to is the Endpoints (or EndpointSlice), and if it’s empty the service has only a ClusterIP with nothing to receive the traffic.

# Check Endpoints (<none> if empty)
k get endpoints my-svc
k get endpointslices -l kubernetes.io/service-name=my-svc

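If Endpoints is empty, the cause is usually one of three.

| Cause | Check | Fix |
| --- | --- | --- |
| selector mismatch | Compare the service selector with the Pod labels | Align the labels or the selector |
| target Pod not Ready | The READY column of k get pod -l app=... | Fix the readinessProbe or the app itself |
| targetPort mismatch | The service targetPort vs the container containerPort | Align the ports |

One nuance on the last row: a named targetPort that matches no container port name leaves the Pod out of Endpoints, while a wrong numeric targetPort leaves Endpoints populated and the connection simply fails.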

A selector mismatch is the most common. Compare the service’s selector with the Pod’s labels directly.

# The service's selector
k get svc my-svc -o jsonpath='{.spec.selector}'

# Whether that selector actually matches any Pod
k get pod --selector app=web

If the selector matches no Pod at all, the labels are off. Another common trap is the case where the Pod is Running but not Ready. Only Ready Pods make it into Endpoints, so if the readinessProbe fails, the Pod drops out of Endpoints even while Running.
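
To avoid retyping the selector by hand, the comparison can be scripted roughly like this. It is a sketch under assumptions: pods_for_service is a hypothetical helper name, and it relies on kubectl's go-template output to turn the selector map into a key=value,key=value string.

```shell
# Hypothetical helper: list the Pods a Service's selector actually matches.
# Usage: pods_for_service <service-name> [namespace]
pods_for_service() {
  svc=$1; ns=${2:-default}
  # Render the selector map as "key=value,key=value," with a go-template.
  sel=$(kubectl -n "$ns" get svc "$svc" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')
  sel=${sel%,}   # drop the trailing comma
  if [ -z "$sel" ]; then
    echo "service $svc has no selector" >&2
    return 1
  fi
  echo "selector: $sel"
  # Only Pods whose READY column shows all containers ready are added to Endpoints.
  kubectl -n "$ns" get pod -l "$sel" -o wide
}
```

If the Pod list comes back empty, the labels are off; if Pods are listed but READY is not n/n, look at the readinessProbe.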

Step 3: Check kube-proxy #

If Endpoints is populated but you still can’t reach the ClusterIP, suspect kube-proxy, which forwards the ClusterIP to the actual Pods. kube-proxy usually runs as a DaemonSet on each node, installing iptables or IPVS rules on the node.

# kube-proxy DaemonSet status
k -n kube-system get pods -l k8s-app=kube-proxy -o wide

# kube-proxy logs (ds/kube-proxy picks one Pod of the DaemonSet;
# pass a Pod name from the list above to read a specific node's logs)
k -n kube-system logs ds/kube-proxy

If the kube-proxy Pod is dead on some nodes, you get a location-dependent symptom where only clients scheduled on those nodes can’t communicate. Check the status of the CNI plugin (Calico, Cilium, etc.) Pods the same way.
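
Because the symptom depends on the node, it helps to read the logs of the kube-proxy Pod on the exact node where the failing client runs. A minimal sketch, assuming the k8s-app=kube-proxy label used above; kube_proxy_logs is a hypothetical helper name.

```shell
# Hypothetical helper: show recent kube-proxy logs for one node.
# Usage: kube_proxy_logs <node-name>
kube_proxy_logs() {
  node=$1
  # Pods can be filtered by the node they run on with a field selector.
  pod=$(kubectl -n kube-system get pods -l k8s-app=kube-proxy \
    --field-selector "spec.nodeName=$node" -o name)
  if [ -z "$pod" ]; then
    echo "no kube-proxy Pod found on node $node" >&2
    return 1
  fi
  kubectl -n kube-system logs "$pod" --tail=50
}
```

The client's node name is in the NODE column of k get pod client -o wide.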

Step 4: Check whether a NetworkPolicy is blocking it #

If everything above is fine yet traffic is still blocked, there’s a good chance a NetworkPolicy is dropping it. The moment even one NetworkPolicy selects a Pod, traffic in the directions that policy covers (ingress, egress, or both) becomes default-deny: anything not explicitly allowed is dropped.

# Policy list in the target namespace
k -n prod get networkpolicy

# Details of a specific policy (which ingress/egress it allows)
k -n prod describe networkpolicy default-deny

There are two things to look at in describe: which Pods the podSelector catches, and which sources and destinations the Ingress/Egress rules allow. If the only policy is a default-deny with no allow rules, all traffic to or from the Pods it selects is blocked in the directions it covers.

Another commonly missed trap is that when policies cover both sides, both the client’s egress and the server’s ingress must be explicitly allowed. Even if the client’s egress is permitted, communication fails if the server’s ingress is blocked.

Accidentally blocking DNS is also frequent: when you apply an egress policy and leave out UDP/TCP port 53 toward CoreDNS in kube-system, name resolution breaks first. The policy syntax and behavior are covered in detail in #20 Networking 3.
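
For the DNS case, an egress rule along the following lines is one way to reopen port 53. This is a sketch, not the only valid policy: allow_dns_egress and the policy name allow-dns-egress are hypothetical, and it assumes CoreDNS carries the k8s-app=kube-dns label and that the kube-system namespace has the automatic kubernetes.io/metadata.name label.

```shell
# Hypothetical helper: print a NetworkPolicy that allows DNS egress to CoreDNS.
# Usage: allow_dns_egress <namespace> | kubectl apply -f -
allow_dns_egress() {
  ns=$1
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: $ns
spec:
  podSelector: {}          # every Pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
}
```

NetworkPolicies are additive, so this can sit next to an existing default-deny without editing it.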

When DNS resolution fails #

This is the case where the service name fails but the ClusterIP succeeds, or where external domain resolution doesn’t work. Cluster DNS is handled by CoreDNS, so narrow down around it.

Start with CoreDNS Pod status #

# CoreDNS Pods and service
k -n kube-system get pods -l k8s-app=kube-dns -o wide
k -n kube-system get svc kube-dns

# CoreDNS logs (whether errors are logged)
k -n kube-system logs -l k8s-app=kube-dns

If the CoreDNS Pod is in CrashLoop or there are zero of them running, name resolution breaks for the whole cluster. If the logs show messages like plugin/errors, check the Corefile setting in the ConfigMap.

k -n kube-system get configmap coredns -o yaml

Test directly with nslookup #

Spin up a diagnostic Pod and resolve names directly from inside the cluster. If external images are blocked, you can also use nslookup or getent hosts from inside an existing Pod.

# Disposable diagnostic Pod
k run dnsutils --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  --rm -it --restart=Never -- bash

# Resolve a cluster-internal service
nslookup my-svc.default.svc.cluster.local

# The default kubernetes service (if this fails, DNS itself is down)
nslookup kubernetes.default

# An external domain (if only this fails, it's an upstream/forward problem)
nslookup example.com

The resolution result splits the cause. If even kubernetes.default fails, the problem is CoreDNS or the kube-dns service itself; if internal resolution works but only external fails, it’s CoreDNS’s forward setting or the node’s upstream DNS.
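
The same branching can be written down as a function that runs against an existing Pod instead of an interactive shell. A rough sketch: dns_triage is a hypothetical helper name, it assumes the Pod's image provides nslookup, and it treats any failure of kubectl exec (including a missing Pod) as a failed lookup.

```shell
# Hypothetical helper: classify a DNS failure from inside an existing Pod.
# Usage: dns_triage <pod>
dns_triage() {
  pod=$1
  if ! kubectl exec "$pod" -- nslookup kubernetes.default >/dev/null 2>&1; then
    echo "cluster DNS is down: check the CoreDNS Pods and the kube-dns Service"
  elif ! kubectl exec "$pod" -- nslookup example.com >/dev/null 2>&1; then
    echo "internal OK, external fails: check the Corefile forward setting and the node upstream DNS"
  else
    echo "internal and external resolution both work"
  fi
}
```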

Look at /etc/resolv.conf #

A Pod’s DNS settings go into /etc/resolv.conf. The nameserver should point at the ClusterIP of the kube-dns service, and the search domains should include the cluster suffix for short-name resolution to work.

# resolv.conf inside the Pod
k exec -it client -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

If the nameserver IP differs from the ClusterIP in k -n kube-system get svc kube-dns, DNS is pointing at the wrong place. If the search line is empty, short names like my-svc won’t resolve and only FQDNs will. If you suspect the node’s own DNS, check the node’s /etc/resolv.conf and the kubelet’s --resolv-conf setting as well.
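
Comparing the two IPs by eye is error-prone under time pressure, so here is a minimal sketch of the check as a function. check_pod_nameserver is a hypothetical helper name, and it only looks at the first nameserver line in the Pod's resolv.conf.

```shell
# Hypothetical helper: compare a Pod's nameserver with the kube-dns ClusterIP.
# Usage: check_pod_nameserver <pod> [namespace]
check_pod_nameserver() {
  pod=$1; ns=${2:-default}
  want=$(kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}')
  got=$(kubectl -n "$ns" exec "$pod" -- cat /etc/resolv.conf \
    | awk '$1 == "nameserver" { print $2; exit }')
  if [ "$want" = "$got" ]; then
    echo "OK: nameserver $got matches the kube-dns ClusterIP"
  else
    echo "MISMATCH: Pod uses ${got:-<none>}, kube-dns ClusterIP is $want"
    # Print the dnsPolicy, since a value other than ClusterFirst can explain the mismatch.
    kubectl -n "$ns" get pod "$pod" -o jsonpath='{.spec.dnsPolicy}{"\n"}'
  fi
}
```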

DNS diagnostic order, summarized #

| Symptom | Suspect | Check command |
| --- | --- | --- |
| All name resolution fails | CoreDNS Pod down | k -n kube-system get pods -l k8s-app=kube-dns |
| Internal works but external fails | CoreDNS forward / upstream | logs, configmap coredns |
| Only short names fail | resolv.conf search | k exec ... cat /etc/resolv.conf |
| Only one Pod fails | That Pod’s dnsPolicy/resolv.conf | The Pod spec’s dnsPolicy |

When RBAC denies you #

This is the case where kubectl or a ServiceAccount is blocked by a Forbidden error. The cluster is perfectly fine, so it’s a permissions problem rather than a network one.

Read the Forbidden error #

RBAC error messages have a fixed format, so the message itself is the starting point of the diagnosis.

Error from server (Forbidden): pods is forbidden:
User "dev" cannot list resource "pods" in API group "" in the namespace "prod"

This one line contains all the information you need. The subject is User "dev", the verb is list, the resource is pods, the API group is "" (core), and the namespace is prod. In other words, the user dev doesn’t have permission to list pods in the prod namespace, so you just need to check whether there’s a Role and Binding that satisfies these five.
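
Because the format is fixed, the five fields map mechanically onto an auth can-i question. The function below is a small sketch of that mapping; forbidden_to_can_i is a hypothetical helper name, and it only handles the namespaced message shape shown above.

```shell
# Hypothetical helper: turn a Forbidden message (on stdin) into the matching can-i command.
# Usage: <failing kubectl command> 2>&1 | forbidden_to_can_i
forbidden_to_can_i() {
  # Join wrapped lines, then pull out user, verb, resource, API group, and namespace.
  # The first expression handles the core group (""), the second a named group.
  tr '\n' ' ' | sed -nE \
    -e 's/.*User "([^"]*)" cannot ([a-z]+) resource "([^"]*)" in API group "" in the namespace "([^"]*)".*/kubectl auth can-i \2 \3 -n \4 --as=\1/p' \
    -e 's/.*User "([^"]*)" cannot ([a-z]+) resource "([^"]*)" in API group "([^"]+)" in the namespace "([^"]*)".*/kubectl auth can-i \2 \3.\4 -n \5 --as=\1/p'
}
```

For the message above it prints kubectl auth can-i list pods -n prod --as=dev. A non-core API group is appended to the resource, for example deployments.apps.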

Check permissions with auth can-i #

Don’t guess about permissions — ask directly with auth can-i.

# My permissions
k auth can-i list pods -n prod

# A specific user/ServiceAccount's permissions (checked by an admin on their behalf)
k auth can-i list pods -n prod --as=dev
k auth can-i create deployments -n prod \
  --as=system:serviceaccount:prod:builder

# List all permissions held
k auth can-i --list -n prod --as=dev

Impersonating another subject with --as to ask the question is the heart of the diagnosis. --list shows every permission that subject holds as a table, so you can see at a glance what’s missing.
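
When the question is about one resource rather than the whole table, a loop over the common verbs gives a compact answer. This is only a sketch: can_i_verbs is a hypothetical helper name and the verb list is a typical selection, not an exhaustive one.

```shell
# Hypothetical helper: ask can-i for the common verbs on one resource.
# Usage: can_i_verbs <subject> <namespace> <resource>
can_i_verbs() {
  subject=$1; ns=$2; resource=$3
  for verb in get list watch create update patch delete; do
    # can-i prints yes or no; the exit code may be non-zero on "no", so it is ignored here.
    answer=$(kubectl auth can-i "$verb" "$resource" -n "$ns" --as="$subject" 2>/dev/null) || true
    printf '%-7s %s\n' "$verb" "${answer:-error}"
  done
}
```

For example, can_i_verbs system:serviceaccount:prod:builder prod deployments shows at a glance which verbs the builder ServiceAccount is missing.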

Check whether a Role/Binding is missing #

When you get a no, pinpoint where the permission chain is broken. Permissions flow as subject → Binding → Role → rules, so there are three cases: no Binding, no Role, or the Role’s rules are insufficient.

# Roles and RoleBindings in the namespace
k -n prod get role,rolebinding
k -n prod describe rolebinding dev-binding

# If it's cluster-scoped
k get clusterrole,clusterrolebinding | grep -i dev

In describe rolebinding, line up two things: whether the user or ServiceAccount in question is listed in Subjects with the exact name and kind, and whether the target the Role points at actually exists and contains the verbs and resources you need. Here are the common traps.

  • The RoleBinding exists but the Role it points to doesn’t. A typo in roleRef or pointing at a deleted Role makes the permissions zero
  • Confusing the kind of subject. User, Group, and ServiceAccount are different. A ServiceAccount must match kind: ServiceAccount down to the namespace
  • Scope mismatch. A ClusterRole bound with a RoleBinding grants its rules only inside that RoleBinding’s namespace. To apply across all namespaces, or to cluster-scoped resources such as nodes, it must be bound with a ClusterRoleBinding
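
The first two traps can be checked in one pass. A minimal sketch, assuming a namespaced RoleBinding; check_roleref is a hypothetical helper name.

```shell
# Hypothetical helper: verify that a RoleBinding's roleRef exists and list its subjects.
# Usage: check_roleref <namespace> <rolebinding>
check_roleref() {
  ns=$1; rb=$2
  kind=$(kubectl -n "$ns" get rolebinding "$rb" -o jsonpath='{.roleRef.kind}') || return 1
  name=$(kubectl -n "$ns" get rolebinding "$rb" -o jsonpath='{.roleRef.name}') || return 1
  echo "roleRef: $kind/$name"
  # A RoleBinding may point at a Role in its own namespace or at a ClusterRole.
  if [ "$kind" = "ClusterRole" ]; then
    kubectl get clusterrole "$name" -o name
  else
    kubectl -n "$ns" get role "$name" -o name
  fi || echo "BROKEN: $kind $name does not exist, so the binding grants nothing"
  # Subjects must match on kind, name, and (for ServiceAccounts) namespace.
  kubectl -n "$ns" get rolebinding "$rb" \
    -o jsonpath='{range .subjects[*]}{.kind}{" "}{.namespace}{"/"}{.name}{"\n"}{end}'
}
```

Running check_roleref prod dev-binding prints the referenced role, flags it if it is missing, and then lists each subject as kind namespace/name.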

The RBAC object model and ServiceAccount tokens are covered in #9 RBAC, so here we’ll focus on the flow of narrowing down when you’re blocked.

When a certificate has expired #

The most bewildering symptom is kubectl being denied across the board one day. If the cluster is running but authentication fails, suspect certificate expiry first. The control plane certificates of a kubeadm cluster have a default validity of one year, so this often blows up on clusters left untouched for a long time without an upgrade.

Symptom: kubectl authentication failure #

Unable to connect to the server: x509: certificate has expired or is not yet valid

If you see this x509 ... expired message, it’s almost certain. It’s a different message from a network error (connection refused), so there’s no confusing the two.

Confirm with check-expiration #

On the control plane node, view the expiry dates of all certificates at once with kubeadm.

# From inside the control plane node
kubeadm certs check-expiration
CERTIFICATE                EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
admin.conf                 Jun 01, 2026 09:00 UTC   <invalid>       no
apiserver                  Jun 01, 2026 09:00 UTC   <invalid>       no
apiserver-etcd-client      Jun 01, 2026 09:00 UTC   <invalid>       no
...

If RESIDUAL TIME is <invalid> or negative, it has already expired. To check an individual certificate file directly, you can also read the expiry date with openssl.

openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -enddate
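
To sweep every certificate file rather than one, openssl's -checkend option can be looped over the PKI directory. A sketch under assumptions: certs_expiring_within is a hypothetical helper name, and the default path is the kubeadm layout under /etc/kubernetes/pki.

```shell
# Hypothetical helper: list certificates that expire within N days (or already have).
# Usage: certs_expiring_within <days> [pki-dir]
certs_expiring_within() {
  days=$1; dir=${2:-/etc/kubernetes/pki}
  secs=$((days * 86400))
  for crt in "$dir"/*.crt "$dir"/etcd/*.crt; do
    [ -f "$crt" ] || continue
    # -checkend exits non-zero if the certificate expires within $secs seconds.
    if ! openssl x509 -in "$crt" -noout -checkend "$secs" >/dev/null; then
      echo "$crt: $(openssl x509 -in "$crt" -noout -enddate)"
    fi
  done
}
```

certs_expiring_within 0 lists only certificates that are already expired, while certs_expiring_within 30 also gives a month of warning.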

Renew and re-fetch kubeconfig #

Renew the expired certificates with kubeadm. After renewal you have to bring the control plane components (static Pods) back up with the new certificates, and copy the admin kubeconfig anew as well.

# Renew all certificates
kubeadm certs renew all

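# Restart the static Pods so they load the new certificates:
# move the manifests out, wait for the kubelet to stop the Pods, then move them back
# (/tmp/manifests is an arbitrary holding directory; restarting the kubelet alone
# does not restart the already-running static Pod containers)
mkdir -p /tmp/manifests
mv /etc/kubernetes/manifests/*.yaml /tmp/manifests/
sleep 20
mv /tmp/manifests/*.yaml /etc/kubernetes/manifests/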

# Use the renewed admin.conf again
cp /etc/kubernetes/admin.conf ~/.kube/config

# Check that it works again
kubeadm certs check-expiration
k get nodes

If kubectl throws the same error even right after renewal, the kubeconfig you are using most likely still embeds the old client certificate. Renewal rewrites /etc/kubernetes/admin.conf, so you need to copy that file to the user kubeconfig location for the change to take effect. The background on the PKI structure and the renewal procedure is covered in #8 Certificate Management.
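
One way to confirm which file is stale is to read the expiry date of the client certificate embedded in each kubeconfig. A minimal sketch: kubeconfig_cert_enddate is a hypothetical helper name, and it assumes the certificate is embedded as client-certificate-data in the first user entry, which is how kubeadm writes admin.conf.

```shell
# Hypothetical helper: print the expiry date of the client certificate in a kubeconfig.
# Usage: kubeconfig_cert_enddate [kubeconfig-path]
kubeconfig_cert_enddate() {
  cfg=${1:-$HOME/.kube/config}
  # --raw is needed because config view redacts certificate data by default.
  kubectl --kubeconfig "$cfg" config view --raw \
    -o jsonpath='{.users[0].user.client-certificate-data}' \
    | base64 -d \
    | openssl x509 -noout -enddate
}
```

Run it once against ~/.kube/config and once against /etc/kubernetes/admin.conf. If the dates differ, the user copy is the old one.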

Diagnostic order by symptom, at a glance #

Organized by symptom, the networking and permissions troubleshooting above comes down to the following table. In the exam you’re only given the first column (the symptom), so being able to call up the next two columns straight from your head is what saves time.

| Symptom | Where to look first | Key command |
| --- | --- | --- |
| No response by either service name or IP | Is Endpoints empty → selector | k get endpoints, k get svc -o jsonpath |
| Only the name fails, ClusterIP works | CoreDNS → resolv.conf | nslookup, k -n kube-system logs -l k8s-app=kube-dns |
| Endpoints is fine but it’s blocked | kube-proxy → NetworkPolicy | k -n kube-system get pods -l k8s-app=kube-proxy, k get netpol |
| Forbidden error | The error’s 5 elements → Role/Binding | k auth can-i --as=, k get role,rolebinding |
| x509 certificate has expired | Certificate expiry | kubeadm certs check-expiration |
| kubectl connection refused | apiserver down (#24) | crictl ps, static Pod logs |

Exam points: troubleshooting wrap-up #

Let me tie together the four troubleshooting posts running since #22 from an exam perspective.

  • Start from the symptom and cut one stage at a time. The first branch point is Endpoints for service communication, CoreDNS and resolv.conf for DNS, the five elements of the Forbidden message for permissions, and the x509 message for authentication
  • Check whether Endpoints is empty first. It’s the fastest-scoring check in service troubleshooting, and the cause is almost always a selector mismatch or a Pod that isn’t Ready
  • Eliminate guesswork with auth can-i --as. For permission problems, don’t guess: impersonate the subject and ask directly to settle it. Use --list to see the missing permission at a glance
  • For certificates, check-expiration is one line. Don’t forget to re-copy ~/.kube/config after renewal
  • Always reproduce and confirm after a fix. You have to verify by result that curl goes through again and that k get nodes runs again — that’s what gets graded
  • Set the context first. Troubleshooting tasks are only graded if solved on the designated cluster/node, so run use-context first

Troubleshooting is the single largest domain at 30%, and it draws on knowledge from all the other domains. If you’ve followed along this far, your understanding of architecture, workloads, networking, and storage should have coalesced into a coherent diagnostic flow.

Next: exam tips #

We’ve gone a full loop through every domain, all the way through troubleshooting. What remains is how you operate your 2 hours.

In #26 Exam Tips, Time Management, and Patterns People Miss, we’ll gather how to decide the order of tasks, the strategy for going after partial credit, the criteria for when to skip a task you’re stuck on, and the recurring mistakes that eat away at the passing line — like selector mismatches, a missing context, and an un-copied kubeconfig.
