Certified Kubernetes Administrator (CKA) #5 HA Clusters: Multiple Control Planes and an External etcd Cluster
In #4 Installing a kubeadm cluster, we bootstrapped a single control plane cluster. That cluster has one weakness: there’s only one control plane node, so if it dies, both the apiserver and etcd vanish with it. The Pods on the worker nodes keep running for a while, but new deployments, scaling, and recovery all stop. In this post we tackle the high-availability (HA) cluster that eliminates this single point of failure.
The heart of an HA cluster is scaling the control plane out to multiple machines. But a control plane isn’t just the apiserver — it also carries the etcd state store — so the first decision is whether to keep etcd inside the control plane nodes or outside them. That choice splits into two topologies: stacked etcd and external etcd. Add a load balancer in front of the apiservers and the concept of etcd quorum, and the full picture of an HA cluster comes together.
The single point of failure of a single control plane
As we saw in #2, the control plane is made up of the apiserver, etcd, scheduler, and controller-manager. In a single-node cluster, all four run as static Pods on one node. Here’s what happens when that node goes down.
- apiserver stops. kubectl can no longer reach the cluster. New deployments, scaling, and deletions are all blocked.
- etcd stops. All the cluster’s state lived on that one node, so if its disk is corrupted, you lose the cluster state itself.
- scheduler stops. New Pods aren’t placed on nodes. They pile up in the Pending state.
- controller-manager stops. The reconciliation loop halts and can no longer hold a Deployment at its desired state.
Pods already running on worker nodes stay alive for now, since the kubelet keeps running them locally. But with no way to operate the cluster, it’s effectively an outage. HA replicates the control plane across multiple machines, so that even when one dies, the rest keep operating the cluster.
Two topologies: stacked etcd and external etcd
When you scale the control plane to multiple machines, the first thing to decide is where to put etcd. kubeadm supports two topologies.
Stacked etcd topology
This approach runs an etcd member alongside each control plane node. If you bring up three control plane nodes, etcd forms a 3-member cluster inside them. Each apiserver talks to the local etcd member on the same node.
[control plane 1] apiserver + etcd
[control plane 2] apiserver + etcd
[control plane 3] apiserver + etcd
|
(front-end load balancer)
|
[worker nodes]
The advantage of this approach is that it needs fewer nodes and a simpler setup. Adding control plane nodes makes both the control plane and etcd HA at the same time. It’s also kubeadm’s default topology. The downside is that the control plane and etcd share the same fate: when a node dies, that node’s apiserver and etcd member disappear together.
External etcd topology
This approach separates etcd outside the control plane nodes, into a dedicated etcd cluster. The control plane nodes run only the apiserver/scheduler/controller-manager, while a separate set of 3 (or 5) nodes handles etcd.
[control plane 1] apiserver [etcd 1]
[control plane 2] apiserver [etcd 2]
[control plane 3] apiserver [etcd 3]
| |
(front-end load balancer) (apiservers connect to external etcd)
|
[worker nodes]
The advantage of this approach is that failures of the control plane and etcd are isolated from each other. A control plane node can die while the etcd cluster stays healthy, and vice versa. The downside is that it needs more nodes and a more complex setup. You take on the extra work of standing up a separate etcd cluster and connecting it to the control plane with certificates.
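As a rough sketch of what connecting the control plane to an external etcd cluster looks like, kubeadm accepts a ClusterConfiguration whose etcd.external section lists the etcd endpoints and the client certificate paths. The helper name external_etcd_config, the endpoint IPs, and the LB address are placeholders invented for illustration, and the apiVersion shown is an assumption that depends on your kubeadm release.

```shell
#!/usr/bin/env bash
# Sketch: print a kubeadm ClusterConfiguration that points at an external etcd cluster.
# $1 is the LB endpoint, the remaining arguments are etcd client URLs (all hypothetical).
external_etcd_config() {
  local lb="$1"; shift
  cat <<EOF
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "${lb}"
etcd:
  external:
    endpoints:
EOF
  local ep
  for ep in "$@"; do
    printf '      - %s\n' "$ep"
  done
  # Client certificate the apiserver presents to etcd, plus the etcd CA.
  cat <<EOF
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
EOF
}

external_etcd_config "LOAD_BALANCER_DNS:6443" \
  https://10.0.1.11:2379 https://10.0.1.12:2379 https://10.0.1.13:2379
```

You would save the output to a file and hand it to kubeadm init with --config on the first control plane, after copying the three certificate files from the etcd cluster to that node.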
Trade-offs at a glance
| Item | stacked etcd | external etcd |
|---|---|---|
| Node count | Fewer (control plane = etcd) | More (control plane and etcd separated) |
| Setup complexity | Low (kubeadm default) | High (etcd stood up separately) |
| Fault isolation | Weak (apiserver+etcd on one node) | Strong (control plane and etcd separated) |
| Impact of node loss | apiserver and etcd member lost together | only apiserver lost, etcd unaffected |
| Recommended for | small scale, fast setup | large scale, high reliability requirements |
From a CKA standpoint, the key is to understand the trade-off: stacked is the default and simple, while external has strong fault isolation at the cost of complexity. Rather than asking you to build both from start to finish, the actual exam checks whether you understand this difference and etcd membership.
The front-end load balancer
Once you scale the control plane to multiple machines, a question follows immediately: which apiserver should worker nodes and kubectl connect to? There are three apiservers, and if you hardcode the IP of just one of them, that node becomes a single point of failure again the moment it dies.
So an HA cluster puts a load balancer (LB) in front of the multiple apiservers. Worker nodes and clients connect to the LB’s single address (a virtual IP or domain), and the LB distributes requests to a live apiserver. When one apiserver dies, the LB drops that node and routes to the rest.
- HAProxy. A commonly used L4 load balancer placed in front of the apiservers. It distributes requests arriving on port 6443 to each control plane’s 6443.
- keepalived. Uses VRRP to float a virtual IP (VIP) between nodes, reducing the single point of failure of the LB itself. It’s often paired with HAProxy.
- Cloud load balancers. Managed L4 LBs provided by clouds, such as AWS NLB and GCP TCP LB, play the same role.
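To make the HAProxy option concrete, here is a minimal sketch that prints a TCP-mode frontend and backend for the apiservers. The function name haproxy_apiserver_cfg and the node names and IPs are made up for illustration, and the global and defaults sections (timeouts, logging) are left out, so treat it as a starting point rather than a production config.

```shell
#!/usr/bin/env bash
# Sketch: print a minimal TCP-mode HAProxy config for the apiservers.
# Each argument is "name=ip"; names and IPs are hypothetical.
haproxy_apiserver_cfg() {
  cat <<'EOF'
frontend kube-apiserver
    bind *:6443
    mode tcp
    default_backend kube-apiserver

backend kube-apiserver
    mode tcp
    balance roundrobin
    option tcp-check
EOF
  local node
  for node in "$@"; do
    # ${node%%=*} is the name before "=", ${node#*=} is the IP after it.
    printf '    server %s %s:6443 check\n' "${node%%=*}" "${node#*=}"
  done
}

haproxy_apiserver_cfg cp-1=10.0.0.11 cp-2=10.0.0.12 cp-3=10.0.0.13
```

The check keyword on each server line is what lets HAProxy stop routing to an apiserver that no longer accepts connections. This layout assumes HAProxy runs on its own host; on a control plane node the bind port would collide with the local apiserver.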
init with --control-plane-endpoint
The key is to pin this LB address into the cluster as a fixed control plane endpoint. When you bootstrap the first control plane, specify the LB address with --control-plane-endpoint.
kubeadm init \
--control-plane-endpoint "LOAD_BALANCER_DNS:6443" \
--upload-certs \
--pod-network-cidr=10.244.0.0/16
When you set --control-plane-endpoint to the LB’s DNS or VIP, every kubeconfig and internal cluster setting generated afterward points at this single endpoint. That’s why you don’t have to change client configuration even when you add more control plane nodes. Conversely, if you run kubeadm init on a single node and then try to add control planes, the endpoint is pinned to one node’s IP, making the HA transition awkward. If you have HA in mind, you must specify --control-plane-endpoint from the very start.
--upload-certs is the option that uploads the control plane certificates to the cluster in encrypted form, so that control plane nodes joining later can download and use those certificates. It’s used in the join step in the next section.
etcd quorum and fault tolerance
As important as the control plane in HA is how etcd behaves. etcd is a distributed key-value store where multiple members replicate the same data. But to keep the members’ data from diverging, a majority of the members must agree before a write is committed. This majority is called the quorum.
Quorum is calculated as ⌊N / 2⌋ + 1, that is, N divided by 2 with the remainder dropped, plus 1. With N members, a majority of them must be alive for the cluster to accept writes. If the live members fall short of a majority, etcd refuses writes to preserve data consistency, and the apiserver on top of it effectively halts as well.
| Members (N) | Quorum | Failures tolerated |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |
As the table shows, it’s efficient to keep the member count odd. With 2 members the quorum is 2, so the cluster halts if even one dies — it adds cost without improving fault tolerance over a single member. With 3 members, one can die and the remaining 2 still form a majority and operate normally. Four members tolerate just one failure, the same as three, so the extra node buys you nothing. That’s why the standard is to configure etcd with an odd number of members, like 3 or 5. Three members are enough for most clusters, and you use 5 members when you need higher fault tolerance.
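The table can be reproduced with a few lines of shell arithmetic, where integer division does the rounding down. The function names quorum and tolerated are just labels chosen for this sketch.

```shell
#!/usr/bin/env bash
# quorum N: smallest majority of N members (shell division drops the remainder).
quorum() { echo $(( $1 / 2 + 1 )); }

# tolerated N: how many members can fail while a majority is still alive.
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  printf 'N=%d quorum=%d tolerated=%d\n' "$n" "$(quorum "$n")" "$(tolerated "$n")"
done
```

Running it shows the same pattern as the table: going from 3 to 4 members raises the quorum from 2 to 3 but leaves the tolerated failures at 1.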
Joining additional control plane nodes
Once you’ve bootstrapped the first control plane against the LB endpoint, you join the second and third control planes. You take the kubeadm join command you used to add worker nodes and add the --control-plane flag and the certificate key.
kubeadm join LOAD_BALANCER_DNS:6443 \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash> \
--control-plane \
--certificate-key <certificate-key>
Here’s the role of each argument.
- LOAD_BALANCER_DNS:6443. The join target is the LB endpoint, not a specific control plane. The node enters the cluster through this address.
- --token / --discovery-token-ca-cert-hash. The same authentication info as a worker join, which verifies that the joining node is contacting the right cluster.
- --control-plane. The signal to join this node as a control plane rather than a worker. The apiserver/scheduler/controller-manager (and, in a stacked setup, an etcd member) come up on this node.
- --certificate-key. The key that downloads and decrypts the certificates uploaded earlier with --upload-certs. You need this key to receive the shared control plane certificates and join.
The token and certificate key the join command needs can be regenerated on the first node.
# re-upload the control plane certificates and print a new certificate key
kubeadm init phase upload-certs --upload-certs
# print a fresh join command; for a control plane join, append --control-plane and --certificate-key <certificate-key>
kubeadm token create --print-join-command
The certificate key expires after two hours for security, so it’s common to reissue it with the commands above when joining a new control plane.
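The two outputs have to be combined by hand, which is easy to get wrong under time pressure. Here is a small sketch of that assembly step; control_plane_join is a hypothetical helper, and the placeholder strings stand in for what the two kubeadm commands print on the first control plane.

```shell
#!/usr/bin/env bash
# Sketch: turn a worker join command into a control plane join command.
# $1 = the line printed by `kubeadm token create --print-join-command`
# $2 = the key printed by `kubeadm init phase upload-certs --upload-certs`
control_plane_join() {
  printf '%s --control-plane --certificate-key %s\n' "$1" "$2"
}

# Placeholder values; real ones come from the first control plane node.
control_plane_join \
  "kubeadm join LOAD_BALANCER_DNS:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>" \
  "<certificate-key>"
```

kubeadm token create also accepts --certificate-key together with --print-join-command, which should print the control plane form directly; check kubeadm token create --help on your version before relying on it.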
Verification: checking nodes and etcd membership
Once all the control planes are joined, check two things: whether the node list shows multiple control-plane roles, and whether there are that many etcd members.
Checking control plane nodes
k get nodes
NAME       STATUS   ROLES           AGE   VERSION
cp-1       Ready    control-plane   30m   v1.31.0
cp-2       Ready    control-plane   12m   v1.31.0
cp-3       Ready    control-plane   8m    v1.31.0
worker-1   Ready    <none>          25m   v1.31.0
worker-2   Ready    <none>          25m   v1.31.0
If you see three control-plane entries in the ROLES column, control plane HA is configured. Confirm that they’re all Ready as well.
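If you want the same check as a number rather than by eye, a short awk filter over the table output is enough. This is a sketch under the assumption that the default column order (NAME, STATUS, ROLES) is in use; ready_control_planes is a name invented here.

```shell
#!/usr/bin/env bash
# Sketch: count Ready control-plane nodes from `kubectl get nodes` table output on stdin.
# Skips the header row, then requires STATUS to be exactly "Ready" and ROLES to include control-plane.
ready_control_planes() {
  awk 'NR > 1 && $2 == "Ready" && $3 ~ /control-plane/ { n++ } END { print n + 0 }'
}

# Sample input copied from the output above.
ready_control_planes <<'EOF'
NAME       STATUS   ROLES           AGE   VERSION
cp-1       Ready    control-plane   30m   v1.31.0
cp-2       Ready    control-plane   12m   v1.31.0
cp-3       Ready    control-plane   8m    v1.31.0
worker-1   Ready    <none>          25m   v1.31.0
worker-2   Ready    <none>          25m   v1.31.0
EOF
```

On a live cluster you would pipe k get nodes into the function instead of the sample text. A cordoned node reports Ready,SchedulingDisabled and is deliberately not counted by the exact match.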
Checking etcd members
In a stacked topology, the etcd members run as static Pods inside the control plane nodes. Use etcdctl member list to check the member count and status. etcd is protected with mTLS, so you have to pass the certificate paths along with the command.
ETCDCTL_API=3 etcdctl member list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
<id1>, started, cp-1, https://10.0.0.11:2380, https://10.0.0.11:2379, false
<id2>, started, cp-2, https://10.0.0.12:2380, https://10.0.0.12:2379, false
<id3>, started, cp-3, https://10.0.0.13:2380, https://10.0.0.13:2379, false
If there are three members and all are started, the etcd cluster is healthy and holds quorum. To see member status in more detail, endpoint health and endpoint status are also useful.
ETCDCTL_API=3 etcdctl endpoint health \
--cluster \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
These commands come up again in #7 etcd backup and restore and #24 Control plane troubleshooting, so it’s worth getting comfortable with them along with the certificate paths.
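To tie the member list back to quorum, the following sketch reads member list output in the comma-separated form shown earlier and reports whether the started members still form a majority. The function name etcd_quorum_check is made up for this example, and the parsing assumes the default simple output format.

```shell
#!/usr/bin/env bash
# Sketch: read `etcdctl member list` output on stdin and check quorum.
# Field 2 of each comma-separated line is the member status.
# Exits 0 when started members reach a majority, 1 otherwise.
etcd_quorum_check() {
  awk -F', ' '
    NF >= 2 { total++; if ($2 == "started") started++ }
    END {
      quorum = int(total / 2) + 1
      printf "members=%d started=%d quorum=%d\n", total, started, quorum
      exit (started >= quorum ? 0 : 1)
    }'
}

# Sample input copied from the member list output above.
etcd_quorum_check <<'EOF'
<id1>, started, cp-1, https://10.0.0.11:2380, https://10.0.0.11:2379, false
<id2>, started, cp-2, https://10.0.0.12:2380, https://10.0.0.12:2379, false
<id3>, started, cp-3, https://10.0.0.13:2380, https://10.0.0.13:2379, false
EOF
```

On a control plane node you would pipe the real etcdctl member list command, with its certificate flags, into the function. Note that started only means the member has joined the cluster; endpoint health is still the better test of whether each member is actually responding.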
Exam perspective
It’s uncommon for the CKA hands-on exam to ask you to build an HA cluster from start to finish: standing up an LB and joining three control planes would eat too much of the 2-hour exam. What the exam asks more often is whether you understand the concepts and etcd membership.
- What the single point of failure of a single control plane is, and how HA removes it
- The difference and trade-offs between stacked etcd and external etcd
- Why etcd members are kept at an odd number, and how many failures 3 members can tolerate
- Why --control-plane-endpoint must be specified from the start
- Checking members with etcdctl member list, and the flow of adding/removing members
In particular, working with etcd members via etcdctl member add / etcdctl member remove comes up again in #7 and #24, so we’ll make this post’s membership concepts the foundation for that.
Exam points
- The purpose of HA is to remove the single point of failure of the control plane. You replicate the control plane across multiple machines so that even when one dies, the cluster keeps operating.
- stacked etcd is kubeadm’s default topology that places etcd alongside the control plane node — simple, but with weak fault isolation.
- external etcd separates etcd into a dedicated cluster to isolate failures, but needs more nodes and a more complex setup.
- The front-end load balancer ties multiple apiservers behind a single endpoint. You use HAProxy/keepalived or a cloud LB.
- --control-plane-endpoint pins the LB address as the cluster’s fixed endpoint. For HA, specify it from the very first kubeadm init.
- etcd quorum is ⌊N/2⌋+1 (N/2 rounded down, plus 1). The standard is to keep members at an odd number (3 or 5); 3 members tolerate 1 failure, 5 members tolerate 2.
- A control plane join adds --control-plane and --certificate-key to kubeadm join. Certificates are shared via --upload-certs.
- Verification uses k get nodes to check for multiple control-plane roles and etcdctl member list to check that all etcd members are started.
Wrap-up
Here’s what this post locked in.
- A single control plane is a single point of failure where cluster operation halts when that node dies. HA removes this by replicating the control plane across multiple machines.
- There are two HA topologies: stacked etcd, which keeps etcd inside the control plane, and external etcd, which separates etcd into a dedicated cluster. You choose between them on the trade-off between simplicity and fault isolation.
- You place a load balancer in front of the multiple apiservers and use --control-plane-endpoint to set that address as the cluster’s fixed endpoint.
- etcd preserves consistency with quorum (a majority), so you keep members at an odd number. Three members tolerate one failure.
- You join control plane nodes with kubeadm join --control-plane plus a certificate key, and verify with k get nodes and etcdctl member list.
Next: cluster upgrades
Now that we’ve scaled the control plane to multiple machines with HA, it’s time to upgrade the version of this cluster. When there are multiple control planes, the upgrade has to proceed one machine at a time, in order, to finish without downtime.
In #6 Cluster upgrades, we’ll work through the flow of checking the upgrade path with kubeadm upgrade plan, raising the control plane by one minor version with kubeadm upgrade apply, and then using kubectl drain and kubectl uncordon per node to drain workloads and upgrade them.