Certified Kubernetes Administrator (CKA) #7: etcd Backup and Restore — etcdctl snapshot save/restore

Every bit of a Kubernetes cluster’s state gathers in one place. Deployments, Services, Secrets, RBAC rules, namespaces, node information — everything the cluster remembers lives in a key-value store called etcd. The apiserver is, in effect, just a gatekeeper sitting in front of etcd; the real data is held by etcd. So if etcd is lost, the cluster’s entire memory goes with it.

Because of this, etcd backup and restore is a near-regular on the CKA hands-on exam. The grading criteria are clear, and one or two correct command lines decide your score, so once the procedure is in muscle memory it becomes a reliable source of points. Conversely, drop a single certificate flag or forget to point etcd at the new data-dir after restoring, and you lose the whole task. In this post, we’ll walk through the entire process of taking a snapshot with etcdctl snapshot save and bringing it back with snapshot restore, calling out the traps along the way.

Where does etcd run #

In a cluster built with kubeadm, etcd runs as a static Pod on the control plane node. A static Pod is one that the kubelet brings up by directly watching a manifest directory on disk, without going through the apiserver. The etcd manifest sits at this location.

/etc/kubernetes/manifests/etcd.yaml

Everything you need for backup and restore is written inside this file. Rather than punching in values from memory, reading them from this file every time is how you avoid mistakes. You’re looking for these three things.

  • --data-dir: the directory where etcd actually writes its data. Usually /var/lib/etcd
  • The three certificate paths: --trusted-ca-file (cacert), --cert-file (cert), --key-file (key)
  • endpoint: the client address etcd listens on, given by --listen-client-urls. Usually https://127.0.0.1:2379

Let’s pull just those lines out of the manifest.

grep -E 'data-dir|cert-file|key-file|trusted-ca-file|listen-client-urls' \
  /etc/kubernetes/manifests/etcd.yaml

You’ll see roughly the following.

    - --data-dir=/var/lib/etcd
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --listen-client-urls=https://127.0.0.1:2379,https://10.0.0.10:2379

These lines give you the exact values that go into the backup command. Certificate paths can differ from cluster to cluster, so make a habit of confirming them in this file rather than relying on a path you memorized before the exam.
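
The same lookup can be scripted. What follows is a minimal sketch, not a definitive tool: the helper name etcd_flag is made up for illustration, and the sample manifest is a temporary file holding only the lines shown above so that the snippet runs anywhere. On a real control plane node you would pass /etc/kubernetes/manifests/etcd.yaml instead.

```shell
# Hypothetical helper: print the value of one etcd server flag from a manifest.
# $1 = flag name without leading dashes, $2 = manifest path
etcd_flag() {
  sed -n "s/^[[:space:]]*- --$1=//p" "$2" | head -n 1
}

# Sample manifest fragment (assumption: same values as in this post)
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
    - --data-dir=/var/lib/etcd
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --listen-client-urls=https://127.0.0.1:2379,https://10.0.0.10:2379
EOF

CACERT=$(etcd_flag trusted-ca-file "$manifest")
CERT=$(etcd_flag cert-file "$manifest")
KEY=$(etcd_flag key-file "$manifest")
# listen-client-urls may hold several URLs; take the first one
ENDPOINT=$(etcd_flag listen-client-urls "$manifest" | cut -d, -f1)

echo "endpoint=$ENDPOINT cacert=$CACERT cert=$CERT key=$KEY"
```

Anchoring the pattern on the leading "- --" means a flag such as --peer-cert-file does not match when you ask for cert-file.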

Preparing etcdctl #

You run the commands with etcdctl. Since you have to use the etcd v3 API, always set the environment variable ETCDCTL_API=3. In recent versions v3 is the default, but depending on the version in the exam environment it may operate as v2 and make the command fail entirely — so spelling it out is the safer move.

export ETCDCTL_API=3

# Check etcdctl is visible and confirm the version
etcdctl version

If etcdctl isn’t present, run it inside the etcd container or install it as a package. On a kubeadm cluster’s control plane node it’s usually already there.
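
A quick way to see which case you are in, as a small sketch. It relies only on the shell builtin command -v; the status variable is a name chosen here for illustration.

```shell
export ETCDCTL_API=3

# Check whether etcdctl is on PATH before relying on it
if command -v etcdctl >/dev/null 2>&1; then
  etcdctl version
  status="found"
else
  echo "etcdctl not found on PATH: use the etcd container or install it" >&2
  status="missing"
fi
```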

Step 1: Save the snapshot (snapshot save) #

Now we take the snapshot. You have to pass the endpoint and all three certificates. If even one is missing, the command won’t go through because of an authentication failure.

ETCDCTL_API=3 etcdctl snapshot save /opt/snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Don’t mix up the correspondence between the flag names. The manifest’s --trusted-ca-file is etcdctl’s --cacert, --cert-file is --cert, and --key-file is --key. The manifest side uses the flags the etcd server takes, while the etcdctl side uses the flags the client takes, so the names differ.
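
If it helps to see the mapping as a lookup, here is a small sketch. The function name client_flag is hypothetical; only the three flag pairs named above are covered.

```shell
# Hypothetical helper: translate an etcd server flag name into the
# etcdctl client flag that takes the same file.
client_flag() {
  case "$1" in
    --trusted-ca-file) printf '%s\n' "--cacert" ;;
    --cert-file)       printf '%s\n' "--cert" ;;
    --key-file)        printf '%s\n' "--key" ;;
    *) echo "no etcdctl equivalent known for $1" >&2; return 1 ;;
  esac
}

for flag in --trusted-ca-file --cert-file --key-file; do
  echo "$flag -> $(client_flag "$flag")"
done
```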

When the save finishes, Snapshot saved at /opt/snapshot.db is printed. Check that the snapshot was taken properly.

ETCDCTL_API=3 etcdctl snapshot status /opt/snapshot.db --write-out=table

If the hash, revision, key count, and DB size come out in a table like this, the snapshot file is readable.

+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| fe01cf57 |    24390 |       1287 |     5.4 MB |
+----------+----------+------------+------------+

snapshot status doesn’t need certificates, because it only reads a file you’ve already taken. Conversely, snapshot save connects to the live etcd and pulls data from it, so it needs both the endpoint and the certificates.
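
Before running snapshot status, it can be worth confirming that the file landed where the task asked for it. A minimal sketch, assuming the path /opt/snapshot.db used in this post; snapshot_present is a made-up helper name.

```shell
# Hypothetical helper: succeed only if the file exists and is non-empty.
snapshot_present() {
  [ -s "$1" ]
}

SNAPSHOT=${SNAPSHOT:-/opt/snapshot.db}   # path from the task statement
if snapshot_present "$SNAPSHOT"; then
  ls -l "$SNAPSHOT"
else
  echo "snapshot missing or empty: $SNAPSHOT" >&2
fi
```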

Step 2: Restore the snapshot (snapshot restore) #

Restoring is the task of unpacking the data from the snapshot file into a new data-dir. It’s safest to restore into a separate directory rather than overwriting the existing data-dir, then change the manifest so etcd points at that directory.

ETCDCTL_API=3 etcdctl snapshot restore /opt/snapshot.db \
  --data-dir=/var/lib/etcd-restore

There’s an important point here. snapshot restore doesn’t connect to a live etcd — it’s an offline task that unpacks data from a file into a local directory. So it needs no --endpoints or certificate flags. Some people reflexively attach certificates in the exam; with restore that’s harmless but essentially unnecessary. What you actually need is just --data-dir.

When the command finishes, a member directory is created inside /var/lib/etcd-restore, with the restored etcd data inside it.
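
A quick check for that member directory, sketched as a helper. The name restore_looks_ok is hypothetical, and the test is deliberately shallow: it only confirms the directory exists, not that the data inside is valid.

```shell
# Hypothetical helper: succeed if the restore target contains a member directory.
restore_looks_ok() {
  [ -d "$1/member" ]
}

RESTORE_DIR=${RESTORE_DIR:-/var/lib/etcd-restore}
if restore_looks_ok "$RESTORE_DIR"; then
  ls "$RESTORE_DIR/member"
else
  echo "no member directory under $RESTORE_DIR" >&2
fi
```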

Step 3: Make etcd point at the restored directory #

Restoring alone doesn’t bring the cluster back, because the live etcd is still pointing at the old data-dir (/var/lib/etcd). You have to edit the manifest so the etcd static Pod looks at the new directory.

Open /etc/kubernetes/manifests/etcd.yaml. The data-dir appears in two spots, the container args and the hostPath volume, but only one of them needs to change.

The container arg’s --data-dir can be left as is, because that value is a path inside the container. The place you actually change is the volume’s hostPath that mounts the host directory into the container.

  volumes:
  - hostPath:
      path: /var/lib/etcd-restore   # changed from the original /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data

The name: etcd-data volume is mounted at the path --data-dir points to inside the container (/var/lib/etcd). So if you change only the hostPath to the restore directory, etcd inside the container still sees data at the same path (/var/lib/etcd), but from the host’s perspective it reads the restored directory. Leaving the volumeMounts mountPath and the container arg --data-dir untouched is the simplest approach.
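
The edit is usually made by hand in an editor, but the same change can be sketched with sed. This is an illustration under assumptions: the sample manifest below is a temporary file containing only the relevant fragment, and the directory names are the ones used in this post. The backup copy is kept outside the manifests directory, which avoids any chance of the kubelet reading a second manifest from it.

```shell
# Sample manifest fragment so the sketch runs anywhere
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
EOF

# Keep a backup, then rewrite only the hostPath line.
# The pattern requires the line to start with "path:" after indentation and
# to end right after /var/lib/etcd, so mountPath and the certs path are untouched.
backup=$(mktemp)
cp "$manifest" "$backup"
edited=$(mktemp)
sed 's|^\([[:space:]]*\)path: /var/lib/etcd$|\1path: /var/lib/etcd-restore|' \
  "$manifest" > "$edited"

diff "$backup" "$edited" || true   # show the one changed line
cp "$edited" "$manifest"
```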

Step 4: Restart etcd and verify #

When you save the manifest, the kubelet detects the change and recreates the etcd Pod. Since it’s a static Pod, you don’t need a command like kubectl apply. Just editing the file makes the kubelet kill the old container and bring it back up with the new volume.

The new etcd usually comes up a moment later. If the change is slow to take effect, you can force a reload by briefly moving the manifest out of the directory and putting it back.

# When a forced reload is needed
mv /etc/kubernetes/manifests/etcd.yaml /tmp/etcd.yaml
# After a moment
mv /tmp/etcd.yaml /etc/kubernetes/manifests/etcd.yaml

Once etcd comes up with the new data, the apiserver reconnects to etcd and the cluster state returns to the moment of the snapshot. Check that the container is up and that the apiserver responds.

# etcd container status via crictl
crictl ps | grep etcd

# Trace the restart via kubelet logs
journalctl -u kubelet -f

# If the apiserver is back, the cluster responds
kubectl get nodes
kubectl get pods -A

Once the kubectl get commands show the resources from the moment of the snapshot, the restore is complete.
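
Because the apiserver takes a little while to reconnect, the first kubectl call often fails. One way to avoid retyping it is a small retry loop, sketched below. The function name wait_for and the attempt counts are arbitrary choices for illustration.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# $1 = max attempts, $2 = seconds between attempts, rest = command to run
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# On a control plane node, for example:
#   wait_for 30 5 kubectl get nodes && kubectl get pods -A
```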

When etcd is external #

Up to here, the assumption has been the stacked etcd covered in #4 (a setup where etcd runs as a static Pod on the control plane node). The external etcd cluster from #5 is different. Because etcd runs as a systemd service on a separate node, you proceed not by editing a manifest but with this flow: systemctl stop etcd → restore → change the data-dir in the etcd config file → systemctl start etcd. The save/restore commands themselves are identical.
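
The order of that flow can be written down as a plan. The sketch below only prints the commands and runs nothing, since the location of the etcd config file depends on how the service was installed and is not given here. The function name external_restore_plan is hypothetical.

```shell
# Hypothetical helper: print the external-etcd restore steps in order.
# $1 = snapshot file, $2 = new data directory
external_restore_plan() {
  cat <<EOF
systemctl stop etcd
ETCDCTL_API=3 etcdctl snapshot restore $1 --data-dir=$2
# edit the etcd config file so its data-dir is $2
systemctl start etcd
EOF
}

external_restore_plan /opt/snapshot.db /var/lib/etcd-restore
```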

Traps #

The patterns that make people get this task wrong on the CKA hands-on exam are pretty much fixed.

  • Missing ETCDCTL_API=3: under the v2 API, etcdctl has no snapshot subcommand, so the command fails. Prefix the command with ETCDCTL_API=3 or set it once with export.
  • Missing certificate flags: snapshot save requires the endpoint and all three of the cacert/cert/key certificates. Drop even one and it fails on an authentication error. Don’t memorize the paths — confirm them in the manifest.
  • Trying to pass certificates to restore: conversely, snapshot restore is a file-based offline task, so it needs no endpoint or certificates. The key is not to drop the --data-dir that you actually need.
  • Not reflecting the data-dir after restore: this is the most common mistake. If you only restore and don’t change the manifest’s hostPath to the new directory, etcd keeps looking at the old data, so the cluster isn’t recovered.
  • Confusing flag names: people mix the server-side --trusted-ca-file/--cert-file/--key-file with the client-side --cacert/--cert/--key. With etcdctl you use the client names.
  • Not checking the save path: if you save to a place different from the snapshot path the question specifies (e.g., /opt/snapshot.db), the grader won’t find it. Confirm the saved file with snapshot status.
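
The first two traps can be caught before running snapshot save. Here is a minimal pre-flight sketch: preflight is a made-up function name, and the checks are limited to the API version variable and whether each file you pass can be read.

```shell
# Hypothetical helper: verify ETCDCTL_API and that every given file is readable.
# Arguments: the cacert, cert, and key paths read from the manifest
preflight() {
  rc=0
  if [ "${ETCDCTL_API:-}" != "3" ]; then
    echo "ETCDCTL_API is not set to 3" >&2
    rc=1
  fi
  for f in "$@"; do
    if [ ! -r "$f" ]; then
      echo "cannot read $f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example with the paths from this post:
#   export ETCDCTL_API=3
#   preflight /etc/kubernetes/pki/etcd/ca.crt \
#             /etc/kubernetes/pki/etcd/server.crt \
#             /etc/kubernetes/pki/etcd/server.key
```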

Exam points #

  • etcd backup and restore is a regular fixture on the CKA exam, and the accuracy of the commands decides your score.
  • Backup: ETCDCTL_API=3 etcdctl snapshot save <file> --endpoints --cacert --cert --key. Confirm all certificates in the manifest and plug them in.
  • Restore: ETCDCTL_API=3 etcdctl snapshot restore <file> --data-dir=<new directory>. No endpoint or certificates needed; --data-dir is required.
  • After restore, you have to change the hostPath in /etc/kubernetes/manifests/etcd.yaml to the new data-dir so the kubelet restarts etcd with the new data.
  • Confirm the integrity of the snapshot you took with snapshot status --write-out=table.
  • For external etcd, you stop/start with systemd instead of the manifest, and the save/restore commands are the same.

Wrap-up #

What this post locked in:

  • etcd is the single store of cluster state. An etcd backup is a cluster backup.
  • The three backup essentials: ETCDCTL_API=3, the endpoint (https://127.0.0.1:2379), and the three certificates. The paths are read from /etc/kubernetes/manifests/etcd.yaml.
  • The restore flow: restore to a new data-dir with snapshot restore → change the manifest hostPath → the kubelet restarts etcd → the apiserver recovers.
  • Traps: missing API version, missing certificate flags, not reflecting the data-dir after restore.

Now, even if you lose etcd, a single snapshot lets you recover the cluster to a point in time. The other thing to protect alongside a backup is certificates. Even with etcd data alive, if the certificates expire, the apiserver and etcd no longer trust each other and communication is cut off.

Next — Certificate Management #

In #8 Certificate Management: PKI, kubeconfig, Certificate Renewal, we’ll lay out the PKI structure kubeadm builds (a certificate chain with ca.crt at its apex), how a kubeconfig file holds authentication information, and the procedure for renewing an expired certificate with kubeadm certs renew. In CKA troubleshooting, certificate expiry often shows up as the cause of the apiserver dying entirely, so we’ll drill the skill of checking whether something has expired and bringing it back to life.
