Deployment and ReplicaSet
This chapter covers declarative deployment and rolling updates: the three-tier relationship among Deployment, ReplicaSet, and Pod; self-healing with replicas: 3; RollingUpdate’s maxSurge and maxUnavailable; rollback with rollout undo; and the workloads a Deployment doesn’t solve (StatefulSet · DaemonSet · Job).
Chapter 3, kubectl and your first Pod, ended on one observation: a Pod is mortal, so a Pod you create directly simply disappears when it dies. That observation is the starting point of this chapter. Here we cover the first controller that fills that gap automatically, the Deployment, and the ReplicaSet beneath it: how replicas: 3 keeps Pods running, how self-healing works when you delete one Pod, and how rolling updates and rollbacks work when you change the image tag.
By the end of this chapter you’ll have your first manifest that leaves Pods to a controller instead of bringing them up by hand. From here on, this is essentially the base form of the manifests people write for operations.
Deployment, ReplicaSet, Pod: the relationship of three tiers
The stars of this chapter are three resources, but you write only the top layer. It helps to fix the picture in your head like this.
┌──────────────────────┐
│      Deployment      │  ← the manifest a person writes
│        (web)         │
└──────────┬───────────┘
           │ creates / manages
           ▼
┌──────────────────────┐
│      ReplicaSet      │  ← Deployment creates it automatically
│     (web-abc123)     │
└──────────┬───────────┘
           │ creates / maintains
           ▼
┌──────────┬──────────┬──────────┐
│   Pod    │   Pod    │   Pod    │  ← the actual workload
│ web-...  │ web-...  │ web-...  │
└──────────┴──────────┴──────────┘
Organizing each tier’s responsibility in one line gives this.
- Deployment — the manifest you write. It says, “bring up N copies of this Pod template, and switch to a new version with a rolling update.” It is the resource you touch most often in operations.
- ReplicaSet — the intermediate object a Deployment creates automatically. Its responsibility is simple: always maintain N copies of this Pod template. People rarely write a ReplicaSet manifest directly.
- Pod — the actual workload. The ReplicaSet creates it, and when it dies the ReplicaSet creates it again. It is the same Pod we brought up by hand in Chapter 3, but now it comes back on its own no matter who kills it.
Why two layers
At first glance, one Deployment layer seems enough. You may ask why ReplicaSet exists separately. The reason is version rollout.
When deploying a new version, the Deployment creates one more ReplicaSet for the new template. It raises that new ReplicaSet’s replicas gradually to 1, 2, 3 while lowering the old ReplicaSet’s replicas gradually to 2, 1, 0. In between there’s a short stretch where both ReplicaSets’ Pods are up together, and that overlap is the essence of the rolling update. When done, the old ReplicaSet sits empty at 0 but remains as an object, so when a rollback is needed you revert by growing that one back to N.
In short, a Deployment is the layer that handles the transition between versions and a ReplicaSet is the layer that maintains one version at N. Because the two are split, the old version and the new version can briefly coexist in the same cluster.
Your first Deployment manifest
This time write the same nginx:1.27 as a Deployment rather than a Pod. Name the file web.yaml and write it as follows.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels:
    app: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
Compared with the Pod manifest from Chapter 3, there are three newly arrived parts.
- apiVersion: apps/v1. A Pod used v1, but a Deployment belongs to the apps API group, at version v1. The controller-family resources (Deployment, StatefulSet, DaemonSet, ReplicaSet) all belong to the same group.
- spec.replicas: 3. A declaration that 3 Pods of this template must always be up.
- spec.selector.matchLabels + spec.template. The label condition by which the Deployment finds the Pods it manages, and the template that says what shape those Pods should have. The shape under template is exactly the same as the Pod’s metadata + spec we saw in Chapter 3.
One rule: the selector and template labels must match
The mistake that blows up most often when first writing a manifest is here. The labels in spec.template.metadata.labels must satisfy spec.selector.matchLabels: every key and value in the selector has to appear on the template (extra labels on the template are allowed). If they don’t match, K8s rejects the manifest. It’s not a mere recommended convention but a validation rule the apiserver enforces.
That’s why both are set to app: web in the manifest above. If you set the selector to app: web but changed only the template’s label to app: nginx, kubectl apply spits out an error like the following.
The Deployment "web" is invalid: spec.template.metadata.labels:
  Invalid value: map[string]string{"app":"nginx"}:
  `selector` does not match template `labels`
A simple model in your head is this: the selector is “how I recognize the Pods I manage,” and the template is “the labels of the Pods I create.” The near-tautological rule is that the two must match so that I recognize the Pods I created.
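The matching rule can be sketched in a few lines of Python. This is only an illustration of the label comparison, not the apiserver’s validation code; the function name selector_matches and the extra tier label are assumptions made up for the example.

```python
def selector_matches(match_labels: dict, pod_labels: dict) -> bool:
    """Return True when every selector key/value pair appears in the Pod labels."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


selector = {"app": "web"}

print(selector_matches(selector, {"app": "web"}))                   # True
print(selector_matches(selector, {"app": "web", "tier": "front"}))  # True: extra labels are fine
print(selector_matches(selector, {"app": "nginx"}))                 # False: the rejected case
```

The third call corresponds to the rejected manifest above: the selector asks for app: web, and the template would produce Pods the Deployment could never recognize as its own.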
Applying it
Apply web.yaml to the cluster.
kubectl apply -f web.yaml
deployment.apps/web created
Let’s see the three kinds of resource at once. kubectl get can take several resource kinds at once, separated by commas.
kubectl get deploy,rs,pods
NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/web   3/3     3            3           20s

NAME                         DESIRED   CURRENT   READY   AGE
replicaset.apps/web-abc123   3         3         3       20s

NAME                  READY   STATUS    RESTARTS   AGE
pod/web-abc123-aa11   1/1     Running   0          20s
pod/web-abc123-bb22   1/1     Running   0          20s
pod/web-abc123-cc33   1/1     Running   0          20s
How to read it is as follows.
- Deployment line: READY 3/3 means all 3 desired Pods are ready, UP-TO-DATE 3 is the number of Pods updated to the current template, and AVAILABLE 3 is the number of Pods that have been Ready for at least minReadySeconds (0 by default) and are fit to take traffic.
- ReplicaSet line: the name has the form web-abc123. The trailing abc123 is not random; it is a hash K8s computes from the Pod template (the same value appears on the Pods as the pod-template-hash label), so a changed template gets a differently named ReplicaSet. DESIRED 3 / CURRENT 3 / READY 3 are the key columns.
- Pod line: the name is the ReplicaSet name plus a random suffix, like web-abc123-aa11. The front part (web-abc123) matches the ReplicaSet name, so the name itself shows who created the Pod.
Pinning the name pattern in one line: <deployment>-<replicaset-hash>-<pod-suffix>. You will meet this shape often through the end of the book.
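As a small illustration of that pattern, the sketch below splits a Pod name back into its three parts. The helper split_pod_name and the second sample name are hypothetical, and the approach assumes the Pod really was created through a Deployment; the authoritative link between the objects is the Pod’s metadata.ownerReferences, not its name.

```python
def split_pod_name(pod_name: str) -> tuple:
    """Split '<deployment>-<replicaset-hash>-<pod-suffix>' into its three parts.

    Splitting from the right keeps Deployment names that contain hyphens intact.
    """
    deployment, template_hash, suffix = pod_name.rsplit("-", 2)
    return deployment, template_hash, suffix


print(split_pod_name("web-abc123-aa11"))     # ('web', 'abc123', 'aa11')
print(split_pod_name("my-api-def456-zz99"))  # ('my-api', 'def456', 'zz99')
```

A name with fewer than two hyphens raises ValueError here, which is a reasonable signal that the Pod did not come from a Deployment.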
Killing a Pod: self-healing
It’s time to confirm this manifest’s first effect. In Chapter 3, deleting a Pod just made it disappear. Now let’s confirm how it’s different.
kubectl delete pod web-abc123-aa11
pod "web-abc123-aa11" deleted
Immediately get the Pod list again.
kubectl get pods
NAME              READY   STATUS    RESTARTS   AGE
web-abc123-bb22   1/1     Running   0          2m
web-abc123-cc33   1/1     Running   0          2m
web-abc123-dd44   1/1     Running   0          5s
There are still three up. But looking closely at the lines shows a difference: bb22 and cc33 are AGE 2m, while the newly seen dd44 is AGE 5s. It’s a Pod just brought up fresh. The changed random value after the name is also a clue that it’s a new Pod.
This is the reconcile loop from the diagram in Chapter 1, What Kubernetes Is, at work. The ReplicaSet holds “3 Pods must exist,” and the moment a person deleted one, desired (3) and actual (2) diverged. The ReplicaSet controller inside the controller-manager detects that difference and asks the apiserver to create one more Pod. The scheduler picks a node, the kubelet brings up the container, and it’s back to 3. A person did nothing extra.
The same thing happens at the node level. When a node that hosts some of these Pods dies, K8s does not literally move the Pods. Once the node has stayed unreachable for a while (about five minutes with the default tolerations), the Pods on it are marked for deletion, and the ReplicaSet controller creates replacements that the scheduler places on other live nodes. “The service must stay alive even when a node dies,” which we saw in Chapter 1, is in practice the problem this ReplicaSet controller solves.
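The desired-versus-actual comparison described above can be sketched as a toy loop. This is a simplified model, not the controller’s real code: the function reconcile, the in-memory set of Pod names, and the generated names such as web-abc123-new1 are all assumptions for illustration, and the real controller also removes surplus Pods.

```python
import itertools


def reconcile(desired: int, pods: set, fresh_names) -> list:
    """Create Pods until the live count is back at the desired count."""
    created = []
    while len(pods) < desired:
        name = next(fresh_names)
        pods.add(name)  # stands in for "ask the apiserver to create a Pod"
        created.append(name)
    return created


pods = {"web-abc123-aa11", "web-abc123-bb22", "web-abc123-cc33"}
fresh_names = (f"web-abc123-new{i}" for i in itertools.count(1))

pods.discard("web-abc123-aa11")            # a person deletes one Pod: actual = 2
created = reconcile(3, pods, fresh_names)  # the loop notices 2 != 3
print(created)    # ['web-abc123-new1']
print(len(pods))  # 3
```

Calling reconcile again right away creates nothing, which is the point of the model: the loop only acts when desired and actual differ.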
Adjusting replicas
When 3 is too few or too many, there are two paths to adjusting the count.
Declarative — change the number in the manifest and apply again. The cleanest path.
spec:
  replicas: 5
  ...
kubectl apply -f web.yaml
deployment.apps/web configured
Looking at kubectl get pods, it’s soon grown to 5. Shrinking works the same way: reduce the number in the manifest and apply.
Imperative — fast but temporary.
kubectl scale deploy/web --replicas=5
deployment.apps/web scaled
It’s light for momentarily growing and shrinking. But the downside is clear: the manifest’s replicas value and the cluster’s actual state diverge. The manifest still says replicas: 3 while the cluster has 5 up. The next time someone runs kubectl apply -f web.yaml without much thought, the 5 shrink back to 3.
So, organizing it in one line — the declarative manifest is always the source of truth. Use kubectl scale only when you need to touch something up quickly while debugging, or when you intend to re-sync the manifest soon, and the normal flow is to fix the manifest and apply. This principle is the foundation of the entire desired state model we saw in Chapter 1, and it’s covered more seriously once more with ArgoCD / Flux in Chapter 20 GitOps.
The option to grow and shrink replicas automatically with load instead of a person deciding is covered in Chapter 13 Autoscaling with HPA · VPA · Cluster Autoscaler.
Rolling update: the default behavior of zero-downtime deploys
Now, for the first time in the book, we deploy a new version. Change the image tag from nginx:1.27 to nginx:1.28 — just one character.
      containers:
        - name: web
          image: nginx:1.28
          ports:
            - containerPort: 80
kubectl apply -f web.yaml
deployment.apps/web configured
The message on the surface is one line, but inside, quite a lot happens.
What happens inside
The Deployment controller sees the template changed and creates a new ReplicaSet for that new template. Then it grows the new RS’s replicas from 0 to 1, 2, 3, and shrinks the old RS’s replicas from 3 to 2, 1, 0. At each step a new Pod must come into Ready state before it moves to the next step.
If you look with kubectl get rs mid-progress, two ReplicaSet lines show.
kubectl get rs
NAME         DESIRED   CURRENT   READY   AGE
web-abc123   2         2         2       10m   ← old RS (1.27)
web-def456   2         2         1       30s   ← new RS (1.28)
The old RS has shrunk from 3 to 2, and the new RS has risen from 0 to 2. This one frame is the essence of the rolling update. The two RS’s Pods are briefly up together, and during that time the Service we’ll cover in Chapter 5 Service spreads traffic across the Ready Pods of both versions.
Monitoring progress
If you want to see rollout progress in one line, the following command is the most convenient.
kubectl rollout status deploy/web
Waiting for deployment "web" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "web" rollout to finish: 2 out of 3 new replicas have been updated...
Waiting for deployment "web" rollout to finish: 1 old replicas are pending termination...
deployment "web" successfully rolled out
Progress steps print line by line, and when the success message appears at the end, the new-version deploy is done. Looking at kubectl get rs again afterward shows the old RS empty at DESIRED 0 but remaining as an object. This structure makes the rollback in the next section possible.
The default strategy, in one line
The exact reason the flow above happens is that a Deployment’s spec.strategy default is RollingUpdate, and its two parameters are as follows.
- maxSurge: 25% is the limit on how many Pods can be temporarily up beyond the desired count. The percentage rounds up, so with 3 as the base, up to 1 extra is allowed.
- maxUnavailable: 25% is the limit on how many Pods can be temporarily short of the desired count. The percentage rounds down, so with 3 as the base it resolves to 0: no old Pod is removed until a new one is Ready.
The other option is Recreate — kill all old Pods and bring up new ones. It isn’t zero-downtime, but it’s sometimes used for stateful workloads where two versions must not be up at once (e.g., a DB migration occupying the same volume). For a regular web server the default RollingUpdate is enough.
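The two percentages are rounded in opposite directions: maxSurge rounds up and maxUnavailable rounds down. The following sketch works through that arithmetic for a few replica counts; the function name resolve_rolling_limits is made up for the example, and it only handles percentage values, not absolute numbers.

```python
def resolve_rolling_limits(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Turn percentage limits into Pod counts for a given replica count."""
    surge = -(-replicas * max_surge_pct // 100)          # ceiling division
    unavailable = replicas * max_unavailable_pct // 100  # floor division
    return surge, unavailable


for replicas in (3, 4, 10):
    surge, unavailable = resolve_rolling_limits(replicas)
    print(f"replicas={replicas}: up to {replicas + surge} Pods, "
          f"at least {replicas - unavailable} available")
# replicas=3: up to 4 Pods, at least 3 available
# replicas=4: up to 5 Pods, at least 3 available
# replicas=10: up to 13 Pods, at least 8 available
```

For the 3-replica Deployment in this chapter the result is one extra Pod and zero missing Pods, which is why a rollout here never dips below 3 available.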
What happens on failure
Let’s deliberately write the image tag wrong — say something like nginx:1.99-not-real.
kubectl apply -f web.yaml
kubectl rollout status deploy/web stays stuck for a long time (with the default progressDeadlineSeconds of 600 it eventually gives up with an error), and looking with kubectl get pods shows one newly created Pod in ImagePullBackOff state.
NAME              READY   STATUS             RESTARTS   AGE
web-abc123-aa11   1/1     Running            0          15m
web-abc123-bb22   1/1     Running            0          15m
web-abc123-cc33   1/1     Running            0          15m
web-ghi789-zz99   0/1     ImagePullBackOff   0          40s
The interesting part is that the 3 old Pods are still alive. If the new Pod can’t come into Ready, the Deployment doesn’t move to the next step. With 3 replicas the default maxUnavailable resolves to 0, so it cannot remove even one old Pod, let alone shrink the old RS to 0. Even while the rollout is stuck, the old version takes traffic normally. The heart of zero-downtime is here.
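Why the old Pods survive can be seen in a toy simulation of the two scaling rules. This is a deliberately simplified model under stated assumptions, not the Deployment controller’s algorithm: the function rolling_update is hypothetical, a new Pod is either Ready immediately or never, and all old Pods are assumed healthy.

```python
def rolling_update(desired, max_surge, max_unavailable,
                   new_pods_become_ready=True, max_steps=20):
    """Return the (old, new, new_ready) replica counts after each step."""
    old, new, new_ready = desired, 0, 0
    history = [(old, new, new_ready)]
    for _ in range(max_steps):
        # Scale the new RS up, never exceeding desired + max_surge Pods in total.
        room = desired + max_surge - (old + new)
        new += max(min(room, desired - new), 0)
        # In this toy model a new Pod is Ready right away, or never.
        if new_pods_become_ready:
            new_ready = new
        # Scale the old RS down, keeping desired - max_unavailable Pods available.
        spare = (old + new_ready) - (desired - max_unavailable)
        old -= max(min(spare, old), 0)
        state = (old, new, new_ready)
        if state == history[-1]:
            break  # nothing can change any more: the rollout is stuck
        history.append(state)
        if old == 0 and new_ready == desired:
            break  # rollout complete
    return history


print(rolling_update(3, 1, 0))
# [(3, 0, 0), (2, 1, 1), (1, 2, 2), (0, 3, 3)]
print(rolling_update(3, 1, 0, new_pods_become_ready=False))
# [(3, 0, 0), (3, 1, 0)]
```

The second run ends at 3 old Pods plus 1 new Pod that never becomes Ready, the same picture as the kubectl get pods output above.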
The debugging order at this point is exactly as organized at the end of Chapter 3.
kubectl describe deploy/web
kubectl describe pod web-ghi789-zz99
In describe deploy, once the progress deadline (600 seconds by default) has passed, the Conditions section shows Progressing as False with the reason ProgressDeadlineExceeded (the underlying condition message reads like ReplicaSet ... has timed out progressing), and describe pod’s Events have Failed to pull image "nginx:1.99-not-real". The answer is almost always in these two outputs. The finished version of the diagnostic tree is organized in Chapter 27 kubectl debugging patterns.
Rollback
If you find a new version went up wrong, the path to revert to the old version is prepared in one line.
kubectl rollout history deploy/web
deployment.apps/web
REVISION  CHANGE-CAUSE
1         <none>
2         <none>
A revision list shows. Number 1 is the initial nginx:1.27, and number 2 is the nginx:1.28 we just put up. (If you also applied the broken tag from the previous section, it appears as revision 3, and undo takes you back to revision 2.) To revert to the immediately previous revision, do this.
kubectl rollout undo deploy/web
deployment.apps/web rolled back
If you want to specify a particular revision, use the --to-revision flag.
kubectl rollout undo deploy/web --to-revision=1
The reason this is possible is summed up in one line: a revision is another guise of a ReplicaSet. When deploying the new version, the old ReplicaSet didn’t disappear; it stayed emptied to replicas: 0. undo scales that old RS back up to N with the same rolling-update mechanism, so the old version takes traffic again as soon as its Pods become Ready.
By default K8s keeps up to 10 old ReplicaSets as revisions. You can raise or lower that with the manifest’s spec.revisionHistoryLimit. Set it too high and old ReplicaSets pile up in the cluster and clutter kubectl get rs; set it too low and you can’t jump straight back to a much older version. Match it to the deploy frequency of your operational environment. For a typical web service the default 10 is fine.
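The idea that a revision is a ReplicaSet kept at zero replicas can be modeled in a short sketch. Everything here is a simplified assumption for illustration: the classes are hypothetical, a Pod template is identified by its image string alone, and scaling happens in one jump instead of a rolling transition. One detail it does reflect is that a rolled-back ReplicaSet is reused and receives the next revision number.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaSet:
    image: str
    revision: int = 0
    replicas: int = 0


@dataclass
class Deployment:
    replicas: int
    history_limit: int = 10  # plays the role of spec.revisionHistoryLimit
    replica_sets: list = field(default_factory=list)

    def _activate(self, target):
        # The activated ReplicaSet always receives the next revision number.
        target.revision = max(rs.revision for rs in self.replica_sets) + 1
        for rs in self.replica_sets:
            rs.replicas = self.replicas if rs is target else 0
        # Keep only the newest `history_limit` old ReplicaSets.
        old = sorted((rs for rs in self.replica_sets if rs is not target),
                     key=lambda rs: rs.revision)
        keep = old[max(len(old) - self.history_limit, 0):]
        self.replica_sets = keep + [target]

    def apply(self, image):
        current = next((rs for rs in self.replica_sets if rs.image == image), None)
        if current is None:
            current = ReplicaSet(image=image)
            self.replica_sets.append(current)
        elif current.replicas == self.replicas:
            return  # template unchanged: no new revision
        self._activate(current)

    def undo(self):
        # Roll back to the second-newest revision, reusing its ReplicaSet.
        ordered = sorted(self.replica_sets, key=lambda rs: rs.revision)
        self._activate(ordered[-2])


web = Deployment(replicas=3)
web.apply("nginx:1.27")  # revision 1
web.apply("nginx:1.28")  # revision 2
web.undo()               # the 1.27 ReplicaSet comes back as revision 3
print([(rs.image, rs.revision, rs.replicas) for rs in web.replica_sets])
# [('nginx:1.28', 2, 0), ('nginx:1.27', 3, 3)]
```

With history_limit set to 1, a third image would push the oldest ReplicaSet out of the list, which is the trade-off the paragraph above describes.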
What Deployment doesn’t solve
The Deployment we’ve covered so far doesn’t handle every workload shape. Let’s note the cases of a different kind in one section.
- Stateful workloads: workloads where each instance must have its own name and its own disk, like a database, fit StatefulSet. Pod names are assigned stably as web-0, web-1, and each Pod is bound 1:1 to its own PVC, created from the volumeClaimTemplates in the manifest. The start order is also guaranteed as 0→1→2. A Deployment gives its Pods random names and no per-Pod disk, so it’s not suitable for a DB.
- Workloads that must run one per node: workloads like log collectors (Fluent Bit, Filebeat), node monitors (Node Exporter), and CNI agents fit DaemonSet. When a new node joins the cluster, one Pod comes up on that node automatically.
- One-off jobs: a task that runs once and finishes, like a migration, backup, or batch job, uses a Job (run immediately) or CronJob (scheduled). It’s a workload where a Pod ending in phase Succeeded is natural.
These three are covered in Chapter 8 StatefulSet / DaemonSet / Job / CronJob. This chapter focuses only on the most-touched Deployment. Still, it’s good to fix the classification in your head up front — no state, Deployment; state, StatefulSet; one per node, DaemonSet; one-off, Job.
Cleanup and teardown
Wipe today’s resources clean. Deleting one Deployment cleans up the ReplicaSet and Pods beneath it together. This is the part where K8s garbage-collects through the parent-child relationship (owner reference).
kubectl delete -f web.yaml
deployment.apps "web" deleted
kubectl get deploy,rs,pods
No resources found in default namespace.
There’s also the path of deleting directly by name.
kubectl delete deploy web
Confirm that deleting only the Deployment makes the ReplicaSet and Pods disappear all at once. This owner reference model works the same way later in the book too.
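The parent-child cleanup can be pictured as a walk over owner links. The sketch below is a toy model of that idea, not the garbage collector’s implementation: cascade_delete and the owners mapping are made up for the example, each object has at most one owner, and pod/standalone is a hypothetical Pod created by hand.

```python
from typing import Dict, Optional, Set


def cascade_delete(owners: Dict[str, Optional[str]], root: str) -> Set[str]:
    """Return every object removed when `root` is deleted.

    `owners` maps an object name to the name of its owner (None = no owner).
    """
    deleted = {root}
    changed = True
    while changed:  # repeat until no more dependents are found
        changed = False
        for name, owner in owners.items():
            if owner in deleted and name not in deleted:
                deleted.add(name)
                changed = True
    return deleted


owners = {
    "deployment/web": None,
    "replicaset/web-abc123": "deployment/web",
    "pod/web-abc123-bb22": "replicaset/web-abc123",
    "pod/web-abc123-cc33": "replicaset/web-abc123",
    "pod/standalone": None,  # hypothetical Pod created by hand, no owner
}
print(sorted(cascade_delete(owners, "deployment/web")))
# ['deployment/web', 'pod/web-abc123-bb22', 'pod/web-abc123-cc33', 'replicaset/web-abc123']
```

The Pod without an owner is untouched, which matches Chapter 3: nothing manages it, so nothing cleans it up either.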
Exercises
- After bringing up web.yaml with replicas: 3 as in the main text, force-delete one Pod with kubectl delete pod <name>. Record in time order how the AGE column of kubectl get pods changes, and note how the random value after the new Pod’s name changed. Write a paragraph about where the ReplicaSet controller’s reconcile loop filled the difference.
- After rolling out once from nginx:1.27 to nginx:1.28, deliberately apply once more with a nonexistent tag like nginx:1.99-not-real. Organize how kubectl get rs and kubectl get pods get stuck, and how the 3 old Pods remaining in a state that can take traffic connects to the zero-downtime heart of §“What happens on failure.” Record the full flow up to reverting cleanly with kubectl rollout undo deploy/web as one pass.
- After growing imperatively with kubectl scale deploy/web --replicas=5, re-run kubectl apply of the manifest (replicas: 3) once more and confirm the 5 shrink back to 3. Note in a paragraph how the conclusion of §“Adjusting replicas,” “the declarative manifest is always the source of truth,” connects to the mental model of Chapter 20 GitOps.
In one line: a Deployment is the manifest that handles “maintain N of this Pod template + rolling transition to a new version,” the ReplicaSet beneath it takes responsibility for maintaining N of one version, and a Pod takes responsibility for the actual execution. Even if you delete a Pod, the ReplicaSet brings up a new one and self-healing works automatically. A new-version deploy is a rolling update that gradually grows the new RS and gradually shrinks the old RS; on failure the old RS keeps taking traffic, and one line of kubectl rollout undo reverts by growing the old RS again.
Next chapter
Even this far, one thing is still unsolved — how do you send traffic from outside into a Pod inside the cluster? The 3 nginx Pods we just created merely have cluster-internal IPs, and those IPs change every time a Pod dies and comes back up. The ReplicaSet revives Pods automatically, but since the IP of a Pod revived that way differs each time, where the client should connect is ambiguous.
In Chapter 5 Service we cover how a Service puts a stable virtual IP / DNS name in front of Pods, the difference between the three kinds — ClusterIP for cluster-internal communication, NodePort that opens externally via a node port, and LoadBalancer that attaches a cloud load balancer — and the flow of attaching a Service in front of the app: web Pods we brought up in this chapter to make your first external connection. The 3 Pods of this chapter become a “service with an address” for the first time there.