Chapter 30

Upgrade Strategy

The last chapter of Part 5. An operations manual for safely keeping up with Kubernetes minor releases (14 months of support). It covers the order control plane → data plane (nodes) → add-ons, deprecated API detection (pluto · kubent · apiserver metric), the API-version migration of manifests / Helm / Operator CRs, the node group / Karpenter NodePool drift flow of EKS, the safety devices of node drain (PDB · terminationGracePeriodSeconds), minimizing the blast radius, rollback scenarios, choosing a backup per RPO / RTO, and the checklist for the week before, the day of, and the week after the upgrade.

This is the last chapter of Part 5 (Operations · Debugging · Cost). If Chapter 27, kubectl Debugging Patterns dealt with incidents, Chapter 28, Cost Optimization with the bill, and Chapter 29, Secret Operations with security, this chapter deals with time. Kubernetes ships a minor version roughly every four months (about three a year), and EKS’s standard support period is 14 months. To keep a production cluster within standard support, at least one minor upgrade a year is an essential cycle, and the manual for running that cycle safely is the main text of this chapter.

The flow briefly pointed at in Chapter 26, The Operations Checklist §“EKS upgrade” unfolds here. The goal of this chapter is to make an operational model where the checklist for the week before, the day of, and the week after the upgrade fits on one page, and one minor upgrade per quarter runs without incident.

The K8s release cycle — a year’s timetable #

We restate K8s’s own version policy.

the K8s minor release cycle
release cadence:        about 4 months
support period (upstream): 12 ~ 14 months
EKS standard support:    14 months
EKS extended support:    an additional 12 months (paid)

deprecated marking:     one minor release ahead
removed point:          2 ~ 3 minor releases later

From this timetable, two signals come into a production cluster.

  • deprecated notice — at one minor release, “this API will be removed soon” is shown. It’s usually actually removed 2 ~ 3 minors later (8 ~ 12 months later).
  • standard support expiry — the EKS console shows “this cluster’s standard support expires in X months.” From that point new patches stop coming in, and moving to extended support adds cost.

The standard for a production cluster is to put these two signals onto a quarterly calendar in advance. The 8 ~ 12 months between deprecated → removed is the time window in which you can clean up manifests. Miss this window and the manifest is rejected on the new cluster, causing downtime.
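To put the two signals on a calendar, the date arithmetic can be sketched as follows. This is a minimal sketch under stated assumptions: the release date is a made-up example, `support_calendar` and `add_months` are hypothetical helper names, and the "plan one quarter ahead" margin is an illustrative choice rather than an EKS rule.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months; the day is clamped to 28 to stay valid."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, min(d.day, 28))


def support_calendar(release: date, standard_months: int = 14,
                     extended_months: int = 12) -> dict:
    """Derive the dates worth putting on the operations calendar."""
    standard_end = add_months(release, standard_months)
    return {
        "standard_support_ends": standard_end,
        "extended_support_ends": add_months(standard_end, extended_months),
        # Illustrative margin: plan the upgrade one quarter before expiry.
        "plan_upgrade_by": add_months(standard_end, -3),
    }


# Hypothetical date on which a minor version became available.
calendar = support_calendar(date(2025, 1, 15))
for label, day in calendar.items():
    print(f"{label}: {day.isoformat()}")
```

The real expiry dates should always come from the EKS release calendar; the sketch only shows how the 14-month and 12-month windows stack.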

The principle of upgrade order #

A K8s upgrade must follow a fixed order.

the standard order of an upgrade
1. control plane -- API server, controller manager, scheduler, etcd
2. data plane (node) -- kubelet, kube-proxy, container runtime
3. add-ons -- VPC CNI, CoreDNS, kube-proxy, EBS CSI, Karpenter, ArgoCD, ...

The reason for this order is K8s’s compatibility policy.

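  • The kubelet must never be newer than the API server. It may lag behind it: by up to three minor versions on recent releases (two before K8s 1.28).
  • kube-proxy must not be newer than the API server either. Keeping it on the same minor as the kubelet is the simplest safe choice.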
  • Add-ons have a compatibility matrix according to the K8s minor version.

Because of this compatibility, the order control plane → data plane is forced. A state where you’ve raised the control plane to 1.32 and the nodes stay at 1.31 for a while is normal (skew 1), but a state where you raise the nodes to 1.32 first and the control plane is 1.31 is not allowed.
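The skew rule can be expressed as a small pre-flight check. A minimal sketch: `kubelet_skew_ok` is a hypothetical helper, it assumes both versions share the same major number, and the `max_older` default of 3 is an assumption based on the upstream skew policy for recent releases, so tighten it to whatever rule your team enforces.

```python
def minor(version: str) -> int:
    """Extract the minor number from a 'major.minor' string such as '1.32'."""
    return int(version.split(".")[1])


def kubelet_skew_ok(api_server: str, kubelet: str, max_older: int = 3) -> bool:
    """True when the kubelet is not newer than the API server
    and at most `max_older` minor versions behind it."""
    skew = minor(api_server) - minor(kubelet)
    return 0 <= skew <= max_older


print(kubelet_skew_ok("1.32", "1.31"))  # control plane first: allowed
print(kubelet_skew_ok("1.31", "1.32"))  # nodes first: not allowed
```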

One minor at a time — the reason skipping is forbidden #

EKS can be upgraded one minor at a time only. A direct jump of 1.30 → 1.32 is impossible; you must go through the two steps 1.30 → 1.31 → 1.32.

The reasons for this constraint are the following two.

  • The safety of etcd migration — etcd’s data format may change in part with each minor. Step-by-step migration is safe.
  • API conversion — each minor has a set of API versions it can convert. Skipping two minors at once can produce objects that can’t be converted.

If a production cluster is near standard support expiry and is two minors behind, you have to finish two upgrades within one quarter. Postpone once and the next quarter becomes double the work.
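The "one minor at a time" rule turns any version gap into a list of separate upgrades. A minimal sketch (`upgrade_path` is a hypothetical helper, not an EKS API) that shows how far behind a cluster really is:

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """List every minor version that has to be applied, in order."""
    cur_major, cur_minor = (int(part) for part in current.split("."))
    tgt_major, tgt_minor = (int(part) for part in target.split("."))
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError(f"cannot plan an upgrade from {current} to {target}")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


print(upgrade_path("1.30", "1.32"))
```

Each entry in the result is a full cycle of control plane, nodes, and add-ons, which is why a two-minor gap doubles the work of a quarter.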

Deprecated API detection — two places, manifest and cluster #

The biggest risk of an upgrade is an old API in the manifest. You have to check both places.

1. Manifest — pluto #

pluto — static analysis of manifests
pluto detect-files -d charts/ --target-versions k8s=v1.32
pluto detect-helm --target-versions k8s=v1.32

It scans all the YAML in charts/ to find APIs deprecated or removed in 1.32. Scanning just the one manifest repo of Chapter 20, GitOps catches all environments’ deprecations at once — one of the operational values of the git single-source model.

2. Cluster — kubent #

kubent — the deprecated actually live on the cluster
kubent --target-version 1.32 --context myshop-prod

kubent (kube-no-trouble) checks the cluster’s actual state. You may have cleaned up the manifests, but an object someone made directly in the console, or an object created by an Operator, may still have a deprecated API.

The combination of the two tools secures a state where deprecated APIs have disappeared from “both my manifest and my cluster.”
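What the static half of this check does can be illustrated with a few lines of Python. This is only a sketch of the idea, not how pluto or kubent are implemented: `REMOVED_IN` is a tiny hand-written sample table (the real tools ship a maintained database), and the parsing is a deliberately naive regex over top-level `apiVersion` and `kind` fields.

```python
import re

# Hand-written sample of removed API versions; NOT exhaustive.
# Value = the Kubernetes (major, minor) in which the API version was removed.
REMOVED_IN = {
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def find_removed(manifest_text: str, target: tuple[int, int]) -> list[tuple[str, str]]:
    """Return (apiVersion, kind) pairs that no longer exist at `target`."""
    hits = []
    for doc in re.split(r"^---[ \t]*$", manifest_text, flags=re.MULTILINE):
        api = re.search(r"^apiVersion:\s*(\S+)", doc, flags=re.MULTILINE)
        kind = re.search(r"^kind:\s*(\S+)", doc, flags=re.MULTILINE)
        if not (api and kind):
            continue
        key = (api.group(1), kind.group(1))
        removed = REMOVED_IN.get(key)
        if removed is not None and removed <= target:
            hits.append(key)
    return hits


SAMPLE = """\
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myshop-api
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myshop-api
"""

print(find_removed(SAMPLE, target=(1, 32)))
```

The same table applied to the live objects of a cluster instead of files in git is, conceptually, the second check.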

3. apiserver metric — the last signal #

real-time tracking of deprecated API calls
apiserver_requested_deprecated_apis

If this metric is not 0, it’s a signal that someone (or some component) is still calling a deprecated API. The Prometheus of Chapter 25, Monitoring · Alerts scrapes this metric, so adding it as a quarterly checkup rule is the standard.

PrometheusRule — alert for deprecated API calls
- alert: K8sDeprecatedApisInUse
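  # The metric is a gauge (set to 1 once a deprecated API has been requested),
  # so compare it directly; rate() over a gauge that stays at 1 would remain 0.
  expr: |
    max by (group, version, resource, removed_release) (apiserver_requested_deprecated_apis) > 0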
  for: 1h
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "A deprecated K8s API is being called"
    runbook_url: "https://runbooks.myshop.example.com/k8s-deprecated"

If this alert is quiet on a normal day, the manifest cleanup for the next upgrade is nearly done.
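For a one-off check without Prometheus, the same metric can be read from the API server's raw metrics output (for example the text returned by `kubectl get --raw /metrics`). The sketch below parses such text; the sample lines are invented for illustration, and `deprecated_apis` is a hypothetical helper.

```python
import re

# Invented sample in Prometheus text exposition format.
SAMPLE_METRICS = '''\
# TYPE apiserver_requested_deprecated_apis gauge
apiserver_requested_deprecated_apis{group="flowcontrol.apiserver.k8s.io",removed_release="1.32",resource="flowschemas",subresource="",version="v1beta3"} 1
apiserver_request_total{code="200",verb="GET"} 1234
'''


def deprecated_apis(metrics_text: str) -> list[dict]:
    """Return the label sets of deprecated APIs that have been requested."""
    results = []
    for line in metrics_text.splitlines():
        if not line.startswith("apiserver_requested_deprecated_apis{"):
            continue
        labels_part, value = line.rsplit(" ", 1)
        if float(value) <= 0:
            continue
        results.append(dict(re.findall(r'(\w+)="([^"]*)"', labels_part)))
    return results


for api in deprecated_apis(SAMPLE_METRICS):
    print(f'{api["group"]}/{api["version"]} {api["resource"]} '
          f'(removed in {api["removed_release"]})')
```

The `removed_release` label is the useful part: it tells you which upgrade turns the warning into a rejected request.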

API-version migration patterns #

The next step is to move each object where a deprecation was found to its new API version.

Converting the manifest directly #

kubectl convert — old API -> new API
kubectl convert -f old.yaml --output-version networking.k8s.io/v1

kubectl convert is no longer built into kubectl: it was deprecated as a built-in command back in K8s 1.14 and is now installed as the separate kubectl-convert plugin. APIs that can be auto-converted are solved in one go by this command.

The API version of a Helm chart #

The API versions written inside a Helm chart’s templates are solved by a new release of the chart itself. A dependency chart (e.g., ingress-nginx, cert-manager) has to be raised to a new version that supports K8s 1.32.

checking Helm charts for outdated versions
helm outdated -n monitoring   # not built into Helm; requires a plugin that provides this command
helm list --all-namespaces

The myshop-api chart of Chapter 22, The App Deployment Skeleton is this book’s standard manifest, so K8s compatibility is guaranteed up to a certain point, but for external charts (e.g., aws-load-balancer-controller, external-secrets, kube-prometheus-stack) you have to check before the upgrade whether the new version supports K8s 1.32.

The API version of an Operator CR #

The CRD itself of Chapter 18, CRD and Operator also has versions (v1alpha1 → v1beta1 → v1). When a new version of the Operator defines a new version of the CRD, you have to migrate old CRs to the new format. A conversion webhook is the tool that automates that conversion, but not every Operator implements it, so checking the upgrade notes of the Operator you’ve introduced in advance is the standard.

EKS’s upgrade flow #

1. Control plane — one line of Terraform #

terraform — cluster_version
module "eks" {
  # ...
  cluster_version = "1.32"   # 1.31 -> 1.32
}

terraform apply triggers the minor upgrade of the EKS control plane. EKS upgrades the control plane with no downtime — it takes about 30 minutes ~ 1 hour. During that time user workloads are unaffected and kubectl keeps working.

That said, some actions briefly pause during that time — API calls like creating new objects and changing scale are temporarily delayed. At the time of the prod control plane upgrade, deliberately stopping deploys is safe.

2. Data plane — two patterns #

EKS node upgrades were pointed at in Chapter 26 as two patterns — in-place rolling and blue-green.

Managed Node Group in-place has EKS automatically run the following cycle.

Managed NG in-place rolling
1. create a new launch template (new AMI)
2. ASG's desired capacity + 1 -> one new node joins
3. cordon one old node -> drain (move Pods)
4. remove the old node from the ASG
5. repeat with the next node

This cycle takes roughly the number of nodes × 5 ~ 10 minutes. One node-group upgrade of a 10-node cluster is about 1 hour ~ 1.5 hours.
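The estimate is simple multiplication, but writing it down helps when sizing the change window. A minimal sketch, assuming nodes are replaced strictly one at a time and taking the 5 ~ 10 minutes per node from the text as the range (`rollout_minutes` is a hypothetical helper):

```python
def rollout_minutes(nodes: int, per_node_min: int = 5,
                    per_node_max: int = 10) -> tuple[int, int]:
    """Best-case and worst-case duration of a one-at-a-time rolling update."""
    return nodes * per_node_min, nodes * per_node_max


low, high = rollout_minutes(10)
print(f"10 nodes: {low} ~ {high} minutes")
```

A drain that waits on a PDB stretches the per-node time, so treat the upper bound as optimistic for workloads with tight budgets.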

Karpenter NodePool drift is the automatic renewal mechanism. When the NodePool’s spec.template or the EC2NodeClass it references changes (e.g., when a new AMI is resolved), Karpenter gradually replaces old nodes with new ones, a mechanism called drift detection. The result is similar to a managed node group’s in-place update, but it proceeds by Karpenter’s decision.

the disruption policy of a Karpenter NodePool
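spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "10%"   # always active: replace at most 10% of nodes at a time
      # A budget with a schedule is active only for `duration` after each cron tick (UTC),
      # so blocking off-hours takes explicit nodes: "0" windows.
      - nodes: "0"     # Mon-Thu 18:00 until 09:00 the next morning
        schedule: "0 18 * * mon-thu"
        duration: 15h
      - nodes: "0"     # Fri 18:00 until Mon 09:00
        schedule: "0 18 * * fri"
        duration: 63h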

budgets is the tool that controls the upgrade’s blast radius. Limiting replacement to at most 10% of nodes at a time, during weekday business hours only, is prod’s standard pattern.
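How a percentage budget translates into a node count can be sketched as below. This assumes the calculation described in Karpenter's disruption documentation as I understand it (the percentage is rounded up, then nodes already being deleted or NotReady are subtracted); treat the formula as an assumption and `allowed_disruptions` as a hypothetical helper.

```python
def allowed_disruptions(total_nodes: int, percent: int,
                        deleting: int = 0, not_ready: int = 0) -> int:
    """Nodes Karpenter may disrupt right now under a percentage budget
    (assumed formula: ceil(total * percent / 100) - deleting - not_ready)."""
    rounded_up = -(-total_nodes * percent // 100)  # integer ceiling
    return max(0, rounded_up - deleting - not_ready)


print(allowed_disruptions(20, 10))               # 20-node pool, 10% budget
print(allowed_disruptions(20, 10, deleting=2))   # budget already used up
```

Because of the rounding up, even a small NodePool gets at least one node of headroom under a 10% budget.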

3. Add-ons — the compatibility matrix #

terraform — the add-on version matrix
cluster_addons = {
  vpc-cni = {
    addon_version = "v1.18.1-eksbuild.3"   # 1.32 compatible
  }
  coredns = {
    addon_version = "v1.11.1-eksbuild.9"
  }
  kube-proxy = {
    addon_version = "v1.32.0-eksbuild.2"
  }
  aws-ebs-csi-driver = {
    addon_version = "v1.30.0-eksbuild.1"
  }
}

AWS’s EKS Addon page organizes the compatible version per K8s minor as a matrix. Even with most_recent = true, EKS automatically picks the compatible latest version, but in an operational environment explicit version pins + manual updates each quarter is safer for change control.

The VPC CNI of Chapter 15, CNI in Depth is the trickiest add-on — it’s responsible for the Pod’s network IP allocation, so a bad upgrade leaves new Pods unable to get an IP. The standard is to run it in dev for a week first and then reflect it into prod.

The safety devices of node drain #

The best-known incident of an upgrade is downtime during drain. Three safety devices must combine to be safe.

1. PodDisruptionBudget (PDB) #

This is the point where the one PDB line built in Chapter 22, The App Deployment Skeleton becomes active.

PDB — the availability floor for drain
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myshop-api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: myshop-api

minAvailable: 2 is the key: on drain, the number of Pods that can be evicted simultaneously is limited to current ready - 2. EKS’s drain respects the PDB, so if a node’s Pods can’t be evicted within 15 minutes, a managed node group update fails with a PodEvictionFailure error (unless the update was forced), and the next step is a human decision.
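The "current ready - 2" rule is worth checking against your replica count before the upgrade, because a PDB that leaves no headroom blocks the drain entirely. A minimal sketch (`pdb_headroom` is a hypothetical helper that mirrors the rule above for an integer minAvailable):

```python
def pdb_headroom(ready_pods: int, min_available: int) -> int:
    """How many Pods may be evicted at once without violating minAvailable."""
    return max(0, ready_pods - min_available)


for replicas in (2, 3, 5):
    print(f"{replicas} ready, minAvailable=2 -> "
          f"{pdb_headroom(replicas, 2)} evictable at once")
```

With 2 replicas and minAvailable: 2 the headroom is 0, so every eviction is refused and the node never empties; the Deployment needs at least 3 replicas for this PDB to let a drain proceed.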

Like the ResourceQuota of Chapter 14, RBAC / NetworkPolicy / ResourceQuota, a PDB is also a kind of declarative guardrail. Write one line in the manifest and it automatically works as the safety line for an upgrade a year later.

2. terminationGracePeriodSeconds #

Deployment — the graceful shutdown time
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # default 30
      containers:
        - name: api
          # ...

It’s the grace period from the SIGTERM the Pod receives on drain to the SIGKILL. The content covered in Chapter 12, Health Checks §“graceful shutdown” carries over here into a real operational viewpoint.

The operational standard is to set 30 ~ 60 seconds as the time for myshop-api to finish in-progress HTTP requests, close DB connections, and wrap up in-flight work. Too short and a 500 response falls on the user; too long and drain itself stalls.

3. preStop hook #

preStop — securing time to force readiness false
containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 10"]

The preStop hook runs before SIGTERM. During the 10 seconds of sleep, the terminating Pod stops being ready and is excluded from the Service’s endpoints, and the new requests that were coming in during that time go to other Pods. The sleep counts against terminationGracePeriodSeconds, so keep it well below that value. It’s a decisive step for separating in-progress requests from new requests.
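Since the preStop sleep and the application's own shutdown share one grace period, the time actually left for the application after SIGTERM can be checked with simple arithmetic. A minimal sketch (`sigterm_window` is a hypothetical helper):

```python
def sigterm_window(grace_seconds: int, pre_stop_sleep: int) -> int:
    """Seconds left between SIGTERM and SIGKILL after the preStop sleep."""
    return max(0, grace_seconds - pre_stop_sleep)


print(sigterm_window(60, 10))  # 60 s grace period, 10 s preStop sleep
print(sigterm_window(30, 30))  # the sleep consumes the whole grace period
```

If the result is shorter than the longest request myshop-api is expected to finish, either raise terminationGracePeriodSeconds or shorten the sleep.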

The danger of --disable-eviction #

don't use this
kubectl drain <node> --disable-eviction --force

This option makes drain delete Pods directly instead of going through the Eviction API, so the PDB is bypassed. It’s an option you must never use in operations. The reason a PDB blocked is usually legitimate, and a forced drain becomes the direct cause of downtime.

Minimizing the blast radius #

The standard pattern for reducing an upgrade’s impact range is the following.

minimizing the blast radius
1. run it in dev for a week
   - upgrade the control plane + nodes + add-ons all
   - observe behavior for a week

2. simulate the prod scenario in staging
   - load testing (k6, locust)
   - one pass including DB connection, external API calls

3. upgrade the prod control plane (almost no workload impact)

4. start from a canary portion of the prod node group
   - make a new node group small and move some workloads
   - observe for a week then gradually expand

5. align all node groups + add-ons at the end

The goal of operations is for this flow to run once per quarter. Not upgrading every environment on the same day at once is the biggest safety line.

Rollback scenarios #

These are the options available if the upgrade fails.

Control plane — downgrade impossible #

A minor downgrade of the EKS control plane is impossible. Once raised to 1.32, you can’t lower it to 1.31. Because of this constraint, validating sufficiently in dev / staging before the control plane upgrade is essential.

If you reach a point where a downgrade is needed, there are two alternatives.

  • Make a new cluster (old version) and migrate the workloads — thanks to the manifest single-source model of Chapter 20, GitOps, applying the same manifest to the new cluster brings the workloads up unchanged. Stateful systems like RDS are left alone.
  • Keep the old version in extended support mode — cost is added, but you buy time.

Nodes — return to the old node group #

blue-green rollback of node groups
1. keep the old version's node group in advance (old launch template)
2. find a problem in the new node group
3. migrate workloads to the old nodes by adjusting the Karpenter NodePool or taints
4. empty the new node group

The pattern of keeping the old node group for a week is the node-level rollback safety line. The key point is that even though the control plane can’t be lowered, the nodes can be returned to the old version.

Add-ons — the explicit version pin #

The rollback of an add-on is the simplest. Just revert Terraform’s addon_version to the old value and apply. That’s why explicit version pins were recommended earlier in this chapter’s §“Add-ons.”

Backup and RPO / RTO #

It’s rare for an upgrade itself to lose data, but to make recovery fast in an incident, backup must be part of the plan too.

the kinds of backup and their RPO / RTO (three on EKS, plus etcd for self-managed K8s)
- RDS PITR
  RPO: typically up to about 5 minutes (transaction-log based)
  RTO: 5 ~ 30 minutes (instance recovery time)

- Velero (S3 backup)
  RPO: with one backup a day, up to 24 hours
  RTO: cluster reconstruction + restore = 1 ~ 4 hours

- EBS Snapshot
  RPO: depends on the snapshot cadence (usually once a day)
  RTO: new PV provisioning + mount = 10 ~ 30 minutes

- etcd snapshot (self-managed K8s only)
  RPO: the snapshot cadence (usually hourly)
  RTO: K8s control plane restore = 1 ~ 2 hours

In EKS the control plane’s etcd is managed, so the user doesn’t need to back it up directly. So this chapter’s backup targets are the three axes RDS + EBS + Velero. The quarterly recovery drill pointed at in Chapter 26, The Operations Checklist §“Backup and recovery” is bound directly to this section’s RTO validation.

The standard is to confirm right before an upgrade that all three of the following backups are current.

backup checkup right before an upgrade
1. RDS — the currency of automatic snapshots, a manual snapshot recommended on top
2. Velero — the time of the last backup, the success/failure state
3. EBS Snapshot — a manual snapshot of important PVs

The upgrade checklist #

This is the standard checklist for the week before / the day of / the week after the upgrade.

The week before #

one week before the upgrade
[Preparation]
- review the release notes (deprecated / removed API, breaking changes)
- clean up deprecated APIs with pluto + kubent
- confirm the apiserver_requested_deprecated_apis metric is 0
- confirm new-version compatibility of external charts (LB Controller, ESO, kube-prometheus-stack, etc.)
- check the Operator's CRD migration notes

[Validation]
- upgrade the dev cluster + run it for a week
- staging load testing
- confirm all alert rules still work on the new version

[Materials]
- confirm RDS / Velero / EBS backups are current
- confirm the old node group's launch template is preserved
- change notice (user / internal announcement)
- prepare the Slack incident channel

The day of #

the day of the upgrade
[Order]
1. change freeze (suspend deploys)
2. a last manual snapshot (RDS, EBS)
3. control plane upgrade (Terraform)
4. one hour of monitoring (kubectl, alerts, user metrics)
5. canary node group upgrade (10 ~ 20%)
6. one hour of monitoring
7. the rest of the node groups upgrade (PDB + drain budgets)
8. add-on upgrade (vpc-cni last)
9. lift the change freeze

[Observation]
- the state change of Pods (the Running ratio)
- the 5xx ratio (myshop-api's core SLI)
- P95 latency
- the Ready ratio of nodes
- the healthy count of ALB targets
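The two "one hour of monitoring" steps are easier to run when the go / no-go decision is written down as explicit thresholds. A minimal sketch: the metric names, the threshold defaults, and `upgrade_gate` itself are hypothetical and only illustrate comparing the observed values against a pre-upgrade baseline.

```python
def upgrade_gate(baseline: dict, current: dict,
                 max_5xx_increase: float = 0.005,
                 max_p95_ratio: float = 1.2,
                 min_ready_ratio: float = 0.95) -> list[str]:
    """Return the reasons to halt; an empty list means proceed."""
    reasons = []
    if current["ratio_5xx"] - baseline["ratio_5xx"] > max_5xx_increase:
        reasons.append("5xx ratio rose above the allowed increase")
    if current["p95_ms"] > baseline["p95_ms"] * max_p95_ratio:
        reasons.append("P95 latency regressed")
    if current["nodes_ready_ratio"] < min_ready_ratio:
        reasons.append("too few nodes are Ready")
    return reasons


baseline = {"ratio_5xx": 0.001, "p95_ms": 180, "nodes_ready_ratio": 1.0}
after_canary = {"ratio_5xx": 0.002, "p95_ms": 190, "nodes_ready_ratio": 1.0}
print(upgrade_gate(baseline, after_canary) or "proceed to the next step")
```

Agreeing on the thresholds during the week before keeps the day-of decision from becoming a debate in the incident channel.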

The week after #

one week after the upgrade
[Validation]
- alert firing frequency (versus the usual)
- cost change (Karpenter / node price change)
- cleanup of remnant resources (old launch template, old nodes)
- whether new deprecated notices appear (preparation for the next quarter)

[Documentation]
- upgrade retrospective (what went well, incident / improvement points)
- update the runbook (patterns applicable next quarter)
- the expected schedule for the next quarter's upgrade

The goal of the operations team is for this checklist to fit on one page. We put this checklist as the real standard for the quarterly upgrade item in the regular operations calendar of Chapter 26.

Exercises #

  1. Minor-upgrade the K8s version of a dev EKS cluster one step. Follow the items of the §“The week before” checklist from the top down, and record the actual time taken for each item. In one paragraph, organize where the difference between the outputs of pluto and kubent (manifest vs the cluster’s actual state) arises, and add an apiserver_requested_deprecated_apis alert rule to Chapter 25, Monitoring · Alerts.
  2. Run an experiment to measure the effect of myshop-api’s PDB and terminationGracePeriodSeconds. Vary the preStop hook’s sleep time across the three values 0 seconds / 10 seconds / 30 seconds, and measure the difference in the 5xx ratio users receive during a node drain (apply load with k6 while you cordon → drain the node). In one paragraph, organize which combination was optimal and how that reason ties to the readiness probe model of Chapter 12, Health Checks.
  3. Apply the one page of §“The upgrade checklist” to your own production (or learning) cluster. Confirm with git history what the most recent upgrade was, and map the items you missed then or the incident pattern that led to a postmortem onto this chapter’s checklist. Add the next upgrade’s schedule and owner to the regular operations calendar of Chapter 26.

In one line: K8s ships a minor version roughly every four months and EKS standard support is 14 months, so at least one upgrade a year is essential. The order is control plane → data plane → add-ons, you can go only one minor at a time, and cleaning up deprecated APIs is the biggest task. Three tools together catch whether something is deprecated: pluto for manifests, kubent for the cluster, and the apiserver_requested_deprecated_apis metric for real time. The safety devices of drain are PDB + terminationGracePeriodSeconds + preStop hook, and you never use --disable-eviction. Minimizing the blast radius is the flow dev for a week → staging load → prod canary node group → gradual expansion, and you keep in your head the rollback matrix: the control plane can’t be downgraded · nodes can return to the old group · add-ons revert the version pin. The goal of operations is for the checklist for the week before / the day of / the week after the upgrade to fit on one page.

Next chapter — the Part 6 capstone #

With this chapter, the four chapters of Part 5 (Operations · Debugging · Cost) are wrapped up, and all 30 chapters of this book are complete. In the next chapter — the book’s final chapter — we bind into one project how all the tools of those 30 chapters mesh inside one system.

Chapter 31, Deploying a Fullstack App on EKS deploys the Next.js app of React and the FastAPI app of Modern Python together on one EKS cluster. From the cluster setup of Terraform + Karpenter + IRSA + ALB Controller + ExternalDNS + cert-manager, to the DB integration of RDS + External Secrets + IRSA RDS IAM auth, the per-environment deployment of Helm + ArgoCD ApplicationSet, the observability of Prometheus + OpenTelemetry, and the k6 load test + OpenCost cost estimate — it organizes the work into 13 PRs. How all the tools of Chapters 1 ~ 30 mesh inside one system can be confirmed in this capstone.

Finally, Appendix A — From docker-compose to k8s serves as a migration guide for the entry-level reader and closes the book.
