K8s Practice #6: Operations Checklist — Upgrades / Backup,Recovery / Cost / Security

Sunday, May 10, 2026

This is the last post in the K8s Practice series. Posts #1 through #5 automated one end-to-end flow for myshop-api: cluster setup, deployment, DB, CI/CD, and monitoring. At this point the cluster is running well, but operating it safely for a year is a different kind of work. Kubernetes ships roughly three minor versions a year, AWS regularly releases new instance types, and RDS has periodic maintenance windows. This post organizes that recurring operational cycle along four dimensions: EKS upgrades, backup/recovery, cost, and security. As the last post in the series, it also includes a retrospective of the six Practice posts and the full 26-post K8s track.

The K8s Practice series has six posts.

#1 EKS Cluster Setup — Terraform / eksctl / IRSA / Addons
#2 App deployment skeleton — Deployment / Service / Ingress / Helm
#3 DB integration — RDS / Secrets Manager / External Secrets / connection pool
#4 CI/CD pipeline — GitHub Actions / ECR / ArgoCD
#5 Monitoring/alarming — Prometheus / CloudWatch / Alertmanager
#6 Operations checklist — upgrades / backup,recovery / cost / security ← this post

EKS upgrade — at least one minor version a year #

Kubernetes’ own version policy is clear.

Minor version (1.30 → 1.31 → 1.32) — released about every 4 months
Each minor version’s support period — about 14 months from release
EKS support period — standard support 14 months + extended support additional 12 months (paid)

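To keep the cluster within standard support, one minor upgrade per year is the bare minimum. Because a new minor version arrives about every four months and EKS upgrades the control plane one minor version at a time, keeping pace over the long run takes closer to three upgrades a year. Managing this through a quarterly upgrade calendar is standard practice.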
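As a rough illustration of the arithmetic behind that calendar, the sketch below computes when standard and extended support would end for a given EKS release date, using the 14-month and 12-month figures quoted above. The release date is a made-up placeholder and the helper names are hypothetical; real dates come from the official EKS release calendar.

```python
import calendar
from datetime import date

STANDARD_MONTHS = 14  # standard support length quoted in this post
EXTENDED_MONTHS = 12  # paid extended support length quoted in this post


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's end."""
    index = d.year * 12 + (d.month - 1) + months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def support_window(eks_release: date) -> dict:
    """Return the end dates of standard and extended support."""
    standard_end = add_months(eks_release, STANDARD_MONTHS)
    extended_end = add_months(standard_end, EXTENDED_MONTHS)
    return {"standard_end": standard_end, "extended_end": extended_end}


# Placeholder release date, for illustration only.
window = support_window(date(2025, 1, 31))
print(window["standard_end"], window["extended_end"])
```

Feeding the release date of every version running in dev and prod into a helper like this gives the deadlines that the quarterly calendar has to beat.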

Standard upgrade flow #

One cycle of EKS minor upgrade

1. Review release notes — new version's deprecations, removed APIs
2. Upgrade dev cluster first
3. Run dev for one to two weeks
4. Clean up deprecated APIs in manifests
5. Run upgrade check tools (pluto, kubent)
6. Upgrade prod control plane
7. Upgrade prod node groups (rolling)
8. Upgrade addons (vpc-cni, coredns, kube-proxy, ebs-csi)

The most time-consuming step is step 4 — cleaning up deprecated APIs. Since Kubernetes removes some APIs in each minor version, manifests that reference old APIs will be rejected by the upgraded cluster.

Check tools — pluto and kubent #

pluto — check deprecated APIs in manifests

pluto detect-files -d charts/ --target-versions k8s=v1.31

kubent — check deprecated APIs alive in cluster

kubent --target-version 1.31

Together, the two tools catch deprecated APIs in both manifests and the live cluster. This check is part of the standard upgrade procedure for production clusters.
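To make the idea behind these tools concrete, here is a minimal sketch of the same kind of check: it scans manifest text for apiVersion/kind pairs that are gone by a target version. It is not how pluto or kubent are implemented. The removal table is a small, non-exhaustive sample, the parsing is a plain regex over unquoted top-level fields, and the function names are hypothetical.

```python
import re

# Small, non-exhaustive sample of removed APIs; the real tools ship full lists.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): "1.22",
    ("networking.k8s.io/v1beta1", "Ingress"): "1.22",
    ("batch/v1beta1", "CronJob"): "1.25",
    ("policy/v1beta1", "PodDisruptionBudget"): "1.25",
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): "1.26",
}


def minor(version: str) -> tuple:
    """Turn '1.31' or 'v1.31' into a comparable (major, minor) tuple."""
    major, minor_part = version.lstrip("v").split(".")[:2]
    return int(major), int(minor_part)


def find_removed_apis(manifest_text: str, target: str) -> list:
    """Return (apiVersion, kind, removed_in) for objects the target rejects."""
    findings = []
    for doc in re.split(r"^---\s*$", manifest_text, flags=re.MULTILINE):
        api = re.search(r"^apiVersion:\s*(\S+)", doc, flags=re.MULTILINE)
        kind = re.search(r"^kind:\s*(\S+)", doc, flags=re.MULTILINE)
        if not (api and kind):
            continue
        key = (api.group(1), kind.group(1))
        removed = REMOVED_IN.get(key)
        if removed and minor(target) >= minor(removed):
            findings.append((key[0], key[1], removed))
    return findings


sample = """\
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myshop-api
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myshop-api
"""
print(find_removed_apis(sample, "1.31"))
```

The sample flags the PodDisruptionBudget because policy/v1beta1 was removed in 1.25, while the apps/v1 Deployment passes.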

Control plane upgrade — one Terraform line #

terraform — change cluster_version then apply

module "eks" {
  # ...
  cluster_version = "1.31"   # 1.30 → 1.31
}

terraform apply triggers the EKS control plane minor upgrade. EKS upgrades the control plane with zero downtime — user workloads are unaffected, and kubectl continues to work. The process takes roughly 30 minutes to 1 hour.

Node group upgrade — Rolling vs Blue-green #

There are two patterns of node group upgrade.

Pattern	Model
In-place rolling	Replace nodes in the same Managed Node Group one by one with the new AMI
Blue-green	Create a new Managed Node Group, migrate workloads, then delete the old group

EKS Managed Node Groups default to in-place rolling. EKS automatically runs the cycle of bring up new node → cordon old node → drain Pods (they reschedule onto the new node) → terminate old node. The PodDisruptionBudget created in #2 is critical here: without a PDB, all Pods of the same workload could be evicted at the same time, causing downtime from the user’s perspective.
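The effect of a PDB during a drain can be sketched with the budget arithmetic alone: the number of evictions allowed at any moment is the count of healthy Pods minus minAvailable. The function name below is hypothetical, and the model ignores maxUnavailable and percentage values.

```python
from typing import Optional


def evictable_now(pods_on_node: int, healthy_pods: int,
                  min_available: Optional[int]) -> int:
    """How many Pods of one workload a node drain may evict right now."""
    if min_available is None:
        # No PDB: nothing stops the drain from evicting every Pod at once.
        return pods_on_node
    budget = max(0, healthy_pods - min_available)
    return min(pods_on_node, budget)


# Three replicas that all landed on the node being drained.
print(evictable_now(pods_on_node=3, healthy_pods=3, min_available=None))  # 3
print(evictable_now(pods_on_node=3, healthy_pods=3, min_available=2))     # 1
```

With minAvailable set to 2, the drain has to wait for each evicted Pod to become healthy elsewhere before it can evict the next one.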

Trigger node group upgrade (AWS CLI)

# After control plane is upgraded to 1.31
aws eks update-nodegroup-version \
  --cluster-name myshop-prod \
  --nodegroup-name general \
  --region ap-northeast-2

When the blast radius of an upgrade is a concern in large clusters, the blue-green pattern is safer: create a new node group, migrate workloads gradually by cordoning and draining the old nodes in batches (or, for Karpenter-managed nodes, through the disruption controls covered in Advanced #4), then delete the old group.

Addon upgrade #

After the control plane and nodes are on the new version, upgrade the addons as well.

terraform — addon version update

cluster_addons = {
  vpc-cni = {
    most_recent = true   # or explicit version
  }
  coredns = {
    most_recent = true
  }
  # ...
}

With most_recent = true, Terraform auto-upgrades to the latest addon versions compatible with the K8s version. When explicit version pinning is needed, use the addon_version field.

Backup and recovery — RDS’s PITR is the first line of defense #

myshop’s data is almost entirely in RDS. The Kubernetes cluster itself can be reconstructed from git, but RDS data is very difficult to recover once lost.

Auto snapshots + PITR #

The single line backup_retention_period = 30 in #3’s Terraform created the following:

Daily automated snapshots: taken during the backup_window (03:00-04:00 UTC)
PITR (Point-in-Time Recovery): restore to any second within the last 30 days, up to the latest restorable time (typically within the last five minutes)

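PITR is RDS’s most powerful backup feature. If someone accidentally runs DELETE FROM orders; on prod, you can restore a new RDS instance to the moment just before the incident and compare the data. Restore time grows with database size, so expect tens of minutes or more rather than an instant recovery, and measure it during a drill.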

PITR recovery to 30 minutes ago — as new instance

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier myshop-prod \
  --target-db-instance-identifier myshop-prod-recovery \
  --restore-time 2026-05-11T08:30:00Z \
  --db-subnet-group-name myshop-prod \
  --vpc-security-group-ids sg-xxxxx

This command creates a new instance with data from 30 minutes ago. The standard incident response is to SELECT the undamaged data from that instance and restore it to prod.
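During an incident, the restore timestamp is easy to get wrong because of time zones. As a small sketch, the helper below (a hypothetical name, not part of any AWS SDK) turns a timezone-aware incident time into the UTC ISO 8601 string that --restore-time expects.

```python
from datetime import datetime, timedelta, timezone


def restore_time_before(incident: datetime, margin: timedelta) -> str:
    """Return the UTC timestamp `margin` before the incident, ISO 8601."""
    if incident.tzinfo is None:
        raise ValueError("incident must be timezone-aware")
    target = incident.astimezone(timezone.utc) - margin
    return target.strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical incident at 18:00 KST, which is 09:00 UTC.
kst = timezone(timedelta(hours=9))
incident = datetime(2026, 5, 11, 18, 0, tzinfo=kst)
print(restore_time_before(incident, timedelta(minutes=30)))
```

The result, 2026-05-11T08:30:00Z, is the value used in the command above.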

Recovery drills — quarterly check #

Seeing automated backups run and feeling reassured is very different from being able to actually recover. Simulating a recovery once a quarter is the standard.

Quarterly recovery drill procedure

1. Recover new instance via PITR (command above)
2. Sample-check data in new instance (row count, recent transactions)
3. Change config of dev environment's myshop-api to point at new instance
4. Function test (create virtual order, query)
5. Delete new instance + document results

If any failure surfaces in this drill, it becomes operational priority #1. “Backups exist” is not the standard — “recovery has been verified” is.

Cluster recovery — Velero #

The Kubernetes cluster’s first line of recovery defense is the GitOps repo, but to preserve dynamic state stored in etcd (e.g., user-created PVCs, ConfigMap updates), Velero is the standard tool.

Velero — backup cluster to S3

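# Assumes Velero gets AWS credentials through IRSA (see #1), hence --no-secret.
# Without IRSA, pass --secret-file with a credentials file instead.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket myshop-velero-backups \
  --backup-location-config region=ap-northeast-2 \
  --snapshot-location-config region=ap-northeast-2 \
  --no-secret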

# Daily auto backup
velero schedule create daily --schedule="0 2 * * *" --ttl 720h

Velero periodically backs up Kubernetes objects, read through the API server, to S3, and takes EBS snapshots of persistent volumes alongside them. When a cluster is lost entirely, you can restore that state onto a new cluster with velero restore.

Cost — the area that leaks fastest #

Cluster costs inflate quickly without deliberate attention. Standard cost check items for operational clusters:

1. Combination of Karpenter + Spot #

Karpenter, briefly mentioned in #1, typically delivers the largest cost savings.

Karpenter NodePool — Spot priority

spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: NotIn
          values: ["m5.metal"]   # exclude too-large instances
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

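Karpenter provisions instances that closely fit the pending Pods’ resource requests and prefers Spot capacity when both capacity types are allowed. Savings of 50–70% over On-Demand are typical, and consolidation automatically repacks workloads off underutilized nodes. Spot instances can be reclaimed with a two-minute notice, so this setup suits workloads that tolerate interruption.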
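To see how a Spot share turns into a savings percentage, a back-of-the-envelope sketch follows. The hourly price, node count, Spot share, and discount are placeholders, not real AWS prices, and the function name is hypothetical.

```python
def blended_hourly_cost(on_demand_hourly: float, node_count: int,
                        spot_fraction: float, spot_discount: float) -> float:
    """Hourly cost of a fleet where part of the nodes run on Spot."""
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    spot_nodes = node_count * spot_fraction
    on_demand_nodes = node_count - spot_nodes
    spot_hourly = on_demand_hourly * (1.0 - spot_discount)
    return on_demand_nodes * on_demand_hourly + spot_nodes * spot_hourly


# Placeholder numbers: 10 nodes at 0.20 USD/h, 80% on Spot, 60% Spot discount.
baseline = blended_hourly_cost(0.20, 10, 0.0, 0.0)
blended = blended_hourly_cost(0.20, 10, 0.8, 0.6)
print(f"savings: {1 - blended / baseline:.0%}")
```

With these placeholder inputs the fleet-wide saving is 48%, lower than the Spot discount itself because the On-Demand nodes still pay full price.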

2. RDS instance right-sizing #

RDS cost is the sum of instance class + storage + IOPS. Since the instance class accounts for the largest share, a quarterly right-sizing review is standard.

Signals seen in RDS Performance Insights

- Average CPU utilization below 30% → consider one class smaller
- max_connections utilization below 50% → consider both pool size / instance
- Storage IOPS max below 50% of baseline → consider downsizing to gp3

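Because RDS is a managed service, changing the instance class is a single modify operation, but it restarts the instance, so schedule it in a maintenance window (with Multi-AZ the interruption is typically limited to a failover). Quarterly right-sizing is consistently one of the highest-impact cost-saving levers.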
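The three signals above can be written down as a small rule function so that the quarterly review is mechanical rather than ad hoc. This is only a sketch: the thresholds are the ones listed in this post, while the field and function names are hypothetical and the inputs would come from whatever metrics export is already in place.

```python
from dataclasses import dataclass


@dataclass
class RdsUsage:
    avg_cpu_percent: float         # average CPU utilization over the quarter
    connections_used_ratio: float  # peak connections / max_connections
    iops_used_ratio: float         # peak IOPS / baseline IOPS


def right_sizing_hints(usage: RdsUsage) -> list:
    """Apply the three review thresholds and return the matching hints."""
    hints = []
    if usage.avg_cpu_percent < 30:
        hints.append("consider one instance class smaller")
    if usage.connections_used_ratio < 0.5:
        hints.append("review pool size and instance class together")
    if usage.iops_used_ratio < 0.5:
        hints.append("consider moving storage to gp3")
    return hints


print(right_sizing_hints(RdsUsage(22.0, 0.35, 0.8)))
```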

3. NAT Gateway data transfer #

The NAT Gateway cost mentioned in #1 consists of an hourly charge plus a per-GB data transfer fee. Every time private subnet workloads communicate externally the cost accumulates, and it surprisingly often becomes a large line item.

NAT data transfer reduction patterns

- Adopt VPC Endpoint — AWS services like S3 / ECR / DynamoDB bypass NAT via VPC Endpoint
- RDS in the same VPC never goes through NAT; for RDS in another VPC, use VPC peering or Transit Gateway so traffic stays on private routing
- If frequently calling external APIs, adopt a caching layer

It is common for a $100–$500/month NAT cost to drop to less than half after adopting VPC Endpoints.
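The shape of the NAT bill can be sketched as a fixed hourly part plus a per-GB part. The rates below are placeholders, not actual prices (they differ by region), and the function name is hypothetical. Gateway endpoints for S3 and DynamoDB carry no charge, while interface endpoints such as ECR have their own hourly and per-GB fees that this sketch leaves out.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly billing estimates


def nat_monthly_cost(gateways: int, gb_processed: float,
                     hourly_rate: float, per_gb_rate: float) -> float:
    """Monthly NAT Gateway cost: fixed hourly charge plus data processing."""
    fixed = gateways * HOURS_PER_MONTH * hourly_rate
    variable = gb_processed * per_gb_rate
    return fixed + variable


# Placeholder inputs: 3 gateways (one per AZ), 4000 GB per month through NAT.
before = nat_monthly_cost(3, 4000, hourly_rate=0.05, per_gb_rate=0.05)
# Suppose VPC Endpoints take 70% of that traffic off the NAT path.
after = nat_monthly_cost(3, 4000 * 0.3, hourly_rate=0.05, per_gb_rate=0.05)
print(f"before: {before:.2f} USD, after: {after:.2f} USD")
```

The fixed hourly part stays the same, so the saving depends entirely on how much traffic the endpoints absorb.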

4. EBS snapshots and unused resources #

Unused resources to check periodically

- Old manual snapshots (EBS snapshots and manual RDS snapshots, which the automated backup retention period never deletes)
- Old AMIs
- Detached EBS volumes (remnants of old node groups)
- Unused ALB / NLB (remnants of old Ingress)
- Old images in ECR (auto-deletion via lifecycle policy recommended)

ECR lifecycle policy is managed in a single Terraform resource.

terraform — ECR lifecycle

resource "aws_ecr_lifecycle_policy" "myshop_api" {
  repository = aws_ecr_repository.myshop_api.name

  policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Keep last 30 production tags"
        selection = {
          tagStatus     = "tagged"
          tagPrefixList = ["v"]
          countType     = "imageCountMoreThan"
          countNumber   = 30
        }
        action = { type = "expire" }
      },
      {
        rulePriority = 2
        description  = "Expire untagged after 7 days"
        selection = {
          tagStatus   = "untagged"
          countType   = "sinceImagePushed"
          countUnit   = "days"
          countNumber = 7
        }
        action = { type = "expire" }
      }
    ]
  })
}
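Before applying a lifecycle policy to a repository that already holds images, it can help to predict what it would expire. The sketch below is a simplified model of the two rules above, not ECR's actual evaluation engine: it keeps the newest 30 images that have a tag starting with "v" and expires untagged images pushed more than 7 days ago. The Image class and function name are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Image:
    digest: str
    pushed_at: datetime
    tags: list = field(default_factory=list)


def expired_images(images: list, now: datetime, keep_tagged: int = 30,
                   untagged_days: int = 7, prefix: str = "v") -> set:
    """Return the digests the two lifecycle rules would expire."""
    expired = set()
    # Rule 1: among images with a matching tag, keep only the newest N.
    tagged = [i for i in images if any(t.startswith(prefix) for t in i.tags)]
    tagged.sort(key=lambda i: i.pushed_at, reverse=True)
    expired.update(i.digest for i in tagged[keep_tagged:])
    # Rule 2: untagged images older than the cutoff.
    cutoff = now - timedelta(days=untagged_days)
    expired.update(i.digest for i in images
                   if not i.tags and i.pushed_at < cutoff)
    return expired


now = datetime(2026, 5, 10, tzinfo=timezone.utc)
images = [Image(f"sha256:t{n}", now - timedelta(days=n), [f"v1.0.{n}"])
          for n in range(32)]
images.append(Image("sha256:u-old", now - timedelta(days=10)))
images.append(Image("sha256:u-new", now - timedelta(days=1)))
print(sorted(expired_images(images, now)))
```

In this sample the two oldest tagged images and the ten-day-old untagged image are expired, while everything else is kept.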

5. Cost visualization — Kubecost / OpenCost #

OpenCost (open source) install

helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm install opencost opencost/opencost \
  -n opencost --create-namespace

OpenCost combines Prometheus metrics with cloud cost APIs to display per-namespace and per-workload costs. Questions like “how much does the myshop namespace spend per month, and which workload is the most expensive?” become answerable at a glance. In environments where cost responsibility is distributed by team, OpenCost is close to the standard.

Security — standard items of periodic checks #

Cluster security is not a one-time setup but an accumulation of periodic checks. Four standard items:

1. CIS Benchmark — kube-bench #

kube-bench — CIS Benchmark check

kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job-eks.yaml
kubectl logs -l job-name=kube-bench

kube-bench automatically checks items from the CIS Kubernetes Benchmark. Because the EKS control plane is managed, the checks are limited to nodes and workload manifests. Running it quarterly and addressing FAIL items is standard.

2. Container image scanning — Trivy #

.github/workflows/build.yml — adding image scan step

- name: Trivy image scan
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myshop-api:${{ steps.meta.outputs.tag }}
    format: sarif
    severity: CRITICAL,HIGH
    exit-code: 1   # build fails on CRITICAL or HIGH findings
    output: trivy-results.sarif

- name: Upload SARIF to GitHub Security tab
  if: always()   # upload results even when the scan step fails the build
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: trivy-results.sarif

Trivy scans base image OS packages and application dependencies for known CVEs. With the settings above, a CRITICAL or HIGH finding fails the build; as long as the scan step runs before the push step, the image never reaches ECR. ECR Enhanced Scanning (paid) is a managed alternative covering the same ground.
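For the weekly scan review, a JSON report is easier to post-process than SARIF. The sketch below assumes a report produced with Trivy's JSON output format and reads only the Results, Vulnerabilities, VulnerabilityID, and Severity fields. The CVE IDs in the sample are placeholders, and the function name is hypothetical.

```python
import json


def blocking_findings(report_json: str,
                      blocking: tuple = ("CRITICAL", "HIGH")) -> list:
    """Return (vulnerability id, severity) pairs that should block a release."""
    report = json.loads(report_json)
    findings = []
    for result in report.get("Results") or []:
        # Targets without findings may omit the key or carry null.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocking:
                findings.append((vuln.get("VulnerabilityID"),
                                 vuln.get("Severity")))
    return findings


# Placeholder report with made-up IDs, for illustration only.
sample = json.dumps({
    "Results": [
        {"Target": "os-packages", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
            {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
        ]},
        {"Target": "app-dependencies", "Vulnerabilities": None},
    ]
})
print(blocking_findings(sample))
```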

3. RBAC permission audit #

Permission usage check

# See all ClusterRoleBindings
kubectl get clusterrolebindings -o wide

# List everything a specific ServiceAccount is allowed to do
kubectl auth can-i --list --as=system:serviceaccount:myshop:myshop-api

# External tools: krane (Appvia)
# or rbac-tool by InsightCloudSec

Auditing RBAC quarterly and cleaning up “unused permissions,” “excessive permissions,” and “stale ServiceAccounts” is standard. The kubectl auth can-i command covered in Advanced #2 is the everyday check tool.
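One audit question that kubectl does not answer directly is "who holds cluster-admin?". As a sketch, the function below takes the JSON printed by kubectl get clusterrolebindings -o json and lists every subject bound to the cluster-admin ClusterRole. The sample bindings and the function name are hypothetical.

```python
import json


def cluster_admin_subjects(bindings_json: str) -> list:
    """Return (binding, kind, name, namespace) for cluster-admin subjects."""
    data = json.loads(bindings_json)
    found = []
    for item in data.get("items", []):
        if item.get("roleRef", {}).get("name") != "cluster-admin":
            continue
        # A binding may exist with no subjects at all.
        for subject in item.get("subjects") or []:
            found.append((item["metadata"]["name"], subject.get("kind"),
                          subject.get("name"), subject.get("namespace")))
    return found


# Hypothetical bindings shaped like the kubectl JSON output.
sample = json.dumps({"items": [
    {"metadata": {"name": "cluster-admin"},
     "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
     "subjects": [{"kind": "Group", "name": "system:masters"}]},
    {"metadata": {"name": "ci-deployer-admin"},
     "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
     "subjects": [{"kind": "ServiceAccount", "name": "ci-deployer",
                   "namespace": "ci"}]},
    {"metadata": {"name": "view-only"},
     "roleRef": {"kind": "ClusterRole", "name": "view"},
     "subjects": [{"kind": "Group", "name": "developers"}]},
]})
for row in cluster_admin_subjects(sample):
    print(row)
```

Any ServiceAccount that shows up in this list is a good first candidate for the "excessive permissions" cleanup.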

4. Policy engine — Kyverno enforces admission #

Adopting Kyverno, as covered in Advanced #3, enforces policies like the following at the admission stage.

kyverno-policies.yaml — operational standard policy set

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-from-ecr
spec:
  validationFailureAction: Enforce
  rules:
    - name: ecr-only
      match:
        resources:
          kinds: [Pod]
      validate:
        message: "Images must come from our ECR registry."
        pattern:
          spec:
            containers:
              - image: "123456789012.dkr.ecr.*"

Five to ten such policies form the standard guardrails of a production cluster. Even if someone accidentally deploys an external image or a container without resource limits, it is rejected at the admission stage.

Periodic operations calendar #

Arranged into a calendar, the items above look like this:

Cycle	Work
Daily	Alert review, Grafana dashboard check
Weekly	New security patch review, ECR Trivy scan result review
Monthly	Cost review (OpenCost), unused resource cleanup, SLI/SLO report
Quarterly	EKS minor upgrade, RDS right-sizing, RBAC audit, recovery drill, kube-bench
Half-yearly	Comprehensive security check (external audit / pentest), DR simulation
Yearly	Cluster architecture review, manifest modernization

Having this calendar on a single page with assignees defined is the goal for an operational team. The most dangerous pattern is “it’s running fine, so I’ll leave it alone” after the initial setup.

Series retrospective: what the six K8s Practice posts put in your hands #

Since this is the last post, here is a look back at all six.

#1: EKS cluster setup. VPC, EKS, node groups, IRSA, and standard addons in one Terraform codebase. eksctl comparison, Karpenter preview.
#2: App deployment skeleton. The standard 9-piece bundle, including Deployment + Service + Ingress + ConfigMap + Secret + HPA + PDB. ALB auto-provisioning via AWS Load Balancer Controller. Per-environment values for dev / prod via Helm chart.
#3: DB integration. RDS Terraform, Secrets Manager, secret syncing via External Secrets Operator. PgBouncer connection pool. Helm hook-based migration Job. The more advanced pattern of RDS IAM authentication.
#4: CI/CD pipeline. GitHub Actions OIDC for ECR push without static keys. Manifest repo auto commit. ArgoCD App of Apps. Argo Rollouts canary.
#5: Monitoring/alarming. kube-prometheus-stack installed in one step. ServiceMonitor + PrometheusRule. 4 golden signals. Alertmanager severity branching. Loki + CloudWatch.
#6: Operations checklist. EKS upgrade, PITR backup/recovery, Karpenter Spot cost savings, kube-bench / Trivy / Kyverno security.

The scenario (myshop-api) established in the first post has now been carried through a complete operational lifecycle across six posts. By this point, adopting and running Kubernetes should look like a concrete sequence of tasks rather than an abstract idea.

K8s track full retrospective — 26 posts #

Basics 7 + Intermediate 7 + Advanced 6 + Practice 6 comes to 26 posts, and adding Docker Basics 6 brings it to 32. This number conveys the scale of the K8s track and the surrounding Docker track at a glance.

The big picture of the K8s track

[Basics 7]      Model of one manifest — one flow of kubectl apply
[Intermediate 7] Depth of how that manifest runs in operational clusters
[Advanced 6]    Fine-grained policy / extension / observability / synchronization layered on top
[Practice 6]    One cycle of a real service on EKS — myshop-api

Each track builds on the previous one, and after working through all 26 posts, the next stage comes into focus. At that stage you can:

Read the intent of a manifest at a glance
Make decisions about new cluster setup, extensions, and operations with confidence
Narrow down where to look first when an incident occurs
Treat the periodic cycles of cost, security, and upgrades as an internalized calendar

After that — beyond the K8s track #

This track is not the destination. Deeper topics built on top of Kubernetes remain.

Service Mesh — Istio / Linkerd. mTLS, fine-grained traffic routing, observability mesh.
MLOps on K8s — Kubeflow, KServe, Argo Workflows. Dedicated stack for ML model training, deployment, serving.
Multi-cluster — patterns beyond single-cluster limits. Cluster federation, multi-region, ArgoCD ApplicationSet.
eBPF in depth — territory beyond Cilium. Next generation of security / observability / networking.
eks-anywhere / on-prem K8s — cluster operations beyond managed EKS.

These topics each merit their own series, and completing the 26-post K8s track puts you at the starting line for all of them.

Closing #

This wraps up the 6-post K8s Practice series and the entire 26-post K8s track. This post covered the periodic cycle of running a cluster safely at year scale — EKS minor upgrades, RDS PITR with quarterly recovery drills, Karpenter + Spot for cost savings, and periodic security checks via kube-bench, Trivy, and Kyverno — which together form the standard skeleton of the operations cycle. The destination of this track comes down to a simple sentence: can you confidently say “the cluster is running well”? The point where a single person can naturally connect the model of one manifest to the operational tasks on a quarterly calendar is the final stage of the K8s track.