Chapter 26

Operations Checklist

The last chapter of Part 4 (EKS in Production). Standing up a cluster reliably and operating it safely for a year are different kinds of work. This chapter covers the EKS minor upgrade cycle, the node-group replacement patterns, RDS PITR and quarterly recovery drills, cost control with Karpenter + Spot, and routine security checks with kube-bench · Trivy · Kyverno. It closes with a retrospective on the 6 chapters of Part 4 (Chapters 21 ~ 26) and on all 26 chapters of Parts 1 ~ 4.

From Chapter 21 EKS cluster setup through Chapter 25 Monitoring · alerts, myshop-api went from an empty cluster to one where deployment · DB · CI/CD · observability are all automated. The cluster runs well at this point, but operating it safely year after year is a different kind of work. K8s ships about 3 minor versions a year, AWS releases new instance types every quarter, and RDS has regularly scheduled maintenance windows. This chapter organizes that regular operations cycle: upgrade, backup · recovery, cost, and security.

The chapter ends with a retrospective on Part 4 and on this book’s Parts 1 ~ 4 (Chapters 1 ~ 26). After that, the book moves on to the full operational scope of Part 5 (Operations · Debugging · Cost) and Part 6 (Capstone).

EKS upgrade — at least one minor version a year #

K8s’s own version policy is clear.

  • Minor versions (1.30 → 1.31 → 1.32) — released on roughly a 4-month cycle.
  • The support period of each minor version — about 14 months after release.
  • EKS’s support period — 14 months of standard support + an additional 12 months of extended support (paid).

To keep a production cluster within standard support, at least one minor upgrade a year is needed. Managing it with a quarterly update calendar is standard.
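The calendar arithmetic behind that rule can be sketched in a few lines of shell. This is a minimal sketch: add_months is a hypothetical helper, and the release month passed to it is an illustrative value, not an actual release date.

```shell
#!/bin/sh
# Sketch: add a number of months to a YYYY-MM month.
# 14 is the standard-support window (in months) quoted above.
add_months() (
  year=${1%-*}        # "2024-09" -> "2024"
  month=${1#*-}       # "2024-09" -> "09"
  month=${month#0}    # drop a leading zero so the shell does not read it as octal
  total=$(( year * 12 + (month - 1) + $2 ))
  printf '%04d-%02d\n' $(( total / 12 )) $(( total % 12 + 1 ))
)

# Hypothetical release month, for illustration only.
add_months 2024-09 14   # prints 2025-11
```

Putting the resulting month into the quarterly calendar, one quarter early, leaves room for the dev soak period before standard support ends.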

The standard flow of an upgrade #

One cycle of an EKS minor upgrade
1. Review the release notes — the new version's deprecations, removed APIs
2. Upgrade the dev cluster first
3. Run it in dev for 1 ~ 2 weeks
4. Clean up deprecated APIs in the manifests
5. Run upgrade-check tools (pluto, kubent)
6. Upgrade the prod control plane
7. Upgrade the prod node group (rolling)
8. Upgrade the add-ons (vpc-cni, coredns, kube-proxy, ebs-csi)

The step that usually takes the most time is step 4, cleaning up deprecated APIs. K8s removes some APIs in minor releases, so manifests that still use a removed API are rejected by the upgraded cluster.

Check tools — pluto and kubent #

pluto — check deprecated APIs in manifests
pluto detect-files -d charts/ --target-versions k8s=v1.31.0
kubent — check deprecated APIs alive in the cluster
kubent --target-version 1.31.0

The two tools combine to find deprecated APIs “in both my manifests and my cluster.” Thanks to the git single-source model of Chapter 20 GitOps, if pluto scans just the one manifest repo, the deprecated APIs of every environment are caught at once.
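To get a feel for what such a scan does, here is a deliberately naive sketch that greps a manifest directory for a handful of apiVersions that were removed by Kubernetes 1.26. The function name scan_removed_apis and the short list are assumptions for illustration; the list is incomplete, and pluto and kubent remain the real check.

```shell
#!/bin/sh
# Naive pre-scan: list manifest files that still use apiVersions removed by Kubernetes 1.26.
# The list is intentionally short and incomplete.
REMOVED_APIS='policy/v1beta1
batch/v1beta1
autoscaling/v2beta1
autoscaling/v2beta2'

scan_removed_apis() (
  dir=$1
  printf '%s\n' "$REMOVED_APIS" | while IFS= read -r api; do
    # -r: recurse, -l: print file names only, -E: extended regular expressions
    grep -rlE "^apiVersion:[[:space:]]*${api}[[:space:]]*\$" "$dir" 2>/dev/null
  done | sort -u
)

# Usage (assumes the charts/ directory from the pluto example):
# scan_removed_apis charts/
```

The sketch only matches top-level apiVersion lines in plain YAML files, so it misses templated Helm values and objects already stored in the cluster, which is exactly the gap the two tools cover.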

Control plane upgrade — one Terraform line #

terraform — just change cluster_version and apply
module "eks" {
  # ...
  cluster_version = "1.31"   # 1.30 -> 1.31
}

terraform apply triggers the minor upgrade of the EKS control plane. EKS upgrades the control plane without downtime: user workloads keep running and kubectl keeps working. It takes about 30 minutes ~ 1 hour. Bumping the cluster_version line by one in the Terraform configuration from Chapter 21 is the starting point of operations.
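EKS moves the control plane one minor version at a time, so it can be worth guarding the apply with a small check. The sketch below assumes you pass in the current and target versions yourself; is_single_minor_step is a hypothetical helper name.

```shell
#!/bin/sh
# Succeeds only when the target is exactly one minor version above the current one.
is_single_minor_step() (
  cur_major=${1%%.*}; cur_minor=${1#*.}   # "1.30" -> "1" and "30"
  new_major=${2%%.*}; new_minor=${2#*.}
  [ "$cur_major" = "$new_major" ] && [ $(( new_minor - cur_minor )) -eq 1 ]
)

if is_single_minor_step 1.30 1.31; then
  echo "ok: 1.30 -> 1.31 is a single minor step"
fi
```

A check like this fits in a CI job that runs before terraform plan, so a jump such as 1.30 to 1.32 is caught before it reaches AWS.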

Node group upgrade — Rolling vs Blue-green #

There are two patterns for upgrading a node group.

Pattern            Model
In-place rolling   Replace the nodes in the same Managed Node Group one at a time with the new AMI
Blue-green         Create a new Managed Node Group, migrate the workloads, then delete the old group

The default for an EKS Managed Node Group is in-place rolling. EKS automatically runs the cycle of bring up a new node → cordon the old node → drain its Pods (they are rescheduled onto the new nodes) → delete the old node. The PodDisruptionBudget created in Chapter 22 App deployment skeleton is decisive at this point. Without a PDB, the Pods of the same workload can all go down at once, which users experience as downtime. It is one line in one manifest from Chapter 22, but it works as the safety line that prevents downtime in an upgrade a year later.
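How much room a PDB leaves for a drain is simple arithmetic: the number of healthy replicas minus minAvailable. The sketch below assumes a PDB expressed with an absolute minAvailable; pdb_allowed_disruptions is a hypothetical helper, not a kubectl feature.

```shell
#!/bin/sh
# pdb_allowed_disruptions HEALTHY_REPLICAS MIN_AVAILABLE
# allowed disruptions = healthy replicas - minAvailable, never below zero
pdb_allowed_disruptions() (
  allowed=$(( $1 - $2 ))
  [ "$allowed" -lt 0 ] && allowed=0
  echo "$allowed"
)

pdb_allowed_disruptions 3 2   # prints 1: one Pod at a time may be evicted
pdb_allowed_disruptions 1 1   # prints 0: evicting this Pod is blocked
```

The second case is the one to look for before an upgrade: a single-replica workload with minAvailable: 1 protects nothing and can stall the node drain, so it is usually better to raise the replica count first.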

Trigger the node group upgrade
# after the control plane has gone up to 1.31
aws eks update-nodegroup-version \
  --cluster-name myshop-prod \
  --nodegroup-name general \
  --region ap-northeast-2

If you’re worried about the impact of the upgrade on a large cluster, the blue-green pattern is safer — it creates a new node group, gradually migrates the workloads with the Karpenter disruption control of Chapter 13 Autoscaling, then empties the old group.

Add-on upgrade #

After the control plane and nodes have become the new version, the add-ons go up together too.

terraform — update the add-on versions
cluster_addons = {
  vpc-cni = {
    most_recent = true   # or an explicit version
  }
  coredns = {
    most_recent = true
  }
  # ...
}

If you set most_recent = true, Terraform automatically upgrades to the latest add-on version compatible with the K8s version. When you need an explicit version pin, use the addon_version field. The VPC CNI of Chapter 15 CNI in depth and the IRSA OIDC provider of Chapter 16 RBAC / ServiceAccount in depth are both refreshed together with this cycle.

Backup and recovery — RDS’s PITR is the first line of defense #

Almost all of myshop’s data is in RDS. The K8s cluster itself can be reconstructed even if it collapses, since the manifests are in git, but RDS data, once lost, is hard to recover. Thanks to the git single-source model of Chapter 20 GitOps, the cluster’s stateless part is essentially reproducible, and the real risk of operations concentrates on stateful systems.

Automatic snapshots + PITR #

The one line backup_retention_period = 30 written in the Terraform of Chapter 23 DB integration produced the following.

  • Daily automatic snapshots — the automatic backup runs at the backup_window (03:00 ~ 04:00 UTC) time.
  • PITR (Point-in-Time Recovery) — recovery to an arbitrary moment within the past 30 days, at 1-second granularity, is possible.

PITR is RDS’s most powerful backup feature. When a user accidentally runs DELETE FROM orders; in prod, you can restore a new RDS instance to the moment just before that and compare the data. The restore always creates a new instance, and how long it takes depends on the size of the database.

PITR recovery to a point 30 minutes ago — to a new instance
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier myshop-prod \
  --target-db-instance-identifier myshop-prod-recovery \
  --restore-time 2026-05-21T08:30:00Z \
  --db-subnet-group-name myshop-prod \
  --vpc-security-group-ids sg-xxxxx

This command creates a new instance with the data as of 30 minutes ago. The standard incident response is to SELECT the undamaged data from that instance and restore it into prod. Chapter 30 Upgrade strategy §“Backup and RPO / RTO” revisits this together with the pre-upgrade backup check.
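During an incident, computing the --restore-time value by hand is error-prone. A minimal sketch, assuming GNU date (the -d "@epoch" form) and a hypothetical helper named iso_minutes_before:

```shell
#!/bin/sh
# iso_minutes_before EPOCH_SECONDS MINUTES
# Prints a UTC timestamp in the YYYY-MM-DDTHH:MM:SSZ form used by --restore-time.
iso_minutes_before() {
  date -u -d "@$(( $1 - $2 * 60 ))" '+%Y-%m-%dT%H:%M:%SZ'
}

# 1779354000 is 2026-05-21T09:00:00Z, chosen to match the example above.
iso_minutes_before 1779354000 30   # prints 2026-05-21T08:30:00Z
```

In practice the first argument would be "$(date +%s)", and the output would be passed as --restore-time "$(iso_minutes_before "$(date +%s)" 30)".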

Recovery drills — a quarterly check #

Feeling reassured because automatic backups are running is different from recovery actually working. The standard is to rehearse a recovery once a quarter.

Quarterly recovery drill procedure
1. Recover a new instance with PITR (the command above)
2. Sample-check the new instance's data (row count, recent transactions)
3. Change the config so the dev environment's myshop-api points to the new instance
4. Function test (create a fictional order, query)
5. Delete the new instance + document the result

If this drill produces even one failure, treat fixing it as operational priority number one. The operational standard is not “there is a backup” but “recovery is verified.”

Cluster recovery — Velero #

The first line of defense for recovering the K8s cluster itself is the GitOps repo, but Velero is the standard tool for also preserving the dynamic state that lives only in the cluster (user-created PVCs, automatic ConfigMap updates, etc.).

Velero — back up the cluster to S3
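velero install \
  --provider aws \
  --bucket myshop-velero-backups \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --backup-location-config region=ap-northeast-2 \
  --snapshot-location-config region=ap-northeast-2 \
  --no-secret   # assumes the Velero ServiceAccount gets AWS access through IRSA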

# daily automatic backup
velero schedule create daily --schedule="0 2 * * *" --ttl 720h

On that schedule, Velero backs up the K8s objects (read through the API server) to S3 and takes snapshots of the EBS volumes behind the PVs. When you’ve lost the whole cluster, you can recover on a new cluster with velero restore. This is where the PV model of Chapter 9 PV / PVC / StorageClass becomes a full backup target.

Cost — the area that leaks the fastest #

A K8s cluster’s cost balloons quickly unless you deliberately control it. This section lists the standard cost checks for a production cluster. Full cost optimization is covered in Chapter 28 Cost optimization; at the operations-checklist level, we go through five sources here.

1. Combining Karpenter + Spot #

Karpenter, mentioned briefly in Chapter 21 EKS setup, makes the biggest savings on the cost side.

Karpenter NodePool — Spot preference
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: NotIn
          values: ["m5.metal"]   # exclude instances that are too large
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

Karpenter launches instances sized to the pending Pods’ resource requests and, with this NodePool, prefers Spot capacity. Savings of about 50 ~ 70 % vs ON_DEMAND are common, and consolidation automatically removes underutilized nodes. This is where the Karpenter operation covered in Chapter 13 Autoscaling connects directly to cost optimization.
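The effect of mixing Spot and on-demand capacity on the bill can be estimated with a weighted average. The sketch below is only that arithmetic; blended_hourly is a hypothetical helper, and the price, discount, and Spot share in the example are made-up inputs, not AWS prices.

```shell
#!/bin/sh
# blended_hourly ON_DEMAND_PRICE SPOT_DISCOUNT_PCT SPOT_SHARE_PCT
# Weighted average hourly price of a fleet that is partly Spot, partly on-demand.
blended_hourly() {
  awk -v p="$1" -v d="$2" -v s="$3" 'BEGIN {
    spot = p * (1 - d / 100)                               # discounted Spot price
    printf "%.4f\n", spot * s / 100 + p * (1 - s / 100)    # weight by capacity share
  }'
}

# Illustrative inputs: 0.10 per hour on-demand, 60 % Spot discount, 80 % of nodes on Spot.
blended_hourly 0.10 60 80   # prints 0.0520
```

With those inputs the fleet costs roughly half of the all-on-demand price, which shows why the Spot share, not just the discount, is the number to track month over month.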

2. RDS instance right-sizing #

RDS cost is the sum of instance class + storage + IOPS. The instance class takes the biggest share, so a quarterly right-sizing review is standard.

Signals you see in RDS Performance Insights
- average CPU usage below 30% -> consider a class one step smaller
- max_connections usage below 50% -> review both the pool size / the instance
- the max of storage IOPS below 50% of the baseline -> consider downsizing to gp3

Because it’s a managed service, an instance class change finishes with a single reboot. Quarterly right-sizing is one of the biggest cost-saving items. The adoption of PgBouncer in Chapter 23 is the step that naturally lowers this section’s second signal (max_connections usage).
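The three signals above are simple threshold checks, so they can be turned into a small review helper. In this sketch the three percentages are assumed to be read off Performance Insights or CloudWatch by hand and passed in as integers; rds_rightsizing_hints is a hypothetical name.

```shell
#!/bin/sh
# rds_rightsizing_hints AVG_CPU_PCT CONNECTION_USAGE_PCT IOPS_PCT_OF_BASELINE
# Prints one hint per signal that falls below the thresholds listed above.
rds_rightsizing_hints() (
  cpu=$1; conn=$2; iops=$3
  [ "$cpu" -lt 30 ]  && echo "cpu: consider a class one step smaller"
  [ "$conn" -lt 50 ] && echo "connections: review the pool size and the instance"
  [ "$iops" -lt 50 ] && echo "iops: consider downsizing storage to gp3"
  true   # no hint is a valid result, so always exit successfully
)

rds_rightsizing_hints 22 70 80   # prints only the cpu hint
```

Recording the three input numbers next to the output in the quarterly review makes it easy to see whether a signal is a trend or a one-off.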

3. NAT Gateway data transfer #

The NAT Gateway cost touched on briefly in Chapter 21 is an hourly + per-GB data-transfer charge. Cost piles up every time a private-subnet workload communicates with the outside, and it often takes a surprisingly large share.

NAT data-transfer saving patterns
- Adopt VPC Endpoints — AWS services like S3 / ECR / DynamoDB bypass NAT via VPC Endpoints
- RDS in the same VPC is reached through VPC-internal routing, so it never passes through NAT (for a database in another VPC, use VPC peering / Transit Gateway)
- If you call external APIs often, adopt a caching layer

It’s common for a $100 ~ $500 per-month NAT cost to drop to less than half of that with the adoption of VPC Endpoints.
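The shape of that bill is an hourly charge plus a per-GB charge, which makes a before/after estimate easy to sketch. The rates and traffic volumes below are placeholders rather than current AWS prices, and nat_monthly_cost is a hypothetical helper; 730 is used as the number of hours in an average month.

```shell
#!/bin/sh
# nat_monthly_cost HOURLY_RATE PER_GB_RATE GB_PER_MONTH
nat_monthly_cost() {
  awk -v h="$1" -v g="$2" -v gb="$3" 'BEGIN { printf "%.2f\n", h * 730 + g * gb }'
}

# Placeholder rates; look up the real ones for your region.
nat_monthly_cost 0.05 0.05 2000   # prints 136.50 (before VPC Endpoints)
nat_monthly_cost 0.05 0.05 1000   # prints 86.50  (half the traffic moved to endpoints)
```

The hourly part does not shrink, so the saving comes entirely from the per-GB term, which is why the traffic to S3 and ECR is the first thing to move.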

4. EBS snapshots and unused resources #

Unused resources to check regularly
- old EBS snapshots (ones made manually besides RDS automatic snapshots)
- old AMIs
- detached EBS volumes (leftovers from old node groups)
- unused ALBs / NLBs (leftovers from old Ingresses)
- old images in ECR (auto-delete with a lifecycle policy recommended)

The ECR lifecycle policy is organized with one Terraform file.

terraform — ECR lifecycle
resource "aws_ecr_lifecycle_policy" "myshop_api" {
  repository = aws_ecr_repository.myshop_api.name

  policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Keep last 30 production tags"
        selection = {
          tagStatus     = "tagged"
          tagPrefixList = ["v"]
          countType     = "imageCountMoreThan"
          countNumber   = 30
        }
        action = { type = "expire" }
      },
      {
        rulePriority = 2
        description  = "Expire untagged after 7 days"
        selection = {
          tagStatus   = "untagged"
          countType   = "sinceImagePushed"
          countUnit   = "days"
          countNumber = 7
        }
        action = { type = "expire" }
      }
    ]
  })
}

The image tag immutability decision of Chapter 24 CI / CD and this lifecycle policy work together. Because tags are immutable, the same tag is never overwritten and every release adds a new one, so expiring old tags with a lifecycle policy is the practical way to keep the repository from growing without bound.

5. Cost visibility — Kubecost / OpenCost #

Install OpenCost (open source)
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm install opencost opencost/opencost \
  -n opencost --create-namespace

OpenCost combines Prometheus metrics and the cloud cost API to show cost per namespace · workload. “How much does the myshop namespace spend a month, and which workload is the bulk of it” is visible at a glance. The label standard (team / env / cost-center) of Chapter 7 Namespace and labels is used in this section as the key for full cost-responsibility allocation. In environments where cost responsibility falls per team it’s nearly standard, and we point it out in more detail in Chapter 28.

Security — the standard items of a regular check #

A production cluster’s security is not a one-time setup but the accumulation of regular checks. This section covers four standard items (CIS Benchmark · image scan · RBAC audit · admission policy). The full scope of secret operations is covered in Chapter 29 Secrets operation; here we stay at the operations-calendar level.

1. CIS Benchmark — kube-bench #

kube-bench — CIS Benchmark check
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job-eks.yaml
kubectl logs -l job-name=kube-bench

kube-bench automatically checks the items of the CIS Kubernetes Benchmark. Because EKS’s control plane is managed, the check items are limited to the node + workload manifests. Running it quarterly and organizing the FAIL items is standard.
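kube-bench prefixes each result line with a status such as [PASS], [FAIL], or [WARN], so tracking the FAIL count from quarter to quarter takes one filter. A minimal sketch that reads the log from standard input; count_kube_bench_fails is a hypothetical helper name.

```shell
#!/bin/sh
# Counts the lines that start with [FAIL] in kube-bench output read from stdin.
count_kube_bench_fails() {
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true" keeps the exit status clean
  grep -c '^\[FAIL\]' || true
}

# Usage against the Job from the example above:
# kubectl logs -l job-name=kube-bench | count_kube_bench_fails
```

Storing that number with the date in the operations log turns the quarterly run into a trend rather than a one-off report.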

2. Container image scan — Trivy #

.github/workflows/build.yml — adding an image-scan step
- name: Trivy image scan
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myshop-api:${{ steps.meta.outputs.tag }}
    format: sarif
    severity: CRITICAL,HIGH
    exit-code: 1   # build fails if CRITICAL or HIGH is found
    output: trivy-results.sarif

- name: Upload SARIF to the GitHub Security tab
  if: always()   # upload the report even when the scan step fails the build
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: trivy-results.sarif

Trivy scans the base image’s OS packages and the application dependencies for known vulnerabilities (CVEs). If a CRITICAL or HIGH vulnerability is found, the step fails the build so the new image can’t get into ECR. It goes in as one step of the GitHub Actions workflow of Chapter 24 CI / CD pipeline, so the security check becomes a natural part of the deployment pipeline. ECR Enhanced Scanning (paid) is also a managed option that fills the same area.

3. RBAC permission audit #

Check the permission-usage situation
# see all ClusterRoleBindings
kubectl get clusterrolebindings -o wide

# list the permissions granted to a specific ServiceAccount
kubectl auth can-i --list --as=system:serviceaccount:myshop:myshop-api -n myshop

# external tools: krane (Appvia) or rbac-tool

The standard flow is to audit RBAC quarterly and clean up “unused permissions,” “excessive permissions,” and “leftovers from old ServiceAccounts.” The kubectl auth can-i covered in Chapter 16 RBAC / ServiceAccount in depth is the tool of daily checks, and the standard ClusterRoles (view / edit / admin) of Chapter 14 RBAC / NetworkPolicy / ResourceQuota become the baseline of the audit.

4. Policy engine — enforce admission with Kyverno #

If you adopt the Kyverno covered in Chapter 17 Admission Controller and Chapter 18 CRD and Operator, policies like the following are enforced at the admission stage.

kyverno-policies.yaml — the standard production policy set
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-from-ecr
spec:
  validationFailureAction: Enforce
  rules:
    - name: ecr-only
      match:
        resources:
          kinds: [Pod]
      validate:
        message: "Images must come from our ECR registry."
        pattern:
          spec:
            containers:
              - image: "123456789012.dkr.ecr.*"

5 ~ 10 policies like this are the standard guardrails of a production cluster. Even if someone accidentally deploys an external image or a container without limits, it’s rejected at the admission stage. In environments where the ArgoCD auto-sync of Chapter 24 is adopted, this admission guardrail becomes the last line of protection for incident prevention.
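Before rolling out a policy in Enforce mode, it helps to check which image references the wildcard would admit. The sketch below approximates the pattern with a shell glob, which treats * the same way for this simple prefix case; it is a local approximation under that assumption, not how Kyverno itself evaluates policies, and image_allowed is a hypothetical helper.

```shell
#!/bin/sh
# Succeeds when the image reference matches the same prefix pattern as the policy above.
image_allowed() {
  case "$1" in
    123456789012.dkr.ecr.*) return 0 ;;   # same prefix as the ClusterPolicy pattern
    *) return 1 ;;
  esac
}

for image in \
  123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myshop-api:v1.2.3 \
  docker.io/library/nginx:latest
do
  if image_allowed "$image"; then echo "allow $image"; else echo "deny  $image"; fi
done
```

Running the list of images currently deployed through a check like this shows in advance which workloads would be rejected once the policy is enforced.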

The regular operations calendar #

Organizing the items above into a calendar gives the following.

Cadence       Work
Daily         Alert review, Grafana dashboard check
Weekly        Review new security patches, review the ECR Trivy scan results
Monthly       Cost review (OpenCost), unused-resource cleanup, SLI / SLO report
Quarterly     EKS minor upgrade, RDS right-sizing, RBAC audit, recovery drill, kube-bench
Half-yearly   Comprehensive security check (external audit / pentest), DR simulation
Yearly        Cluster architecture review, manifest modernization

The operations team’s goal is to have this calendar on one page, with an owner assigned to each item. The most dangerous pattern is a one-time setup followed by “it’s running fine, so I can forget about it.”

Part 4 retrospective — what the 6 chapters of EKS in Production left in your hands #

Here is a recap of the 6 chapters of Part 4.

  • Chapter 21 EKS cluster setup — organized the VPC · EKS · node group · IRSA · standard add-ons in one codebase with Terraform. The eksctl comparison and Karpenter preview were included too.
  • Chapter 22 App deployment skeleton — wrote the standard 9-object set of Deployment + Service + Ingress + ConfigMap + Secret + ServiceAccount + HPA + PDB + Namespace. We followed through to automatic ALB provisioning with the AWS Load Balancer Controller and per-environment values split for dev / prod with a Helm chart.
  • Chapter 23 DB integration — covered the RDS Terraform, Secrets Manager, secret synchronization with the External Secrets Operator, the PgBouncer connection pool, the Helm hook-based migration Job, and the more advanced shape of RDS IAM authentication.
  • Chapter 24 CI / CD pipeline — organized the GitOps pipeline of ECR push without static keys via GitHub Actions OIDC, manifest repo auto-commit, ArgoCD App of Apps, and Argo Rollouts canary.
  • Chapter 25 Monitoring · alerts — unpacked the one-command install of kube-prometheus-stack, ServiceMonitor + PrometheusRule, the 4 golden signals, the Alertmanager severity split, and the two axes of Loki + CloudWatch.
  • This chapter (Chapter 26 Operations checklist) — covered the regular cycle of EKS upgrade, PITR backup · recovery, Karpenter Spot cost saving, and kube-bench / Trivy / Kyverno security.

The fictional scenario set up in Part 4’s first chapter (myshop-api) grew into a real production cluster over 6 chapters. By this stage, adopting and operating EKS should look concrete rather than abstract.

Parts 1 ~ 4 retrospective — what the 26 chapters left in your hands #

Here is the big picture of this book’s Parts 1 ~ 4, that is, Chapters 1 through 26.

The big picture of Parts 1 ~ 4
[Part 1 Fundamentals (Ch 1 ~ 7)]     the model of a single manifest — the flow of a single `kubectl apply`
[Part 2 Workloads and Operations     the depth of that manifest rolling on a production cluster
   (Ch 8 ~ 14)]                       — StatefulSet, PV, Ingress, health, resources, RBAC
[Part 3 Depth (Ch 15 ~ 20)]          the policy · extension · observation · synchronization layer on top
                                      — CNI, IRSA, Admission, CRD, observability, GitOps
[Part 4 EKS in Production (Ch 21 ~ 26)] a real service on EKS — myshop-api

It’s a structure where each part becomes the input for the next, and by the point you’ve followed through all 26 chapters you have the following.

  • The field of view to read the intent of a single manifest at a glance
  • A state where the decisions of a new cluster’s setup · extension · operation are in your head
  • The instinct to immediately point out where to look first when an incident occurs
  • The shape where the regular cycle of cost · security · upgrade is captured in your head as a calendar

Next chapter — full operations of Part 5 #

At this point myshop-api is in a state where everything — from code to deployment · operations · observation · the regular cycle — is automated. But the incident response side of operations is an area you can’t solve with automation. The shape of the incident, where to look first, and which tool shows which signal still have to be covered separately.

The 4 chapters of Part 5 (Operations · Debugging · Cost) fill that gap.

  • Chapter 27 kubectl debugging patterns — the diagnostic path per incident type for Pod / Service / Ingress / Node. The responsibility-tracing method for recurring incidents like CrashLoopBackOff, ImagePullBackOff, OOMKilled, OutOfSync.
  • Chapter 28 Cost optimization — the chapter’s 5 cost sources. The decision tree of Spot · Karpenter · right-sizing · Savings Plans.
  • Chapter 29 Secrets operation — the base64 limit of K8s Secret, the comparison of sealed-secrets · external-secrets · SOPS, “zero passwords” operation with IRSA + RDS IAM auth, and the four axes of store · rotate · inject · audit.
  • Chapter 30 Upgrade strategy — the K8s minor release cycle, the order of control plane → data plane → add-ons, deprecated API detection (pluto / kubent / apiserver metrics), the safeguards of drain (PDB · terminationGracePeriodSeconds · preStop), and the 1-week-before / day-of / 1-week-after upgrade checklist.

After that, the Part 6 capstone Chapter 31 Deploying a fullstack app on EKS brings the book’s 30 chapters together in one project, and Appendix A From docker-compose to k8s wraps up with a migration guide for Docker / docker-compose users.

Exercises #

  1. Fill in the 6 rows of §“The regular operations calendar” again to fit your own operational scenario (or a learning myshop-api cluster). For each item, record the owner (or your role) · the tool to use · the last run date · the next scheduled date. Check whether the filled calendar fits on one page, and if it doesn’t, decide what to take out.
  2. Upgrade your dev cluster’s EKS version by one minor step. Measure which of the 8 steps of §“The standard flow of an upgrade” eats up the most time, and compare in a single table which items pluto and kubent each caught for deprecated API cleanup. Trace with kubectl get events what protection the PDB of Chapter 22 actually provided during node rolling.
  3. Follow the RDS PITR recovery drill once through the 5 steps of §“Recovery drills.” In the function test of step 4, organize which scenario of myshop-api has the highest verification value, fitted to your own domain, and document the drill result as one page in the Runbook git repo of Chapter 25. Note the improvement points for the next quarter’s drill together.

In one line: a production cluster’s regular cadence combines quarterly EKS minor upgrades, RDS PITR recovery drills, Karpenter Spot, OpenCost cost visibility, and kube-bench / Trivy / Kyverno checks into a daily · weekly · monthly · quarterly · half-yearly · yearly calendar. “There is a backup” is not the standard; “recovery is verified” is the operational standard, and “it looked fine once, so I never check it again” is the most dangerous pattern. The 6 chapters of Part 4 moved the fictional scenario myshop-api from the abstract to the concrete, and the 26 chapters of Parts 1 ~ 4 built a view in which a single manifest and the quarterly operations calendar fit naturally together.
