Chapter 28

Cost Optimization

The second chapter of Part 5, covering the five cost sources flagged in Chapter 26. It ties together the two axes of cost, compute (nodes) and add-ons (LB · storage · network · control plane); what requests mean for cost; right-sizing with VPA · Goldilocks · KRR; the decision tree for Spot · Karpenter · Cluster Autoscaler; bin packing and the descheduler; cost visualization with OpenCost · Kubecost; chargeback / showback by namespace label; and PV · network cost. It closes with a checklist for reviewing next month's bill.

This is the second chapter of Part 5 (Operations · Debugging · Cost). Where Chapter 27, kubectl Debugging Patterns, laid out where to look first when an incident occurs, this chapter deals with the monthly bill rather than an incident. It turns the five cost sources flagged in Chapter 26, The Operations Checklist §“Cost”, into actual cost governance.

K8s cost balloons quickly unless you deliberately keep it in check. The most common answer to “why did this month’s bill double?” is not one big change but the sum of several small leaks. The goal of this chapter is to build an eye for where those leaks come from and to plug them with a one-page checklist each quarter.

The two axes of K8s cost #

Open an EKS bill and the cost splits into two parts.

Axis	Items
Compute (nodes)	EC2 instances, the nodes Karpenter brings up, ARM / x86 / GPU
Add-ons	EKS control plane, ALB / NLB, EBS / EFS, NAT Gateway, data transfer, ECR storage

The ratio differs by environment, but in a typical production cluster compute is roughly 60 ~ 70 % and add-ons 30 ~ 40 %. Compute savings are the most visible and usually the largest, but it’s easy to forget that among the add-ons, NAT Gateway and data transfer often leak more than compute does. Looking at the two axes together is the starting point of cost optimization.

The body of this chapter is how the trio from Chapter 21, EKS Cluster Setup §“The first impression of cost”, where we put prod’s starting cost at $200 ~ $300 a month, shows up in real operations.

The cost meaning of requests #

Chapter 11, Resource Requests and Limits, treated requests as the scheduler’s input. From an operational viewpoint, requests are a reservation of resources that the cluster bills for. The sum of requests sets the minimum node capacity the cluster must bring up: the Cluster Autoscaler of Chapter 13, Autoscaling, adds a node when a Pending Pod’s requests no longer fit on any existing node, regardless of how much of the reserved capacity is actually in use.

The problem is over-request.

a common example of over-request — Chapter 11's shortcoming, again in operations

resources:
  requests:
    cpu: "2"          # actual average usage 0.1 core
    memory: "4Gi"     # actual average usage 200Mi
  limits:
    cpu: "4"
    memory: "8Gi"

If this manifest is applied to 100 Pods, the cluster reserves 200 cores + 400 Gi, while what’s actually used is about 10 cores + 20 Gi. You pay node cost for capacity that sits empty. It’s the most common source of cost leakage on a production cluster, and fixing it is the single change that can produce the biggest savings.
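The arithmetic behind this example can be checked with a short sketch. The per-Pod figures come from the comments in the manifest above; the function name and output format are only illustrative.

```python
# Worked arithmetic for the over-request example above.
PODS = 100
REQUEST_CPU, REQUEST_MEM_GI = 2.0, 4.0     # per-Pod requests from the manifest
USED_CPU, USED_MEM_GI = 0.1, 200 / 1024    # per-Pod average usage (200Mi in Gi)


def reserved_vs_used(pods, request, used):
    """Return (reserved, actually used, idle fraction) for one resource."""
    reserved = pods * request
    actually_used = pods * used
    return reserved, actually_used, 1 - actually_used / reserved


cpu = reserved_vs_used(PODS, REQUEST_CPU, USED_CPU)
mem = reserved_vs_used(PODS, REQUEST_MEM_GI, USED_MEM_GI)
print(f"CPU: reserved {cpu[0]:.0f} cores, used {cpu[1]:.0f}, idle {cpu[2]:.0%}")
print(f"Memory: reserved {mem[0]:.0f} Gi, used {mem[1]:.1f} Gi, idle {mem[2]:.0%}")
```

Under these numbers roughly 95 % of the reserved capacity is idle, which is the share of node cost that right-sizing can target.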

The rule for appropriate requests is roughly the following.

CPU requests = average usage × 1.2 ~ 1.5
Memory requests = P95 usage × 1.1 ~ 1.3 (memory is not compressible: a Pod that exceeds its limit is OOM-killed rather than throttled)
limits are 2 ~ 4 times requests (Burstable QoS) or equal (Guaranteed QoS)

To measure the average and P95, observe the container_cpu_usage_seconds_total and container_memory_working_set_bytes metrics of Chapter 25, Monitoring · Alerts for about a month.
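As a minimal sketch of the rule above, the following turns a month of usage samples into suggested requests and limits. The sample values, the chosen factors (1.3 for CPU, 1.2 for memory, limits at 2× requests), and the function names are assumptions for illustration; in practice the samples would come from the Prometheus metrics just mentioned.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_requests(cpu_cores, memory_mib, cpu_factor=1.3, memory_factor=1.2, limit_ratio=2):
    """Apply the rule of thumb: CPU from the average, memory from P95."""
    cpu_request_m = sum(cpu_cores) / len(cpu_cores) * cpu_factor * 1000
    memory_request_mi = percentile(memory_mib, 95) * memory_factor
    return {
        "requests": {"cpu": f"{round(cpu_request_m)}m",
                     "memory": f"{round(memory_request_mi)}Mi"},
        "limits": {"cpu": f"{round(cpu_request_m * limit_ratio)}m",
                   "memory": f"{round(memory_request_mi * limit_ratio)}Mi"},
    }


cpu_samples = [0.08, 0.10, 0.12, 0.10]        # cores, made-up samples
memory_samples = list(range(100, 300, 10))    # MiB, made-up: 100, 110, ..., 290
suggestion = suggest_requests(cpu_samples, memory_samples)
print(suggestion)
```

The output is only a starting point for a PR; the factors should be tuned per workload within the ranges given above.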

VPA recommendation only — recommended values via PR #

The Vertical Pod Autoscaler (VPA) is a controller that analyzes a workload’s actual usage and recommends appropriate requests / limits. Auto-apply mode is risky in production because it evicts Pods to change their requests, so restarts become frequent; recommendation-only mode is the pattern that’s safe in operations.

VPA — recommendation only

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myshop-api-vpa
  namespace: myshop
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myshop-api
  updatePolicy:
    updateMode: "Off"   # recommend only, no auto-apply

updateMode: Off is the key — VPA only analyzes metrics and writes recommended values to status, never touching the Pod’s actual spec.

checking the recommended values

kubectl describe vpa myshop-api-vpa -n myshop

The “Container Recommendations” section of this output holds the appropriate CPU / Memory. The flow of reflecting those values into the Helm values of Chapter 22, The App Deployment Skeleton via PR is the operational standard. Automate the measurement, let a human make the decision — VPA’s safest operational mode.
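To decide which workloads deserve a PR first, it can help to express the gap between current requests and the VPA target as a percentage. The sketch below assumes the recommendation has been read from the VPA object’s status (for example with kubectl get vpa ... -o json); the container name api and all values are hypothetical.

```python
# Sketch: compare current requests with a VPA recommendation.
# The dictionary mimics the status.recommendation part of a VPA object;
# the container name and the numbers are made up.
vpa_status = {
    "recommendation": {
        "containerRecommendations": [
            {"containerName": "api", "target": {"cpu": "120m", "memory": "262144k"}}
        ]
    }
}
current_requests = {"api": {"cpu": "2", "memory": "4Gi"}}

MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "k": 10**3, "M": 10**6, "G": 10**9}


def parse_cpu(quantity):
    """Convert a CPU quantity such as '120m' or '2' to cores."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)


def parse_memory(quantity):
    """Convert a memory quantity such as '4Gi' or '262144k' to bytes."""
    for suffix, factor in MEM_UNITS.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)  # plain bytes


def over_request_pct(current, target):
    """Percentage of the current request that the recommendation would free."""
    return (current - target) / current * 100


def gaps(status, requests):
    result = {}
    for rec in status["recommendation"]["containerRecommendations"]:
        cur = requests[rec["containerName"]]
        tgt = rec["target"]
        result[rec["containerName"]] = {
            "cpu": over_request_pct(parse_cpu(cur["cpu"]), parse_cpu(tgt["cpu"])),
            "memory": over_request_pct(parse_memory(cur["memory"]),
                                       parse_memory(tgt["memory"])),
        }
    return result


report = gaps(vpa_status, current_requests)
for name, gap in report.items():
    print(f"{name}: cpu over-requested by {gap['cpu']:.0f} %, memory by {gap['memory']:.0f} %")
```

A negative percentage would mean the workload is under-requested, which is just as worth a PR.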

Goldilocks and Robusta KRR provide the same direction with a different UI. Goldilocks gathers VPA recommendations per namespace into one dashboard, and KRR prints cluster-wide recommended values with one CLI command. Whichever tool you use, the point is that a human reflects the measured recommended values via PR.

bin packing — node utilization and the descheduler #

Even with appropriate requests, cost leaks if nodes are filled inefficiently. Bin packing is about how densely each node’s resources are used.

measuring node utilization

kubectl top nodes
kubectl describe node <node-name>   # Allocatable vs Allocated

The “Allocated resources” section of kubectl describe node shows the sum of requests on that node. 60 ~ 80 % of a node’s available capacity is the appropriate operational utilization. Below 50 % means the node is over-provisioned; above 90 % risks new Pods failing to schedule.
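The thresholds above translate directly into a small classifier. This is a sketch only: the node names and millicore figures are hypothetical, and the "borderline" label for the 50 ~ 60 % and 80 ~ 90 % bands is an assumption, since the text does not name those ranges.

```python
def classify_node(allocatable, requested):
    """Classify a node by the share of allocatable capacity covered by requests."""
    ratio = requested / allocatable
    if ratio < 0.50:
        return "over-provisioned"
    if ratio > 0.90:
        return "scheduling-risk"
    if 0.60 <= ratio <= 0.80:
        return "healthy"
    return "borderline"  # 50-60 % or 80-90 %: not named in the text, worth watching


# CPU in millicores: (allocatable, sum of requests); hypothetical nodes
nodes = {"node-a": (3920, 2800), "node-b": (3920, 1200), "node-c": (3920, 3700)}
for name, (allocatable, requested) in nodes.items():
    print(name, f"{requested / allocatable:.0%}", classify_node(allocatable, requested))
```

The inputs correspond to the Allocatable and Allocated values that kubectl describe node reports for each node.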

When nodes with low utilization accumulate, the descheduler comes in.

descheduler — move Pods off under-utilized nodes

apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
  - name: HighNodeUtilization
    pluginConfig:
      - name: HighNodeUtilization
        args:
          # a node counts as under-utilized when its requested cpu, memory,
          # and pod count are all below these percentages
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
    plugins:
      balance:
        enabled: [HighNodeUtilization]

The plugin that compacts nodes is HighNodeUtilization. Despite its name, LowNodeUtilization does the opposite: it evicts Pods from busy nodes so that they spread onto under-utilized ones. When the descheduler evicts the Pods of a node below 20 % utilization, Karpenter (or Cluster Autoscaler) reclaims that node. The PodDisruptionBudget of Chapter 22 is decisive again at this point: a workload without a PDB is the first victim of an eviction.

Spot nodes — classifying safe workloads #

EC2 Spot instances are 50 ~ 90 % cheaper than ON_DEMAND, but AWS can reclaim them after a 2-minute notice. Using Spot safely is the biggest area of cost savings.

The standard is to split workloads into two classes.

Class	Characteristics	Spot suitability
interruptible	stateless, multiple replicas, fast restart	suitable — an API server like myshop-api
stateful / critical	StatefulSet, single instance, long initialization	unsuitable for Spot — at least some on ON_DEMAND

The standard way to express this classification as a manifest is node taint + tolerations.

Karpenter NodePool — a taint on spot nodes

spec:
  template:
    spec:
      taints:
        - key: karpenter.sh/capacity-type
          value: spot
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]

Pod — a toleration that makes it schedulable on spot nodes

spec:
  tolerations:
    - key: karpenter.sh/capacity-type
      value: spot
      operator: Equal
      effect: NoSchedule

This pattern is a safety device: only workloads that have deliberately declared they accept spot are scheduled onto spot nodes. It prevents a newly created workload from accidentally landing on spot and being reclaimed.

Fargate Spot is a managed option in the same direction, producing about 70 % savings without you having to worry about the nodes themselves. Note, however, that it is offered for Amazon ECS; Fargate on EKS has no Spot capacity type, and it also cannot run some workloads such as GPU / DaemonSet / privileged Pods.

Karpenter — the decision tree against Cluster Autoscaler #

This chapter covers the model pointed at in Chapter 13, Autoscaling §“Karpenter — EKS’s faster alternative.” The difference between Cluster Autoscaler and Karpenter is organized in one table.

Aspect	Cluster Autoscaler	Karpenter
model	resizing a predefined Node Group	seeing Pending Pods and picking an instance on the spot
instance diversity	the predefined types of the Node Group	auto-compares 100+ instance types
provisioning speed	1 ~ 2 minutes (ASG-based)	30 seconds ~ 1 minute (direct EC2 calls)
consolidation	a separate tool (descheduler, etc.)	built in — auto-consolidates under-utilized nodes
operational burden	configuration per Node Group	two CRDs, NodePool / NodeClass

The mental model of Karpenter’s NodePool and NodeClass is the key.

NodeClass — AWS resource definition (where to bring up)

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@latest   # required in the v1 API; pin a specific version in prod
  role: "KarpenterNodeRole-myshop-prod"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: myshop-prod
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: myshop-prod

NodePool — scheduling policy (what kind of node to bring up)

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["t", "m", "c"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["2", "4", "8", "16"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 1000

The separation is NodeClass = where, NodePool = what. You put several NodePools (general / batch / gpu, etc.) on top of one NodeClass, and use a workload’s labels / taints to select the appropriate NodePool.

on-demand fallback — the safety line for spot reclamation #

One of Karpenter’s strengths is on-demand fallback. If you put both spot and on-demand in requirements, Karpenter prefers spot and automatically falls back to on-demand when spot capacity cannot be obtained, for example while reclamation is frequent. It’s a pattern that takes spot’s cost advantage while preserving availability on reclamation.
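How much a mixed fleet saves depends on the share of nodes that actually run on spot. The following sketch estimates a blended hourly cost; the node count, the $0.10 on-demand price, the 70 % spot share, and the 60 % discount are all hypothetical inputs (the discount sits inside the 50 ~ 90 % range quoted earlier).

```python
def blended_hourly_cost(nodes, on_demand_price, spot_fraction, spot_discount):
    """Hourly cost of a fleet where `spot_fraction` of the nodes run on spot."""
    spot_nodes = nodes * spot_fraction
    on_demand_nodes = nodes - spot_nodes
    spot_price = on_demand_price * (1 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price


baseline = blended_hourly_cost(10, 0.10, 0.0, 0.6)   # everything on on-demand
mixed = blended_hourly_cost(10, 0.10, 0.7, 0.6)      # 70 % of nodes on spot
print(f"all on-demand: ${baseline:.2f}/h, 70 % spot: ${mixed:.2f}/h, "
      f"saving {1 - mixed / baseline:.0%}")
```

When fallback kicks in, the spot fraction drops and the saving shrinks with it, which is the price paid for availability.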

The decision tree #

choosing between Cluster Autoscaler and Karpenter

- a workload's resource needs are constant, and 1 ~ 2 instance types are enough
  -> Cluster Autoscaler is simpler
- workloads are diverse, and cost savings come first
  -> Karpenter
- making full use of spot's cost advantage
  -> Karpenter (auto instance diversification)
- a cloud other than AWS
  -> Cluster Autoscaler (Karpenter is AWS-centric)

This book’s standard path is to start with Cluster Autoscaler in Chapters 21 ~ 22 and switch to Karpenter once the traffic pattern has settled. Running the two tools on the same cluster simultaneously is not recommended — when the decision-maker for node reclamation becomes two, they interfere with each other’s behavior.

Reclaiming idle resources — a bundle of checkup tools #

After a month of operation, unused / idle resources accumulate. These are the automation tools for regular checkups.

Tool	What it does
Goldilocks	VPA recommended values as a per-namespace dashboard
Robusta KRR	print cluster-wide recommended values with one CLI command
AWS Cost Anomaly Detection	auto-alert on a spike versus the usual pattern
kubectl-rightsize	report the gap between requests vs actual usage

The standard flow is to gather these tools’ outputs into one page each quarter and put them into a PR. It’s common for 20 ~ 30 % of workloads to diverge from recommended values in one quarter, and that much right-sizing typically produces 20 ~ 40 % savings in compute cost.

Cost visualization — OpenCost / Kubecost / Cost Allocation Tags #

OpenCost — the open-source standard #

installing OpenCost

helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm install opencost opencost/opencost \
  -n opencost --create-namespace

OpenCost combines Prometheus metrics with the AWS Cost API to compute cost by namespace · Deployment · label. It combines naturally with the Prometheus of Chapter 25, and adding it as a Grafana data source lets you build a cost dashboard.

Kubecost — the commercial enhancement #

Kubecost is the commercial extension of OpenCost — finer visualization, automation of recommended values, SSO integration, multi-cluster cost consolidation, and so on are added. It’s natural for a small team to start with OpenCost and move to Kubecost at the point cost management becomes serious work.

AWS Cost Allocation Tags #

Terraform — default_tags (recalling Chapter 21)

provider "aws" {
  region = "ap-northeast-2"

  default_tags {
    tags = {
      Project     = "myshop"
      Environment = "prod"
      Team        = "backend"
      CostCenter  = "engineering"
    }
  }
}

The default_tags written in Chapter 21, EKS Setup is the foundation of this chapter’s cost allocation. Activating these tags as “Cost Allocation Tags” in the AWS Billing console makes Cost Explorer and the bill split by tag. If you don’t establish a tag standard in advance, cost allocation a year later is nearly impossible, so establish it at setup time.

Cost allocation by namespace / label — chargeback vs showback #

In an environment where several teams share one cluster, the question of who used how much becomes a real operational issue. There are two models.

Model	Meaning
showback	shows each team’s usage. Doesn’t bill the cost. Induces behavior change
chargeback	actually deducts from each team’s budget. A forced savings effect

The starting point in most operational environments is showback. Sending a monthly per-team usage report from OpenCost’s output naturally gets a team to start cleaning up its own workloads’ requests and labels. The standard team / cost-center labels of Chapter 7, Namespace and Labels are used in this section as the key for real cost responsibility.

In an environment where the App of Apps of Chapter 20, GitOps has been introduced, embedding a team label in each ArgoCD Application manifest makes the label propagate automatically across the whole manifest.
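A monthly showback report is, at its core, a sum of cost grouped by the team label. The sketch below shows that aggregation on hand-written rows; the row format is a hypothetical one chosen for illustration, not OpenCost’s actual output schema, and the costs are made up.

```python
from collections import defaultdict


def showback(rows, label="team"):
    """Sum monthly cost per label value, largest first."""
    totals = defaultdict(float)
    for row in rows:
        owner = row["labels"].get(label, "unlabeled")
        totals[owner] += row["monthly_cost"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-workload costs; a real report would read them from OpenCost.
rows = [
    {"workload": "myshop-api", "labels": {"team": "backend"}, "monthly_cost": 420.0},
    {"workload": "pgbouncer", "labels": {"team": "backend"}, "monthly_cost": 60.0},
    {"workload": "grafana", "labels": {"team": "platform"}, "monthly_cost": 35.0},
    {"workload": "debug-pod", "labels": {}, "monthly_cost": 12.5},
]
report = showback(rows)
for team, cost in report:
    print(f"{team:<10} ${cost:>8.2f}")
```

Keeping an explicit "unlabeled" bucket makes missing labels visible, which is often the first thing a showback report fixes.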

PV / EBS cost — gp3 and lifecycle #

Storage cost is often the next-largest share after instance cost. We point at three items.

gp3 vs gp2 #

Switching the EBS volume type from gp2 to gp3 is the biggest cost-side change. gp3 is about 20 % cheaper per GB, with a baseline of 3,000 IOPS and 125 MiB/s throughput regardless of volume size. Setting gp3 in the StorageClass manifest of Chapter 9, PV / PVC / StorageClass is the operational standard.

StorageClass — gp3 as default

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
allowVolumeExpansion: true

The accumulation of snapshots #

EBS snapshots accumulate automatically. RDS’s automatic snapshots are managed by the backup_retention_period of Chapter 23, but snapshots made by K8s’s VolumeSnapshot or Velero’s backups pile up forever without a separate lifecycle.

checking snapshot lifecycle

aws ec2 describe-snapshots \
  --owner-ids self \
  --query 'Snapshots[?StartTime<=`2024-01-01`].[SnapshotId,VolumeSize,Description]' \
  --output table

Reclaiming unused PVs #

checking PVs in the Released state

kubectl get pv | grep Released

A PV in the Released state is one whose PVC has been deleted while the PV itself remains. The underlying EBS volume keeps being charged. If the ReclaimPolicy is Retain, manual cleanup is needed; if Delete, the EBS volume is deleted along with the PVC.
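For a recurring checkup, the grep above can be replaced by a small script over the JSON output, which also shows the size and reclaim policy of each leftover volume. This is a sketch on a trimmed, made-up sample of kubectl get pv -o json; the PV names are hypothetical.

```python
import json


def released_volumes(pv_list):
    """Return (name, size, reclaim policy) for every PV in the Released phase."""
    found = []
    for pv in pv_list.get("items", []):
        if pv["status"]["phase"] == "Released":
            spec = pv["spec"]
            found.append((pv["metadata"]["name"],
                          spec["capacity"]["storage"],
                          spec["persistentVolumeReclaimPolicy"]))
    return found


# Trimmed, made-up sample of `kubectl get pv -o json`
sample = json.loads("""
{"items": [
  {"metadata": {"name": "pvc-1111"},
   "spec": {"capacity": {"storage": "100Gi"}, "persistentVolumeReclaimPolicy": "Retain"},
   "status": {"phase": "Released"}},
  {"metadata": {"name": "pvc-2222"},
   "spec": {"capacity": {"storage": "20Gi"}, "persistentVolumeReclaimPolicy": "Delete"},
   "status": {"phase": "Bound"}}
]}
""")
leftovers = released_volumes(sample)
for name, size, policy in leftovers:
    print(f"{name}: {size}, reclaim policy {policy}")
```

In real use the sample would be replaced by the command’s output, for example piped in through standard input.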

Network cost — the most hidden leak #

AWS network cost is the hardest item to find on a bill. There are three sources.

Cross-AZ traffic #

Even within the same region, traffic between Pods in different AZs is charged about $0.01 per GB in each direction, so every GB that crosses an AZ boundary effectively costs about $0.02. myshop-api’s 5 Pods are scattered across 3 AZs, and each time a Pod communicates with PgBouncer, some of that traffic is cross-AZ.

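Topology-aware routing is the K8s feature that reduces this. The service.kubernetes.io/topology-mode annotation used below applies to 1.27+; on 1.23 ~ 1.26 the equivalent annotation is service.kubernetes.io/topology-aware-hints: auto.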

Service — topology-aware hints

apiVersion: v1
kind: Service
metadata:
  name: pgbouncer
  namespace: myshop
  annotations:
    service.kubernetes.io/topology-mode: Auto
spec:
  # ...

When this annotation is on, an endpoint in the same AZ is preferred. It’s typical for cross-AZ traffic to drop to half or less.
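A rough estimate makes the effect concrete. The sketch below assumes $0.01 per GB in each direction, 30,000 GB of Pod-to-PgBouncer traffic a month, endpoints picked evenly across 3 AZs before the change, and the cross-AZ share halved afterwards; all four figures are assumptions for illustration.

```python
def cross_az_monthly_cost(gb_per_month, cross_az_fraction, rate_per_gb_each_way=0.01):
    """Cost of the share of traffic that crosses an AZ boundary.

    Each GB is charged once leaving one AZ and once entering the other,
    hence the factor of 2.
    """
    return gb_per_month * cross_az_fraction * rate_per_gb_each_way * 2


# Assumption: endpoints are picked evenly across 3 AZs, so 2 of 3 calls leave the AZ.
before = cross_az_monthly_cost(30_000, 2 / 3)
# Assumption: topology-aware routing halves the cross-AZ share.
after = cross_az_monthly_cost(30_000, 1 / 3)
print(f"before: ${before:.0f}/month, after: ${after:.0f}/month")
```

The actual share depends on how evenly Pods and endpoints are spread per zone, so measure it before and after rather than relying on the estimate.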

NAT Gateway data transfer #

The NAT cost pointed at in Chapter 21, EKS Setup §“single NAT vs NAT per AZ” has two parts: per hour and per GB. The hourly part comes to about $32/month per NAT, and the per-GB part grows with the volume of data processed.
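The two parts can be put side by side with a short calculation. The rates below ($0.045 per hour and $0.045 per GB) are assumptions chosen so that the hourly part matches the roughly $32/month above; actual rates vary by region, so check the current pricing before relying on them.

```python
def nat_monthly_cost(nat_count, gb_processed, hourly_rate=0.045, per_gb_rate=0.045,
                     hours_per_month=730):
    """Return (fixed hourly part, data processing part) in dollars per month."""
    fixed = nat_count * hourly_rate * hours_per_month
    data = gb_processed * per_gb_rate
    return fixed, data


for nat_count, gb in [(1, 500), (3, 500), (3, 5_000)]:
    fixed, data = nat_monthly_cost(nat_count, gb)
    print(f"{nat_count} NAT, {gb} GB: fixed ${fixed:.2f} + data ${data:.2f} "
          f"= ${fixed + data:.2f}")
```

With little traffic the number of NATs dominates; once traffic grows into terabytes the per-GB part takes over, which is why reducing data volume matters more than the NAT count.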

patterns for reducing NAT data transfer (recalling Chapter 26)

- VPC Endpoint — S3, ECR, DynamoDB, CloudWatch, Secrets Manager
  -> the biggest effect. Bypasses 60 ~ 80% of NAT traffic.
- caching external APIs — an in-instance cache or ElastiCache
  -> reduces repeated calls.
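- communication with other AWS services in the same region via VPC-internal routing
  -> S3 and DynamoDB gateway endpoints are free; ECR and other interface endpoints are billed per hour and per GB, at a lower per-GB rate than NAT.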

The ALB’s LCU #

The ALB’s cost unit is the LCU (Load Balancer Capacity Unit): each hour is charged on the largest of four dimensions, namely new connections, active connections, processed data, and rule evaluations. Having the one ALB built in Chapter 22, The App Deployment Skeleton, serve the three entry points of myshop-api, ArgoCD, and Grafana also saves cost, because you don’t run a separate ALB per workload.
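The "largest of four dimensions" rule is easy to see in code. The per-LCU capacities below are assumptions taken from AWS’s published ALB pricing at the time of writing and should be verified before use; the usage figures are hypothetical.

```python
# Per-LCU capacities: assumptions based on AWS's published ALB pricing, verify before use.
LCU_CAPACITY = {
    "new_connections_per_s": 25,
    "active_connections": 3000,
    "processed_gb_per_hour": 1.0,
    "rule_evaluations_per_s": 1000,
}


def lcus_consumed(usage):
    """Return (LCUs billed, dominant dimension) for one hour of usage."""
    per_dimension = {name: usage[name] / cap for name, cap in LCU_CAPACITY.items()}
    dominant = max(per_dimension, key=per_dimension.get)
    return per_dimension[dominant], dominant


usage = {"new_connections_per_s": 10, "active_connections": 4500,
         "processed_gb_per_hour": 2.2, "rule_evaluations_per_s": 300}
lcus, dominant = lcus_consumed(usage)
print(f"billed at {lcus:.2f} LCU, driven by {dominant}")
```

Knowing which dimension dominates tells you what to optimize: here it is processed data, so trimming response size would help more than reducing connection counts.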

A checklist for reviewing next month’s bill #

We organize a one-page standard checklist for monthly cost checkups.

monthly cost review — one page

[Compute]
- node utilization — is the average of kubectl top nodes 60 ~ 80%?
- spot ratio — spot usage versus interruptible workloads
- Karpenter consolidation — the number of node consolidation events in the last month
- VPA / KRR recommended values — the list of unreflected workloads and PR progress

[Add-ons]
- NAT data transfer volume — change versus the last month
- services where VPC Endpoint can be introduced — S3 / ECR / Secrets Manager
- cross-AZ traffic ratio — the list of Services where topology-aware routing can apply
- the number of ALB / NLB — whether there are entry points that can be consolidated

[Storage]
- gp2 PV remnants — candidates to migrate to gp3
- EBS volumes of Released PVs — the amount reclaimable
- old EBS / RDS snapshots — resources without a lifecycle applied
- ECR images — the lifecycle policy application state

[Visualization]
- the top 1, 2, 3 costs by team / by workload in OpenCost
- items that increased +20% or more versus the last month
- the alert history of Cost Anomaly Detection
- next month's forecast vs the budget

The goal of operations is for this checklist to fit on one page and be filled in regularly once a month. Don’t try to solve every item at once; the cumulative model of improving only one or two items each month is the sustainable flow.
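The "+20% or more versus the last month" item in the checklist lends itself to automation. Here is a minimal sketch that compares two months of per-item cost; the item names and amounts are hypothetical, and treating a brand-new cost item as always flagged is an assumption.

```python
def flag_increases(last_month, this_month, threshold=0.20):
    """Return items whose cost rose by at least `threshold`, biggest rise first."""
    flagged = []
    for item, now in this_month.items():
        before = last_month.get(item, 0.0)
        if before == 0:
            if now > 0:
                flagged.append((item, float("inf")))  # new cost item, always flagged
            continue
        change = (now - before) / before
        if change >= threshold:
            flagged.append((item, change))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


last_month = {"EC2": 1200.0, "NAT Gateway": 90.0, "EBS": 150.0}
this_month = {"EC2": 1250.0, "NAT Gateway": 140.0, "EBS": 180.0, "ALB": 25.0}
flagged = flag_increases(last_month, this_month)
for item, change in flagged:
    label = "new" if change == float("inf") else f"+{change:.0%}"
    print(f"{item}: {label}")
```

The flagged list gives the one or two items to work on this month, in line with the cumulative approach described above.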

Exercises #

Install VPA recommendation mode and Goldilocks on the dev cluster, analyze a month’s worth of myshop-api data, and obtain recommended requests / limits. Compute the gap between the current manifest’s requests and the recommended values as a percentage, and verify with OpenCost’s output how that gap is reflected in a month’s node cost. Write a PR reflecting the recommended values into the Helm values of Chapter 22, and compare the change in node utilization before and after applying with kubectl top nodes.
Write both capacity types, spot + on-demand, into Karpenter’s NodePool, and add a toleration to the myshop-api Deployment that accepts the spot taint. Measure how often spot instances are reclaimed over a week, and trace Karpenter falling back to on-demand on reclamation with kubectl get events. In one paragraph, organize the cost-savings effect of spot and how the PodDisruptionBudget of Chapter 22 plays a protective role at the time of reclamation.
Fill in the one page of §“A checklist for reviewing next month’s bill” on your own production cluster (or learning cluster). For each item, write the current value · the change versus the last month · the improvement possibility for next month, and set the top 3 sources of leakage as priorities. Combining with the regular operations calendar of Chapter 26, put into the calendar which item you’ll check at which point on a quarterly basis.

In one line: K8s cost is usually split between compute (60 ~ 70 %) and add-ons (30 ~ 40 %), and the biggest leaks are over-request and NAT Gateway · cross-AZ data transfer. The main levers are VPA recommendations plus Goldilocks / KRR, Spot + Karpenter with on-demand fallback, 60 ~ 80 % node utilization with descheduler + bin packing, gp3 + ECR lifecycle cleanup, and topology-aware routing + VPC Endpoint savings. Showing cost to teams with OpenCost and per-label chargeback / showback naturally encourages self-cleanup. Improving only one or two items each month is the sustainable path.

Next chapter #

If this chapter dealt with cost, the next chapter is the security chapter. Security on a production cluster is not a one-time setup but the accumulation of regular checkups, and it’s the stage of gathering into one operations manual the topics we’ve pointed at piecemeal across several chapters of this book (Chapter 6 ConfigMap · Secret, Chapter 14 RBAC / NetworkPolicy, Chapter 16 IRSA, Chapter 17 Admission Controller, Chapter 23 External Secrets).

Chapter 29, Secret Operations covers the fundamental limit of a K8s Secret (base64 is not encryption), a comparison of the three options sealed-secrets / external-secrets / SOPS, the “zero passwords” operation of IRSA + RDS IAM auth, and the four axes of rotation · injection · audit.