Chapter 13

Autoscaling

A walkthrough of the three dimensions of automatic adjustment that absorb a production cluster's load swings without human intervention. The roles of HPA (Pod count) · VPA (Pod resources) · Cluster Autoscaler (node count), the metrics-server prerequisite, HPA's autoscaling/v2 manifest and proportional algorithm, the scale-up · scale-down asymmetry, custom metrics and KEDA, VPA's updateMode and the HPA · VPA conflict, and Karpenter.

Up through Chapter 12, Health checks, the story was how a single Pod stands on its own. In Chapter 11, resources.requests / limits, we set how much resource a single Pod requests and how much it may use, and in Chapter 12 we used three kinds of probe to express whether that Pod is alive, ready to receive traffic, or still starting up. That completes the single-Pod model. Operational load, however, swings on top of that model: lunchtime traffic doubles, early-morning traffic drops to a tenth, and a marketing campaign can hold traffic at five times the usual level all day. An operation in which a human chases each swing with kubectl scale deployment ... --replicas=... does not last long. This chapter brings together the three dimensions of automatic adjustment that fill that gap: HPA, VPA, and Cluster Autoscaler.

By the end of this chapter you’ll have an automation layer that keeps humans from chasing load swings by hand. Along the way, the premises that automation depends on also come into view: workloads that declare requests, a running metrics-server, reasonable values for behavior, and the discovery tags CA needs on its node groups.

What autoscaling solves #

In a production cluster, the pattern in which load varies is usually one of three — variation by time of day (day · night), event-driven surges (campaign · sale · news), and cumulative growth from added workloads. Operations adjusted by human hand usually reach a limit through the following stages.

At first, setting replicas generously is enough. You always keep about double the usual up.
As time passes, you learn that the “generous value” is insufficient in some time slots and wasteful in others. Capacity falls short on one side and cost leaks on the other.
Someone makes a schedule that applies different replicas to three slots: weekday daytime, weekday night, and weekend. For a while it works.
A campaign comes in, or external traffic spikes unexpectedly, and a human gets up at dawn and types kubectl scale. Soon that becomes a recurring task.

K8s answers this problem with three kinds of autoscaler. Each automatically adjusts a different axis.

Autoscaler	What it adjusts	Signal	Target
HPA (Horizontal Pod Autoscaler)	Pod count (`replicas`)	CPU · memory utilization, custom metric	Deployment / StatefulSet
VPA (Vertical Pod Autoscaler)	A Pod’s resource requests · limits (`requests` / `limits`)	Past CPU · memory usage trend	Deployment / StatefulSet
Cluster Autoscaler (CA)	Node count	Pods in the `Pending` state, empty nodes	The cloud’s node group (ASG / MIG / VMSS)

What matters is that the three axes complement one another. Even if HPA adds Pods, the new Pods stall in Pending if the nodes have no headroom, and CA then brings up more nodes. VPA, on a separate cycle, reports as a recommended value that “this workload actually needs about 1Gi of memory.” Load variation is absorbed without human intervention only when the three work together.

The premise called metrics-server #

For autoscaling to work, some component must report current resource usage inside the cluster. The K8s core doesn’t hold those metrics itself. Instead it defines a standardized interface (metrics.k8s.io), and the usual pattern is to install a separate component that serves that interface. The most common implementation is metrics-server.

metrics-server periodically scrapes the kubelet’s /metrics/resource endpoint on each node in the cluster and holds the node and Pod CPU · memory usage in memory. kubectl top or the HPA controller queries those figures via the API.

check whether metrics-server is installed

kubectl top nodes
kubectl top pods -A

example output

NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1      450m         22%    1.8Gi           45%
node-2      320m         16%    1.5Gi           37%

If figures show up, metrics-server is alive, and if a message like error: Metrics API not available comes out, it’s not installed or it’s dead. The per-environment installation methods are as follows.

Environment	metrics-server status
minikube	Enabled with one `minikube addons enable metrics-server`
kind	Needs manual install (`kubectl apply -f` or Helm)
EKS	Needs manual install. Helm or the official manifest
GKE	Enabled by default
AKS	Enabled by default

On EKS, metrics-server is missing right after you create a production cluster. It’s the first component you must install to use HPA · VPA. If you want to run HPA on custom metrics like queue length · request count beyond CPU · memory, then instead of (or alongside) metrics-server, components like Prometheus and the Prometheus Adapter, or KEDA, take that role — we look at this again later. Prometheus’s own setup and the cluster observability model are covered in Chapter 19, Observability.

HPA — automatically adjusting Pod count #

The autoscaler used most often and adopted first is HPA. It’s a model where instead of a human writing the replicas field, K8s fills it in automatically by looking at the average value of a metric.

The HPA manifest — CPU-based #

The simplest shape is based on CPU utilization.

hpa-cpu.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Let’s note the key fields one line each.

apiVersion: autoscaling/v2 — HPA’s current stable version. v1 supports only a single metric, CPU utilization. v2 adds multiple metrics, custom metrics, and the behavior field that makes scale up and scale down asymmetric. New manifests should almost always use v2.
scaleTargetRef — the target whose replicas to adjust. You can attach it to a Deployment / StatefulSet / ReplicaSet. A DaemonSet is bound to the node count, so it’s not an HPA target.
minReplicas / maxReplicas — the lower and upper bounds of automatic adjustment. A safety net that prevents the accident of unintentionally going to 0 or hundreds in operations. Setting the lower bound above 1 gives availability, and setting the upper bound reasonably protects cost · resources.
metrics — the array of which signal to decide by. The example above is type: Resource (a standard resource like CPU · memory) with target.type: Utilization (utilization), averageUtilization: 70 (an average of 70%). It adjusts replicas so that the CPU utilization average of all Pods becomes 70%.

There is one reason to set minReplicas to 2 or more rather than 1: availability. Another Pod must be there to receive traffic at the moment one Pod dies or is terminated by an update. In Chapter 12 we controlled traffic entry with the readiness probe, but that control works within a single Pod. The absence of the Pod itself has to be covered by another Pod.

The HPA algorithm — one line of a proportion #

The formula by which HPA decides the new replicas value is simple.

HPA's desired replicas

desiredReplicas = ceil( currentReplicas * (currentMetricValue / targetMetricValue) )

In words — it looks at how many times the target the current average is and grows the Pod count by that ratio. It’s clear with examples.

currentReplicas	currentMetric (CPU avg)	targetMetric	Calculation	New replicas
5	70%	70%	5 × 1.0 = 5	5 (no change)
5	140%	70%	5 × 2.0 = 10	10
5	35%	70%	5 × 0.5 = 2.5 → ceil	3
10	105%	70%	10 × 1.5 = 15	15
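The rows of the table can be reproduced with a few lines of Python. This is only a sketch of the proportion above, not the controller’s code: the function name desired_replicas and its parameters are made up for illustration, and the clamp to minReplicas / maxReplicas is taken from the manifest bounds described earlier. The real controller also leaves replicas unchanged when the ratio is within a small tolerance of 1.0, which this sketch omits.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=None):
    """Return ceil(currentReplicas * currentMetricValue / targetMetricValue).

    Illustrative helper only; the name and signature are assumptions.
    The result is clamped to [min_replicas, max_replicas] when given.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Multiply before dividing so whole-number results stay exact.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    if max_replicas is not None:
        raw = min(raw, max_replicas)
    return max(raw, min_replicas)


# The four rows of the table (metric values are CPU utilization in percent).
for current, metric in [(5, 70), (5, 140), (5, 35), (10, 105)]:
    print(current, metric, desired_replicas(current, metric, 70))
```

Passing min_replicas=2 and max_replicas=20 gives the same bounds as the hpa-cpu.yaml manifest.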

The definition of utilization, the value that goes into the numerator, matters. CPU utilization is measured relative to the Pod’s requests, not to its limits or to node capacity. If a Pod declares requests.cpu: 500m and is actually using 700m, its utilization is 140%.

This definition creates a trap that ties directly to Chapter 11: if a workload has no resources.requests, HPA’s Utilization metric doesn’t work, because the denominator is undefined. Before adopting HPA, first check that the target Deployment declares CPU and memory requests. If you skip this check, HPA stalls with its target shown as <unknown>. The finished version of the diagnostic tree is organized in Chapter 27, kubectl debugging patterns.

If you want to run it even without requests, there’s a path of setting target.type to AverageValue instead of Utilization and writing an absolute value (e.g., 200m). It compares on an absolute-value basis rather than utilization. But this shape isn’t common, and the operational standard is requests + Utilization.
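To make the role of the denominator concrete, here is a minimal Python sketch of the utilization calculation. The helper cpu_utilization_percent is hypothetical; it only mirrors the definition given above (usage divided by requests) and returns None where HPA would show <unknown>.

```python
def cpu_utilization_percent(usage_millicores, request_millicores):
    """Utilization as HPA defines it: usage relative to requests, in percent.

    Returns None when the Pod has no CPU request, mirroring the
    <unknown> that HPA shows in that case. Hypothetical helper.
    """
    if not request_millicores:
        return None  # no denominator, so utilization is undefined
    return usage_millicores * 100 / request_millicores


print(cpu_utilization_percent(700, 500))   # requests.cpu: 500m, using 700m
print(cpu_utilization_percent(700, None))  # no requests set
```

The first call reproduces the 140% case from the text. The second shows why a workload without requests gives HPA nothing to compare against.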

multi-metric — looking at several signals together #

If you put several items in the metrics array, HPA calculates desired replicas separately for each metric and adopts the largest of them. Looking at CPU and memory at the same time, it’s as follows.

hpa-cpu-memory.yaml — the metrics part

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

If CPU suggests 5 replicas but memory needs 8, HPA adopts 8. It’s the conservative choice that fits the more burdened of the two.

A workload for which this shape often has meaning is a service holding a memory cache. To keep HPA from missing the signal when the pattern of CPU being idle but memory filling up appears, you have to look at memory together too.
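The “largest wins” rule can be sketched as follows. The function desired_for_metrics and the readings passed to it are assumptions chosen to reproduce the 5-versus-8 example; the sketch applies the same proportion to each metric and keeps the maximum.

```python
import math


def desired_for_metrics(current_replicas, metrics):
    """Compute desired replicas per metric and keep the largest.

    `metrics` maps a metric name to (current_value, target_value).
    Names and structure are illustrative, not an HPA API.
    """
    per_metric = {
        name: math.ceil(current_replicas * current / target)
        for name, (current, target) in metrics.items()
    }
    return max(per_metric.values()), per_metric


# Assumed readings: CPU sits at its 70% target, memory is at 128% against 80%.
chosen, detail = desired_for_metrics(5, {"cpu": (70, 70), "memory": (128, 80)})
print(detail, "->", chosen)
```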

The asymmetry of scale up and scale down — behavior #

Applying the proportion directly on every evaluation causes two problems in operations. If Pods are removed too quickly when load drops briefly, response times spike from cold starts when load rises again. If Pods are added excessively when load spikes briefly, resources and cost leak. The field that handles both is behavior.

hpa-behavior.yaml — the behavior part

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    selectPolicy: Max

Let’s note three things.

stabilizationWindowSeconds — the stabilization window for the decision. That the scale-down default is 300 seconds (5 minutes) is the key operational safeguard. It truly reduces only when CPU keeps a low value for 5 minutes. It doesn’t reduce on a signal that drops briefly and rises again. scale up is usually left at 0 seconds so it reacts immediately.
policies — the policy of how much to change in one round. There are two kinds, Percent (a ratio of the current count) and Pods (an absolute count), and periodSeconds is that policy’s interval. The example above, for scale up, allows the larger of “100% of the current (multiply by 2) or +4” every 15 seconds.
selectPolicy: Max / Min — sets which of several policies to adopt. Max is the most aggressive change, Min the most conservative.

In one line, the asymmetry is scale up fast, scale down slowly. The accident where load spikes and response time grows is immediately visible to users, but the cost of one or two more Pods being up is negligible for a short time. On the other hand, reducing too fast causes a cold start while scaling back up, which is directly visible to users. The pattern of explicitly pinning this asymmetry with behavior is the operational standard.

If you omit behavior entirely, K8s applies reasonable defaults (immediate scale up, a 5-minute stabilization window for scale down). When first adopting HPA you can start with the defaults and then adjust them to the workload’s characteristics.
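The effect of policies, selectPolicy, and the stabilization window can be sketched in Python. This is a simplified model under stated assumptions: the function names are made up, and limit_scale_up treats the current count as the count at the start of the period, whereas the real controller also accounts for changes already made within periodSeconds.

```python
import math


def limit_scale_up(current, desired, percent=100, pods=4, select="Max"):
    """Cap a scale-up step by the Percent and Pods policies.

    Simplified: treats `current` as the replica count at the start of
    the period and ignores changes already made within periodSeconds.
    """
    by_percent = math.ceil(current * (1 + percent / 100))
    by_pods = current + pods
    if select == "Max":
        ceiling = max(by_percent, by_pods)
    else:
        ceiling = min(by_percent, by_pods)
    return min(desired, ceiling)


def stabilized_scale_down(recent_desired):
    """Within the stabilization window, act on the highest recent recommendation."""
    return max(recent_desired)


print(limit_scale_up(current=2, desired=20))    # Pods policy wins: 2 + 4
print(limit_scale_up(current=10, desired=40))   # Percent policy wins: 10 * 2
print(stabilized_scale_down([3, 8, 4, 3]))      # a brief dip does not shrink below 8
```

With selectPolicy: Max, small Deployments grow by the Pods policy and large ones by the Percent policy, which is why the two are usually listed together.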

Applying HPA and checking its behavior #

apply HPA and check status

kubectl apply -f hpa-cpu.yaml
kubectl get hpa
kubectl describe hpa web

get hpa example output

NAME   REFERENCE        TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
web    Deployment/web   55%/70%         2         20        4          5m

The 55%/70% in the TARGETS column is the current average / target value. If this value shows as <unknown>/70%, metrics-server isn’t alive or the target workload has no requests. A message like FailedGetResourceMetric also comes out in the events section of kubectl describe hpa.

The command to check the behavior once with a load test is as follows.

apply load and observe scale up

kubectl run load-gen --rm -it --image=busybox -- /bin/sh
# inside the container
while true; do wget -q -O- http://web.default.svc.cluster.local; done

Watching with kubectl get hpa -w in another terminal, you’ll see REPLICAS grow in proportion once TARGETS exceeds 70%. When you stop the load, it slowly decreases starting about 5 minutes later.

Custom metrics and KEDA — beyond CPU · memory #

There are workloads not sufficiently expressed by CPU · memory alone.

Queue consumers — workers that receive and process messages from SQS · Kafka · RabbitMQ. The queue length is the true signal, not CPU. Even when the queue is piling up, the worker’s CPU can be idle.
API gateways — requests per second (RPS) or concurrent connection count is a more direct signal than resource usage.
Event-driven workloads — function-style workloads that run only when there’s work.

If you apply only HPA’s CPU basis to these workloads, it’s a beat late from the true moment of load change, or misses the signal entirely.

Prometheus Adapter #

The first path to making HPA look at metrics beyond CPU and memory is the Prometheus Adapter. If Prometheus is installed in the cluster and the workloads expose metrics, the Prometheus Adapter serves the results of PromQL queries through K8s’s custom.metrics.k8s.io API. HPA can then use those metrics just like the standard ones.

In the manifest’s metrics array you write type: Pods or type: External to express which PromQL result to look at. We defer the deep part to a later K8s deep-dive book, but to show just the shape, it’s as follows.

custom metric example — excerpt

metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

It adjusts replicas so that each Pod’s average RPS becomes 100. The metric’s definition (http_requests_per_second) is written as PromQL in the Prometheus Adapter’s config file.

KEDA — event-driven 0 → N #

KEDA (Kubernetes Event-Driven Autoscaling) is a component that goes one step further. It solves two things HPA can’t.

0 → N scaling — standard HPA’s minReplicas must be 1 or more. You can’t reduce Pods completely to 0 when there’s no work. KEDA reduces the workload to 0 during the time the queue is empty, and brings it up to 1 when a new message arrives. It’s a big difference on the cost side.
Connecting directly to various event sources — it supports more than 50 kinds of source as built-ins, like SQS, Kafka, RabbitMQ, Redis Streams, PostgreSQL, and Prometheus. Without writing PromQL like the Prometheus Adapter, a single KEDA ScaledObject manifest gives you queue-length-based scaling.

KEDA ScaledObject — SQS example excerpt

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker
spec:
  scaleTargetRef:
    name: sqs-worker
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.ap-northeast-2.amazonaws.com/.../my-queue
        queueLength: "10"
        awsRegion: ap-northeast-2

KEDA internally creates and runs a standard HPA — when a ScaledObject is applied, a corresponding HPA is created automatically, and KEDA exposes the external metric via the K8s metrics API. Think of it as one layer of convenience stacked on top of a standard HPA. In clusters with many queue consumers or event workers, it’s becoming nearly a standard tool.
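As a rough mental model of the ScaledObject above, the following Python sketch maps a queue depth to a replica count. It assumes that queueLength means “messages one replica should handle,” so the desired count is the queue depth divided by that target, rounded up and clamped to the configured bounds. The function name and the sample depths are hypothetical, and the sketch ignores polling intervals and cooldown.

```python
import math


def keda_style_replicas(queue_messages, queue_length_target=10,
                        min_replicas=0, max_replicas=30):
    """Rough model of queue-length scaling with scale-to-zero.

    Assumption: the target is 'messages per replica', so the desired
    count is ceil(messages / target), clamped to the configured bounds.
    """
    if queue_messages <= 0:
        return min_replicas  # empty queue: fall back to the floor (0 here)
    wanted = math.ceil(queue_messages / queue_length_target)
    return max(min_replicas, min(wanted, max_replicas))


for messages in (0, 1, 95, 1000):
    print(messages, "->", keda_style_replicas(messages))
```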

VPA — automatically adjusting a Pod’s resource requests #

If HPA is the dimension of “how many Pods,” then VPA is the dimension of “the size of one Pod.” In Chapter 11 we covered the cycle where a human looks at usage data and sets requests. VPA is an attempt to automate that work — it looks at a workload’s past CPU · memory usage trend, derives a recommended value, and per a policy applies that value and recreates the Pod.

The three components — recommender / updater / admission-controller #

VPA is not a single controller but a bundle of three components.

Component	Role
recommender	Gathers metrics and derives a recommended `requests` value. Records it in the VPA object’s `status.recommendation`
updater	If the recommender’s recommended value and the current Pod’s value diverge greatly, it evicts the Pod (causing recreation)
admission-controller	Injects the recommended value via a mutating admission webhook when a new Pod is created

The three components form a cycle: the recommender computes a recommended value, the updater finds a large difference and evicts the Pod, and when the replacement Pod is created the admission-controller rewrites its requests to the recommended value. requests is thus updated to fit the workload’s actual usage without human intervention.

The VPA manifest and updatePolicy #

vpa-web.yaml

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi

The value of updateMode is the core of the operational decision. The three modes used most often are as follows.

updateMode	Behavior
`"Off"`	Only derives a recommended value, doesn’t apply it. A human looks at it and reflects it in the manifest
`"Initial"`	Applies the recommended value only at the moment a Pod is newly created. Already-running Pods stay unchanged
`"Auto"` (= Recreate)	If a large difference appears, evicts the Pod and recreates it with the new recommended value

Auto looks like the end of automation, but in operations you have to be careful. VPA evicting a Pod means that Pod has to die once and come back up. On a StatefulSet, a workload running with a single replica, or a workload with a long startup probe time from Chapter 12, an eviction directly affects availability.

When first adopting VPA, the standard pattern is to leave it at "Off". Gather recommended values over days or weeks, and only after confirming they are reasonable either move to Initial or Auto, or have a human copy the recommended values into the manifest and commit them.

check VPA recommended values

kubectl describe vpa web

the status.recommendation part — excerpt

Status:
  Recommendation:
    Container Recommendations:
      Container Name:  web
      Lower Bound:
        Cpu:     150m
        Memory:  256Mi
      Target:
        Cpu:     350m
        Memory:  512Mi
      Upper Bound:
        Cpu:     800m
        Memory:  1Gi

Target is the key recommended value. Think of Lower Bound and Upper Bound as a statistical confidence interval. If this recommended value differs greatly from the current manifest’s requests, having a human review that difference and reflect it in the manifest is the conservative operational way.

resourcePolicy’s minAllowed / maxAllowed #

In the manifest above, resourcePolicy’s minAllowed and maxAllowed are the lower and upper bounds of the recommended value. Without this safety net, VPA can recommend requests that are too small after observing an idle time slot, or too large after observing a temporary memory-leak pattern. In operations it’s good to almost always set both values.
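The effect of the two bounds amounts to a clamp, sketched here in Python. The helper clamp_recommendation is hypothetical, and the second input (20m CPU, 6Gi memory) is an invented example of an idle-slot and a leak-driven recommendation. The bounds match vpa-web.yaml.

```python
def clamp_recommendation(recommended, min_allowed, max_allowed):
    """Keep each resource of a recommendation within [minAllowed, maxAllowed].

    Values are plain integers: CPU in millicores, memory in MiB.
    Hypothetical helper that only illustrates the capping.
    """
    return {
        resource: max(min_allowed[resource], min(value, max_allowed[resource]))
        for resource, value in recommended.items()
    }


MIN_ALLOWED = {"cpu": 50, "memory": 64}        # 50m, 64Mi
MAX_ALLOWED = {"cpu": 2000, "memory": 4096}    # 2 cores, 4Gi

print(clamp_recommendation({"cpu": 350, "memory": 512}, MIN_ALLOWED, MAX_ALLOWED))
print(clamp_recommendation({"cpu": 20, "memory": 6144}, MIN_ALLOWED, MAX_ALLOWED))
```

A recommendation inside the bounds passes through unchanged. One outside is pulled back to the nearest bound.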

A cluster where VPA isn’t installed #

Unlike HPA, whose controller ships with the K8s core, VPA is not included. On most clusters, EKS among them, it needs a separate install, usually with the official GitHub manifests or a Helm chart. Some managed services offer it as a built-in option, GKE being the best-known example.

The conflict of HPA and VPA — don’t put both on the same metric #

Let’s note one trap frequently seen in operations. If you put a CPU-based HPA and a CPU-based VPA on the same workload at the same time, oscillation occurs. The reason is simple.

CPU load rises. HPA grows the Pod count per the proportion.
Since Pods increased, the average CPU usage per Pod drops.
VPA (Auto) looks at that dropped usage and decides “I should reduce requests.” It lowers the recommended value and recreates the Pod.
Since requests dropped, dividing the same usage by the denominator makes the utilization (Utilization) rise again. HPA grows Pods again.

It’s a cycle that doesn’t stop oscillating. There are two avoidance patterns.

Separate HPA’s and VPA’s metrics — for example, HPA on CPU, VPA on memory. The two cycles don’t shake each other’s denominator · numerator.
VPA at updateMode: "Off" — derive only recommended values, don’t auto-apply. A human reviews and reflects them in the manifest. HPA works unchanged.

Most production clusters go with the second pattern. HPA takes responsibility for dynamic load adjustment, VPA is left as a recommended-value tool, and a human reflects them into the manifest once per quarter or so. This separation is the safest starting line.

Cluster Autoscaler — node-level adjustment #

Even if HPA adds Pods, they stall in Pending when no node has room for them. This is the state where the schedulability condition from Chapter 11 isn’t satisfied: a node’s headroom, its allocatable minus the sum of already-reserved requests, must be at least the new Pod’s requests. What fills this gap is the Cluster Autoscaler.

The behavior model #

CA’s behavior is simple in two directions.

scale up — if it sees a Pod in the Pending state, it calls the cloud API to add a node to the node group that can take that Pod’s requests. On AWS it grows the ASG’s desired capacity, on GCP the MIG, on Azure the VMSS.
scale down — if there’s a node with low utilization for a certain time, it moves the Pods on that node to other nodes and terminates the node. If there are Pods that can’t be moved (e.g., a case where a PV is attached only to that node, or a Pod with the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation), it leaves that node unchanged.

CA’s decision is based not on metrics but on requests and schedulability. Even if actual usage is idle, if the sum of requests is filling the nodes, a new Pod becomes Pending and CA adds a node. This model is at exactly the same layer as the expression in Chapter 11, “requests is the scheduler’s true currency.”
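The requests-based decision can be sketched as a simple check. This Python example is an illustration under assumptions: it looks at CPU only, the function names and node figures are invented, and the real CA additionally simulates whether a node from some node group would actually fit the Pod.

```python
def fits_on_node(allocatable, reserved_requests, pod_request):
    """True when allocatable minus already-reserved requests covers the Pod.

    All values are CPU millicores; actual usage plays no part.
    """
    return allocatable - sum(reserved_requests) >= pod_request


def needs_new_node(nodes, pod_request):
    """A Pod stays Pending, and CA would add a node, if no node has room."""
    return not any(fits_on_node(alloc, reserved, pod_request)
                   for alloc, reserved in nodes)


# Assumed cluster: two nodes with 2000m allocatable, mostly reserved by requests.
nodes = [(2000, [500, 500, 500]), (2000, [500, 500, 500, 300])]
print(needs_new_node(nodes, 500))   # first node still has 500m of headroom
print(needs_new_node(nodes, 600))   # no node has 600m left
```

Nothing in the check reads actual usage, so idle Pods with large requests still trigger a node addition.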

Node groups per cloud environment #

CA works in tandem with a cloud provider. The per-environment mapping is as follows.

Cloud	Node-group abstraction	Note
AWS EKS	Auto Scaling Group (ASG) or EKS managed node group	A separate ASG per availability zone recommended
GCP GKE	Managed Instance Group (MIG)	Turned on with an option per cluster or node pool. GKE Autopilot abstracts the nodes themselves
Azure AKS	Virtual Machine Scale Set (VMSS)	Enabled as an AKS cluster option
On-prem	Cluster API + provider	Varies by environment

In the case of EKS, CA is usually installed with a Helm chart. A one-time setup is needed to attach the right tags to the ASG so CA can discover and manage it. GKE turns it on with a one-line option at cluster creation.

Karpenter — EKS’s faster alternative #

CA’s design goes through the cycle of “raise the ASG’s desired capacity by one → a node is created from that ASG’s launch template → the kubelet registers it with the cluster.” The constraint is that the node spec is predefined in the ASG. If a Pending Pod requires a lot of memory but the ASG’s instance type can only produce small nodes, the Pod stays Pending even after the new node comes up.

Karpenter is settling in as a faster alternative to CA in the EKS environment. Karpenter’s differences are two.

Decides the node spec dynamically by looking at the Pending Pod — rather than a predefined ASG, it picks on the spot the instance type that best fits the Pending Pod’s requests and tolerations and brings it up directly via the EC2 API.
Fast provisioning — since it doesn’t go through the ASG step, the time from a node coming up to joining the cluster is usually shorter.
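The difference between a fixed node group and choosing the node spec per Pod can be illustrated with a toy selection function. Everything here is hypothetical: the catalog entries, sizes, and prices are invented, and Karpenter’s actual selection weighs many more constraints (tolerations, zones, capacity type, bin-packing across several Pods).

```python
def pick_instance_type(catalog, cpu_millicores, memory_mib):
    """Return the cheapest catalog entry that fits the Pod, or None.

    `catalog` is a list of (name, cpu_millicores, memory_mib, hourly_cost).
    The entries below are invented for illustration, not real instance types.
    """
    candidates = [entry for entry in catalog
                  if entry[1] >= cpu_millicores and entry[2] >= memory_mib]
    return min(candidates, key=lambda entry: entry[3])[0] if candidates else None


CATALOG = [
    ("small", 2000, 4096, 0.05),
    ("medium", 4000, 16384, 0.20),
    ("memory-large", 4000, 32768, 0.30),
]
FIXED_GROUP = [CATALOG[0]]  # a node group pinned to one small type

pod = {"cpu": 1000, "memory": 20480}  # a Pending Pod that requests 20Gi
print(pick_instance_type(FIXED_GROUP, pod["cpu"], pod["memory"]))
print(pick_instance_type(CATALOG, pod["cpu"], pod["memory"]))
```

With the pinned group no candidate fits, which corresponds to the Pod that stays Pending however many nodes are added. With the full catalog a large-memory type is chosen.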

In new EKS clusters, the move toward Karpenter instead of CA is growing. The equivalent tools in GKE · AKS aren’t yet as settled as EKS’s Karpenter. Karpenter setup and cost-optimization patterns in the EKS environment are covered in Chapter 21, EKS cluster setup and Chapter 28, Cost optimization.

Common reasons CA doesn’t work #

Let’s note a few patterns where CA doesn’t work as intended.

The node group has no cluster-autoscaler-related tag — in the case of AWS, the ASG (not the individual node) must carry a tag like k8s.io/cluster-autoscaler/enabled for CA to treat that ASG as a managed target.
The PodDisruptionBudget is too strict — even if it tries to move Pods on scale down, if the PDB blocks it, the node can’t be killed and it doesn’t shrink. The PDB operations manual is covered in Chapter 30, Upgrade strategy.
The cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation — CA does not terminate a node that has a Pod with this annotation. It’s often attached to system Pods or Pods that depend on a local disk.
The Pending reason isn’t a resource shortage — if it’s Pending because of a nodeSelector or affinity mismatch, or a PV’s AZ mismatch (the part that Chapter 9, PV / PVC / StorageClass’s WaitForFirstConsumer solves), then even bringing up more nodes leaves that Pod still Pending. It’s outside CA’s responsibility.

Looking at the events section of kubectl describe pod together with the CA Pod’s logs (for example kubectl logs -n kube-system -l app=cluster-autoscaler; the label depends on how CA was installed) distinguishes which reason it is.

The collaboration of the three dimensions — handling a load surge #

Let’s follow the three autoscalers together in one scenario — right after the start of a marketing campaign that brings in five times the usual traffic.

a load surge

t=0s    Campaign starts. 5x traffic comes in.
        Deployment 'web': replicas=4, requests.cpu=500m
        All Pods' CPU average 130% (vs target 70%)

t=15s   HPA collects metrics and calculates.
        desired = ceil(4 * (130/70)) = 8
        replicas changes from 4 -> 8.

t=20s   K8s tries to create 4 new Pods.
        But the nodes' available CPU is short.
        2 new Pods Running, 2 Pending.

t=30s   CA discovers the Pending Pods.
        Requests desired capacity +1 to the ASG.
        A new node starts booting in the cloud.

t=120s  The new node joins the cluster as Ready.
        The 2 Pods that were Pending are scheduled on that node.
        Running.

t=135s  HPA measures again. Average 80%.
        desired = ceil(8 * (80/70)) = 10. replicas: 8 -> 10.
        2 more needed — this time they fit in the new node's headroom.
        ...

When the campaign ends and traffic returns to usual, it reduces in reverse — after HPA’s scale-down stabilization window (5 minutes) the Pods slowly reduce, and CA discovers the emptying nodes and terminates them. Node termination usually starts after a low-utilization state is held for a certain time (the default is about 10 minutes), so it’s more conservative.

The key is that this cycle runs without human intervention. But the premises it depends on all stand on the model built from Chapter 11 onward: the workload declares requests, metrics-server is alive, HPA’s behavior is reasonable, the ASG carries the CA tags, and the node instance type fits the workload. Think of autoscaling as the last layer, the one that drives that model dynamically.

The operational adoption pattern — where to start #

Turning on all three autoscalers at the same time may look good, but the recommended operational flow is conservative.

Adopt HPA first — it’s the most familiar and has the least accident risk. Check whether the target workload has requests written, set minReplicas ≥ 2, and start from a standard value like CPU 70%. Observe the behavior over days · weeks and adjust behavior.
VPA only for recommended values at updateMode: "Off" — don’t turn on the Pod-recreating policy from the start. Gather the recommender’s recommended values for a few days, and once you judge them reasonable, have a human reflect them in the manifest. Moving to Auto is done only when you’re confident the eviction impact of that workload is small.
CA is nearly mandatory in a cloud environment — it’s meaningless in a learning environment like minikube · kind, but operating a cloud cluster without CA means a human must chase the nodes’ desired capacity each time. On EKS, the pattern of installing CA (or Karpenter) together from the start is standard.
Custom metric / KEDA to fit the workload’s characteristics — there’s no reason to forcibly adopt the Prometheus Adapter for a workload sufficiently expressed by CPU · memory signals. Adopt it only for workloads whose signal type differs, like queue consumers or event workers.

If we shrink this flow to one line — HPA is a default on nearly all workloads, VPA starts as a recommended-value tool, CA is mandatory if cloud, and KEDA goes where it’s needed.

Exercises #

After applying the body’s hpa-cpu.yaml, run a load test with kubectl run load-gen --rm -it --image=busybox -- /bin/sh. Watching kubectl get hpa -w and kubectl get pods -w together in another terminal, record in time order the utilization change in the TARGETS column and how REPLICAS grows per the proportion. Organize in one paragraph how the time it takes for scale down after you stop the load meshes with the 5-minute stabilization window of §“The asymmetry of scale up and scale down.”
Put an HPA on a Deployment that’s missing requests. Record what value the TARGETS of kubectl get hpa shows, and what message comes out in the Events of kubectl describe hpa. Organize in one paragraph, in your own words, the behavior when the denominator (requests) noted in §“The HPA algorithm” isn’t defined, and note how it connects to the resource model of Chapter 11.
Write as a step-by-step simulation (in the form t=0, t=30s, t=60s, t=90s, etc.) how the oscillation noted in §“The conflict of HPA and VPA” is created when you put a CPU-based HPA and a CPU-based VPA (updateMode: Auto) on the same workload together. Then organize in a table at which step each of the two avoidance patterns (metric separation / VPA Off) cuts off the oscillation.

In one line: The three dimensions of automatic adjustment that absorb operational load variation without a human chasing it roll as a complementary relationship of HPA (Pod count), VPA (Pod requests / limits), and Cluster Autoscaler (node count). HPA’s proportion has requests as the denominator, so Chapter 11’s resource model is the premise, and the asymmetry of scale up immediate / scale down 5-minute stabilization is the standard shape that prevents cold-start accidents. Putting HPA · VPA on the same metric together causes oscillation, so it’s safe to start VPA as a recommended-value tool at updateMode: "Off". CA and Karpenter automatically add nodes by looking at Pending Pods.

Next chapter #

The series up through this chapter follows the model of how to run a workload. The controllers of Chapter 8, the persistent data of Chapter 9, the external entry point of Chapter 10, the resource requests of Chapter 11, the health signals of Chapter 12, and this chapter’s automatic adjustment together form the model for running a single workload on a production cluster and keeping it rolling.

The next chapter moves the viewpoint up a tier, to the policy of an environment where several users, teams, and workloads share one cluster: the RBAC permission model of who can perform which action on which object, the NetworkPolicy that controls Pod-to-Pod network communication with a whitelist, and the ResourceQuota and LimitRange that bound how much cluster resource one namespace can use. These three are the standard safety net of a multi-tenant production cluster.

Chapter 14, RBAC / NetworkPolicy / ResourceQuota wraps up Part 2 by following the manifests and behavior of these three objects and the recommended operational patterns.