K8s Intermediate #6: Autoscaling — HPA / VPA / Cluster Autoscaler

Monday, April 27, 2026

The sixth post in the K8s Intermediate series. The series through #5 told the story of how a single Pod stands up. In #4 we set how much resource a Pod requests and how much it can use, and in #5 we expressed, via three kinds of probes, whether that Pod is alive, ready to take traffic, or still coming up. The single-Pod model wraps up there. But load in an operational cluster swings well beyond what that model covers: lunchtime traffic doubles, in the early morning it drops to 1/10, and on the day a marketing campaign runs it can be five times normal all day long. Keeping up with those swings manually via kubectl scale deployment ... --replicas=... doesn’t last long. This post brings together the three dimensions of automatic adjustment that fill that gap: HPA, VPA, and Cluster Autoscaler.

This series is K8s Intermediate, 7 posts.

#1 StatefulSet / DaemonSet / Job / CronJob — Controllers beyond Deployment
#2 PV / PVC / StorageClass — the persistent data model
#3 Ingress and Ingress Controller — the external entry point
#4 resources.requests / limits — Pod resource requests and limits
#5 Health checks — liveness / readiness / startup probes
#6 Autoscaling — HPA / VPA / Cluster Autoscaler ← this post
#7 RBAC / NetworkPolicy / ResourceQuota — security and resource policy

What autoscaling resolves #

Patterns of load swing in operational clusters are usually one of three — time-of-day swings (day vs night), event-driven spikes (campaigns, sales, news), and accumulating increases as workloads are added. Manual operation usually goes through these stages before hitting a limit:

At first, setting replicas generously is enough. Always keep about double the normal up.
Over time, you learn that the “generous value” is short in some hours and wasteful in others: capacity is missing at the peaks, and cost leaks in the troughs.
Someone writes a cron applying different replicas for three slots — weekday day, night, weekend. It runs for a while.
When a campaign comes in or an external traffic spike incident happens, a person wakes up at dawn to type kubectl scale. Soon that becomes recurring.

K8s addresses this problem with three autoscalers. Each adjusts a different axis automatically.

Autoscaler	What it adjusts	Signal	Target
HPA (Horizontal Pod Autoscaler)	Pod count (`replicas`)	CPU/memory utilization, custom metrics	Deployment / StatefulSet
VPA (Vertical Pod Autoscaler)	Pod resource requests/limits (`requests` / `limits`)	Past CPU/memory usage trends	Deployment / StatefulSet
Cluster Autoscaler (CA)	Node count	Pods in `Pending` state, empty nodes	Cloud node groups (ASG / MIG / VMSS)

What matters is that the three axes are complementary. Even if HPA adds Pods, when nodes have no headroom, the new Pods stop in Pending. CA then adds more nodes. In a separate cycle, VPA surfaces a recommendation like “this workload actually needs about 1Gi of memory.” When all three run together, load swings are absorbed without human intervention.

The metrics-server precondition #

For autoscaling to run, there must be a component inside the cluster that reports current resource usage. K8s itself doesn’t hold those metrics directly. Instead it provides a standardized interface (metrics.k8s.io), and the component that fills that interface is installed separately into the cluster. The most common implementation is metrics-server.

metrics-server periodically scrapes the kubelet’s /metrics/resource endpoint on each node in the cluster and holds node and Pod CPU/memory usage in memory. kubectl top and the HPA controller query those values via API.

Check whether metrics-server is installed

kubectl top nodes
kubectl top pods -A

Output example

NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1      450m         22%    1.8Gi           45%
node-2      320m         16%    1.5Gi           37%

If values appear, metrics-server is alive; messages like error: Metrics API not available mean it’s not installed or dead. Installation by environment:

Environment	metrics-server status
minikube	Activated by `minikube addons enable metrics-server`
kind	Manual installation needed (`kubectl apply -f` or Helm)
EKS	Manual installation needed. Helm or official manifest
GKE	Enabled by default
AKS	Enabled by default

EKS does not include metrics-server right after cluster creation. To use HPA/VPA, it is the first component to install. To run HPA on custom metrics like queue length or request count beyond CPU/memory, components like Prometheus + Prometheus Adapter or KEDA take that role instead of (or alongside) metrics-server — covered later.

HPA — auto-adjusting Pod count #

The autoscaler used most often and adopted first is HPA. In this model, instead of a person writing the replicas field, K8s fills it in automatically by looking at the average of a metric across Pods.

HPA manifest — CPU baseline #

The simplest shape is by CPU utilization.

hpa-cpu.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Walking the key fields one by one:

apiVersion: autoscaling/v2 — HPA’s current stable version. v1 can only handle CPU and is tied to a single metric. From v2, multi-metric, custom metrics, and asymmetric scale up/down behavior (behavior) are all possible. New manifests almost always use v2.
scaleTargetRef — the target whose replicas is to be adjusted. Can attach to Deployment / StatefulSet / ReplicaSet. DaemonSet is tied to node count, so it’s not an HPA target.
minReplicas / maxReplicas — lower and upper bounds of automatic adjustment. A safety net to prevent unintentionally going to 0 or hundreds in operation. Setting the lower bound above 1 ensures availability; setting the upper bound reasonably protects cost and resources.
metrics — the array of signals to decide on. The example above is type: Resource (standard resources like CPU/memory) with target.type: Utilization and averageUtilization: 70 (70% average). Adjusts replicas so the average CPU utilization across all Pods is 70%.

One reason for setting minReplicas to 2 or more rather than 1: availability requires that another Pod be available to take traffic even when one Pod dies or is terminated during an update. In #5 we controlled traffic entry via readiness probes, but that only governs a single Pod’s readiness. The absence of the Pod itself must be covered by another Pod.

HPA algorithm — one ratio formula #

The formula by which HPA decides a new replicas value is simple.

HPA's desired replicas

desiredReplicas = ceil( currentReplicas * (currentMetricValue / targetMetricValue) )

In words — see how many times the target the current average is and scale Pod count by that ratio. Examples make it clear.

currentReplicas	currentMetric (avg CPU)	targetMetric	Calculation	New replicas
5	70%	70%	5 × 1.0 = 5	5 (unchanged)
5	140%	70%	5 × 2.0 = 10	10
5	35%	70%	5 × 0.5 = 2.5 → ceil	3
10	105%	70%	10 × 1.5 = 15	15
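The table rows can be reproduced with a few lines of Python. This is a minimal sketch of the ratio formula only; the function name desired_replicas is made up for illustration, and the real controller layers more on top (minReplicas/maxReplicas bounds and the behavior policies covered later).

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Apply the ratio formula: ceil(currentReplicas * (currentMetric / targetMetric))."""
    ratio = current_metric / target_metric
    return math.ceil(current_replicas * ratio)


# The four rows of the table: (currentReplicas, avg CPU %, target %)
rows = [(5, 70, 70), (5, 140, 70), (5, 35, 70), (10, 105, 70)]
results = [desired_replicas(*row) for row in rows]
print(results)  # [5, 10, 3, 15]
```

Because of the ceil, a fractional result always rounds up, which is why the third row lands on 3 rather than 2.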

The definition of utilization (Utilization) in the numerator matters. CPU utilization is the ratio against the Pod’s requests. A Pod holding requests.cpu: 500m and actually using 700m has 140% utilization.

This definition creates one trap directly tied to #4: if a workload has no resources.requests, HPA’s Utilization metric doesn’t work, because the denominator is undefined. Before adopting HPA, verify that CPU/memory requests are set on the target Deployment. Skipping this check leaves the HPA’s TARGETS column stuck at <unknown>.

If you want to run without requests, there’s a path of setting target.type to AverageValue instead of Utilization and writing an absolute value (e.g., 200m). Compare by absolute value rather than utilization. But this shape isn’t common; the operational standard is requests + Utilization.
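The difference between the two target types can be sketched as follows. Both function names are hypothetical and the numbers are illustrative (millicores); the point is only that Utilization needs requests as its denominator while AverageValue compares absolute usage.

```python
import math
from typing import Optional


def cpu_utilization_percent(usage_m: float, request_m: Optional[float]) -> Optional[float]:
    """Utilization = usage / requests, in percent. None when no request is set."""
    if not request_m:
        return None  # denominator undefined: this is the <unknown> case
    return usage_m * 100 / request_m


def desired_by_average_value(current_replicas: int, avg_usage_m: float, target_m: float) -> int:
    """AverageValue compares absolute per-Pod usage with the target, so no requests are needed."""
    return math.ceil(current_replicas * avg_usage_m / target_m)


print(cpu_utilization_percent(700, 500))      # 140.0
print(cpu_utilization_percent(700, None))     # None
print(desired_by_average_value(4, 300, 200))  # 6
```

The last call models 4 Pods averaging 300m against an AverageValue target of 200m.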

multi-metric — looking at multiple signals together #

Putting multiple items in the metrics array, HPA computes desired replicas separately for each metric and adopts the largest of them. Looking at CPU and memory at the same time:

hpa-cpu-memory.yaml — metrics part

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

If 5 replicas are enough by CPU but 8 are needed by memory, HPA adopts 8. It is the conservative choice: match the more burdened side.
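A minimal sketch of that selection, assuming each metric is given as a (current, target) pair. The function names and the sample memory value of 128% are illustrative, not taken from a real cluster. The result is also clamped to minReplicas/maxReplicas, as the safety-net fields described earlier imply.

```python
import math


def desired_for_metric(current_replicas: int, current_value: float, target_value: float) -> int:
    """Ratio formula for a single metric."""
    return math.ceil(current_replicas * current_value / target_value)


def desired_multi_metric(current_replicas, metrics, min_replicas, max_replicas):
    """Compute a proposal per metric, take the largest, then clamp to the bounds."""
    proposals = [desired_for_metric(current_replicas, cur, tgt) for cur, tgt in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))


# CPU at 70% of a 70% target proposes 5; memory at 128% of an 80% target proposes 8.
metrics = [(70, 70), (128, 80)]
print(desired_multi_metric(5, metrics, min_replicas=2, max_replicas=20))  # 8
print(desired_multi_metric(5, metrics, min_replicas=2, max_replicas=6))   # 6
```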

This shape often becomes meaningful for workloads holding memory caches. When you see a pattern of CPU being idle but memory filling up, watching memory together prevents HPA from missing that signal.

Asymmetry of scale up vs scale down — behavior #

Applying the ratio formula immediately on every evaluation would cause two operational problems. One is scaling Pods down too quickly when load briefly drops, causing cold-start response spikes when load rises again. The other is scaling Pods up too aggressively when load briefly spikes, wasting resources and cost. The field that handles both is behavior.

hpa-behavior.yaml — behavior part

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    selectPolicy: Max

Three points to flag:

stabilizationWindowSeconds — the decision stabilization window. The default 300 seconds (5 min) for scale down is the operational core safety device. Only really shrink when CPU stays low for 5 minutes. Don’t shrink on signals that briefly drop and come back. scale up usually leaves it at 0 for immediate reaction.
policies — policies on how much to change in one round. Two kinds: Percent (ratio of current count) and Pods (absolute count), with periodSeconds as that policy’s cycle. The example above for scale up allows the larger of “100% of current (×2) or +4 Pods” every 15 seconds.
selectPolicy: Max / Min — which policy to adopt among multiple. Max is the most aggressive change, Min the most conservative.
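How the two scale-up policies combine can be sketched like this. It is a simplification: it treats the current replica count as the count at the start of the period, whereas the real controller tracks changes already made within periodSeconds. The function name scale_up_limit is hypothetical.

```python
import math


def scale_up_limit(current_replicas: int, percent: int, pods: int, select_policy: str = "Max") -> int:
    """Upper bound on replicas reachable in one period under a Percent and a Pods policy."""
    by_percent = math.ceil(current_replicas * (1 + percent / 100))  # e.g. 100% means x2
    by_pods = current_replicas + pods                               # e.g. +4 Pods
    pick = max if select_policy == "Max" else min
    return pick(by_percent, by_pods)


# Policies from the manifest above: Percent 100 and Pods 4, every 15 seconds.
print(scale_up_limit(2, 100, 4))          # 6  (the +4 Pods policy wins for small counts)
print(scale_up_limit(10, 100, 4))         # 20 (the 100% policy wins for larger counts)
print(scale_up_limit(10, 100, 4, "Min"))  # 14

# The ratio formula's result is then capped by the limit for this period.
desired = 30
print(min(desired, scale_up_limit(10, 100, 4)))  # 20
```

The pairing of a Percent and a Pods policy under Max is what lets small Deployments grow quickly in absolute terms while large ones grow proportionally.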

The operational meaning of asymmetry comes down to one line — scale up fast, scale down slowly. Response time degradation on a load spike is immediately visible to users, but the cost of one or two extra Pods for a short period is negligible. Conversely, scaling down too quickly causes cold-start latency spikes that are equally visible to users. Encoding this asymmetry explicitly with behavior is the standard operational pattern.

If you don’t write behavior at all, K8s’s reasonable defaults (immediate scale up, 5-min stabilization for scale down) apply. Starting with defaults at first adoption and adjusting per workload characteristics is the usual flow.
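The effect of the scale-down stabilization window can be modeled as "use the highest recommendation seen within the window". The sketch below is a toy model under that assumption, with a hypothetical class name and hand-picked timestamps in seconds. It is not the controller's actual code.

```python
from collections import deque


class ScaleDownStabilizer:
    """Hold the replica count at the highest recommendation seen inside the window."""

    def __init__(self, window_seconds: int = 300):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp, recommended replicas), oldest first

    def stabilize(self, now: int, recommendation: int) -> int:
        self.history.append((now, recommendation))
        # Drop recommendations that have aged out of the window.
        while self.history[0][0] <= now - self.window_seconds:
            self.history.popleft()
        return max(rec for _, rec in self.history)


stabilizer = ScaleDownStabilizer(window_seconds=300)
timeline = [(0, 10), (60, 4), (290, 4), (310, 4), (320, 12)]
applied = [stabilizer.stabilize(t, rec) for t, rec in timeline]
print(applied)  # [10, 10, 10, 4, 12]
```

The drop to 4 only takes effect once the earlier recommendation of 10 has left the 300-second window, while the later jump to 12 is applied immediately. That is the "scale up fast, scale down slowly" asymmetry in miniature.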

HPA apply and behavior check #

HPA apply and status check

kubectl apply -f hpa-cpu.yaml
kubectl get hpa
kubectl describe hpa web

get hpa output example

NAME   REFERENCE        TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
web    Deployment/web   55%/70%         2         20        4          5m

55%/70% in the TARGETS column is the current avg / target. If this shows <unknown>/70%, either metrics-server isn’t alive or the target workload has no requests. Messages like FailedGetResourceMetric appear together in the events section of kubectl describe hpa.

A command to verify behavior with a load test:

Apply load to observe scale up

kubectl run load-gen --rm -it --image=busybox -- /bin/sh
# inside the container
while true; do wget -q -O- http://web.default.svc.cluster.local; done

In another terminal with kubectl get hpa -w, you’ll see REPLICAS grow per the ratio formula from the moment TARGETS exceeds 70%. Stopping the load, it slowly shrinks starting about 5 minutes later.

Custom metrics and KEDA — beyond CPU/memory #

There are workloads not sufficiently expressed by CPU/memory.

Queue consumers — workers receiving and processing messages from SQS/Kafka/RabbitMQ. Queue length is the real signal, not CPU. Workers’ CPU can be idle even while the queue piles up.
API gateways — RPS or concurrent connections are more direct signals than resource use.
Event-driven workloads — function-style workloads that run only when there’s work.

Applying only HPA’s CPU baseline to these workloads, you’re a beat behind the real inflection point of load, or you miss the signal entirely.

Prometheus Adapter #

The first path to having HPA see metrics beyond CPU/memory is Prometheus Adapter. If Prometheus is installed in the cluster and workloads expose metrics, Prometheus Adapter exposes the PromQL results from that Prometheus to K8s’s custom.metrics.k8s.io API. HPA can then use those metrics like standard metrics.

In the manifest’s metrics array, you write type: Pods or type: External to express which PromQL result to look at. Going deep is deferred to the K8s advanced track, but just to show the shape:

custom metric example — excerpt

metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

Adjust replicas so each Pod’s average RPS is 100. The metric definition (http_requests_per_second) is written as PromQL in Prometheus Adapter’s config.
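With a Pods metric and an AverageValue target, the same ratio formula applies to the per-Pod average. A minimal sketch with made-up RPS readings and a hypothetical function name:

```python
import math


def desired_for_pods_metric(per_pod_values, target_average: float) -> int:
    """Average the metric across Pods, then apply the ratio formula."""
    current_replicas = len(per_pod_values)
    average = sum(per_pod_values) / current_replicas
    return math.ceil(current_replicas * average / target_average)


# Four Pods currently serving these requests per second; the target is 100 per Pod.
rps = [180, 150, 170, 140]
print(desired_for_pods_metric(rps, target_average=100))  # 7
```

Total traffic is 640 RPS, so 7 Pods are needed to bring the per-Pod average down to 100 or below.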

KEDA — event-driven 0→N #

KEDA (Kubernetes Event-Driven Autoscaling) is a step further. It resolves two things HPA can’t:

0 → N scaling — standard HPA’s minReplicas must be 1 or more. Pods can’t be scaled fully to 0 when there’s no work. KEDA shrinks workloads to 0 during idle queue periods, and brings up to 1 when a new message arrives. A big difference in cost.
Direct connection to diverse event sources — over 50 sources like SQS, Kafka, RabbitMQ, Redis Streams, PostgreSQL, Prometheus are built in. Without writing PromQL like Prometheus Adapter, queue-length-based scaling works with one KEDA ScaledObject manifest.

KEDA ScaledObject — SQS example excerpt

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker
spec:
  scaleTargetRef:
    name: sqs-worker
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.ap-northeast-2.amazonaws.com/.../my-queue
        queueLength: "10"
        awsRegion: ap-northeast-2

KEDA internally creates a standard HPA — when a ScaledObject is applied, a corresponding HPA is auto-created and KEDA exposes external metrics to the K8s metrics API. Think of it as a convenience layer on top of standard HPA. In clusters with many queue consumers or event workers, it is becoming nearly the standard tool.
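The scaling arithmetic behind the ScaledObject above can be approximated as follows. This is a simplified model under stated assumptions: queueLength is treated as the target number of messages per replica, and KEDA's polling and cooldown timing are ignored. The function name is hypothetical.

```python
import math


def keda_like_replicas(queue_messages: int, target_per_replica: int = 10,
                       min_replicas: int = 0, max_replicas: int = 30) -> int:
    """Rough replica count for a queue consumer scaled on queue length."""
    if queue_messages <= 0:
        return min_replicas  # idle queue: scale all the way down (0 here)
    wanted = math.ceil(queue_messages / target_per_replica)
    # At least one replica while there is work, never more than the maximum.
    return max(1, min_replicas, min(max_replicas, wanted))


for messages in (0, 3, 95, 1000):
    print(messages, keda_like_replicas(messages))
# 0 0
# 3 1
# 95 10
# 1000 30
```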

VPA — auto-adjusting Pod resource requests #

If HPA is the dimension of “how many Pods,” VPA is the dimension of “the size of one Pod.” In #4 we covered the process where a person looks at usage data and sets requests. VPA is an attempt to automate that work — it computes recommended values from the workload’s past CPU/memory usage trends and, depending on policy, applies those values by recreating Pods.

Three components — recommender / updater / admission-controller #

VPA is not a single controller but a bundle of three components.

Component	Role
recommender	Gathers metrics and computes recommended `requests` values. Records them in the VPA object’s `status.recommendation`
updater	If the recommender’s recommendation and the current Pod’s value diverge significantly, evicts the Pod (causes recreation)
admission-controller	When a new Pod is created, injects the recommended values via mutating admission webhook

The three components form a cycle — recommender computes recommendations, updater finds large discrepancies and kills Pods, and when new Pods are created, admission-controller starts them with manifests reflecting the recommendations. requests get refreshed to match actual workload usage without human intervention.

VPA manifest and updatePolicy #

vpa-web.yaml

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi

The three values of updateMode are the heart of operational decision-making.

updateMode	Behavior
`"Off"`	Compute recommendations only, don’t apply. A person reads them and reflects in the manifest
`"Initial"`	Apply recommendations only at the moment a new Pod is created. Already-up Pods unchanged
`"Auto"` (= Recreate)	When large discrepancies are seen, evict the Pod and recreate with the new recommendations

Auto looks like the end point of automation, but it calls for caution in operation. When VPA evicts a Pod, that Pod is terminated and a replacement comes up. In StatefulSets, single-replica workloads, or workloads with long startup probe times from #5, eviction directly impacts availability.

When first adopting VPA, the standard pattern is almost always to start with "Off". After collecting recommendations for days or weeks and confirming they’re reasonable, then move to Initial or Auto, or have a person incorporate those recommendations into manifests and commit.

Check VPA recommendations

kubectl describe vpa web

status.recommendation part — excerpt

Status:
  Recommendation:
    Container Recommendations:
      Container Name:  web
      Lower Bound:
        Cpu:     150m
        Memory:  256Mi
      Target:
        Cpu:     350m
        Memory:  512Mi
      Upper Bound:
        Cpu:     800m
        Memory:  1Gi

Target is the key recommended value. Lower Bound and Upper Bound can be seen as statistical confidence intervals. If this recommendation differs significantly from the current manifest’s requests, having a person review the difference and reflect it in the manifest is the conservative operational approach.

resourcePolicy’s minAllowed / maxAllowed #

In the manifest above, minAllowed and maxAllowed in resourcePolicy set lower and upper bounds on recommendations. Without this safety net, VPA can recommend requests that are too small based on off-peak values, or too large based on a transient memory leak pattern. In practice, always writing both values is recommended.
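The effect of the two bounds is a plain clamp. A minimal sketch, assuming CPU is expressed in millicores and memory in MiB; the raw recommendation values are invented for illustration, and the bounds mirror the manifest above (50m / 64Mi and 2 CPU / 4Gi).

```python
def clamp_recommendation(raw: dict, min_allowed: dict, max_allowed: dict) -> dict:
    """Pull each recommended value into the [minAllowed, maxAllowed] range."""
    return {
        name: max(min_allowed[name], min(max_allowed[name], value))
        for name, value in raw.items()
    }


min_allowed = {"cpu_m": 50, "memory_mi": 64}
max_allowed = {"cpu_m": 2000, "memory_mi": 4096}

# Off-peak sample: CPU recommendation too small, so it is raised to the lower bound.
off_peak = clamp_recommendation({"cpu_m": 20, "memory_mi": 512}, min_allowed, max_allowed)
# Leak-like sample: memory recommendation too large, so it is capped at the upper bound.
leaky = clamp_recommendation({"cpu_m": 350, "memory_mi": 6000}, min_allowed, max_allowed)
print(off_peak)  # {'cpu_m': 50, 'memory_mi': 512}
print(leaky)     # {'cpu_m': 350, 'memory_mi': 4096}
```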

Clusters where VPA isn’t installed #

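Unlike the HPA controller, which is built into K8s, VPA isn’t included in K8s itself and is not on by default in EKS, GKE, or AKS. On EKS it is installed separately, usually via the official GitHub manifests or a Helm chart. GKE offers a managed option that is enabled per cluster, and AKS has a managed add-on as well.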

HPA and VPA conflict — don’t put both on the same metric #

One frequently seen trap in operation: putting CPU-based HPA and CPU-based VPA on the same workload causes oscillation. The reason is simple.

CPU load goes up. HPA scales Pod count up per the ratio formula.
With more Pods, average CPU per Pod drops.
VPA (Auto) sees the dropped usage and judges “we should reduce requests.” Lowers recommendation and recreates Pods.
With requests lowered, dividing the same usage by the smaller denominator makes utilization (Utilization) rise again. HPA scales up Pods again.

A non-stopping oscillation cycle. Avoidance patterns are two:

Separate HPA and VPA metrics — for example HPA by CPU, VPA by memory. The two cycles don’t shake each other’s denominator/numerator.
VPA at updateMode: "Off" — compute recommendations only, no automatic application. A person reviews and reflects in the manifest. HPA operates as is.

Most operational clusters use the second pattern. HPA owns dynamic load adjustment, VPA stays as a recommendation tool, and someone incorporates those recommendations into manifests roughly once a quarter. This separation is the safest starting point.
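The feedback loop described in the four steps can be reproduced with a toy simulation. Every number here is an assumption chosen for illustration: total CPU usage is held constant at 2800m, and the VPA step is modeled as "set requests to current per-Pod usage plus a hypothetical 15% headroom". Real VPA recommendations are computed from usage history, not like this.

```python
import math


def simulate(vpa_auto: bool, rounds: int = 5) -> list:
    total_usage_m = 2800.0            # total CPU the workload burns, held constant
    replicas, request_m = 4, 500.0
    target_pct, max_replicas = 70, 20
    headroom = 1.15                   # hypothetical VPA margin above observed usage
    history = []
    for _ in range(rounds):
        # HPA step: utilization is per-Pod usage divided by requests.
        per_pod_m = total_usage_m / replicas
        utilization = per_pod_m * 100 / request_m
        replicas = min(max_replicas, math.ceil(replicas * utilization / target_pct))
        # VPA step: more Pods means lower per-Pod usage, so requests get lowered.
        if vpa_auto:
            request_m = (total_usage_m / replicas) * headroom
        history.append(replicas)
    return history


print(simulate(vpa_auto=False))  # [8, 8, 8, 8, 8]
print(simulate(vpa_auto=True))   # [8, 10, 13, 17, 20]
```

With requests fixed, HPA settles at 8 replicas after one step. With the VPA step shrinking the denominator each round, utilization never reaches the target and the replica count keeps climbing until maxReplicas stops it.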

Cluster Autoscaler — adjusting at the node dimension #

Even if HPA scales Pods up, when nodes have no resources for those Pods, the Pods stop in Pending state. The schedulability formula seen in #4 — where the room left in the node’s allocatable minus already-reserved requests must be greater than or equal to the new Pod’s requests — isn’t satisfied. What fills this gap is Cluster Autoscaler.

Behavior model #

CA’s behavior is simple in two directions.

scale up — when Pending Pods are seen, calls the cloud API to add a node big enough to receive that Pod’s requests to the node group. On AWS, scales the ASG’s desired capacity up; on GCP it’s MIG, on Azure it’s VMSS.
scale down — when there’s a node with low utilization for a certain time, moves Pods on that node to others and terminates the node. If there are Pods that can’t be moved (e.g., when a PV is attached to that node only, or a Pod with the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation), that node is left.

CA’s decisions are based not on metrics but on requests and schedulability. Even if actual usage is idle, when requests totals fill the node, new Pods become Pending and CA adds nodes. This model is at exactly the same layer as the expression “requests is the scheduler’s real currency” in #4.
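The requests-based view can be sketched as a first-fit placement check. This is a deliberately simplified stand-in for the scheduler (which also weighs memory, affinity, taints, and more); the node sizes and the function name place_pods are assumptions for illustration.

```python
def place_pods(nodes: list, pod_requests_m: list) -> list:
    """First-fit by CPU requests only. Returns the requests that stay Pending."""
    pending = []
    for request in pod_requests_m:
        for node in nodes:
            # Schedulable if allocatable minus already-reserved requests covers the Pod.
            if node["allocatable_m"] - node["reserved_m"] >= request:
                node["reserved_m"] += request
                break
        else:
            pending.append(request)  # no node has room: this is what CA reacts to
    return pending


nodes = [
    {"name": "node-1", "allocatable_m": 2000, "reserved_m": 1500},
    {"name": "node-2", "allocatable_m": 2000, "reserved_m": 1500},
]
pending = place_pods(nodes, [500, 500, 500, 500])
print(pending)  # [500, 500]
```

Actual CPU usage never appears in the check: two of the four new Pods stay Pending purely because the sum of requests has filled both nodes.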

Node groups by cloud environment #

CA runs paired with cloud providers. Mapping by environment:

Cloud	Node group abstraction	Note
AWS EKS	Auto Scaling Group (ASG) or EKS managed node group	Separate ASG per AZ recommended
GCP GKE	Managed Instance Group (MIG)	Enabled per node pool via an option. GKE Autopilot abstracts the node itself
Azure AKS	Virtual Machine Scale Set (VMSS)	Enabled via AKS cluster option
On-prem	Cluster API + provider	Varies by environment

For EKS, CA is usually installed via Helm chart. There’s a one-time setup of attaching appropriate tags to the ASG so CA discovers and manages it. GKE turns it on with one option line at cluster creation.

Karpenter — EKS’s faster alternative #

CA’s design goes through the cycle of “request +1 desired capacity to ASG → node created from the ASG’s launch template → kubelet registers with cluster.” The fact that the node spec is pre-defined in the ASG is a constraint — when a Pending Pod requires a lot of memory and the ASG’s instance type can only produce small nodes, the Pod remains Pending even after the new node comes up.

Karpenter is establishing itself as a faster alternative to CA in EKS. Karpenter’s differences are two:

Decides node spec dynamically by looking at Pending Pods — instead of a pre-defined ASG, picks instance types best matching Pending Pods’ requests and tolerations on the fly and spins them up directly via EC2 API.
Fast provisioning — without going through the ASG step, the time from node up to cluster join is usually shorter.

In new EKS clusters, adopting Karpenter instead of CA is increasingly common. Equivalent tools on GKE/AKS are not yet as mature as Karpenter is on EKS.

Common reasons CA doesn’t work #

A few patterns where CA doesn’t run as intended:

No cluster-autoscaler-related tags on the ASG. For AWS, the ASG must have tags like k8s.io/cluster-autoscaler/enabled for CA to consider it managed.
PodDisruptionBudget too strict — at scale down, if PDB blocks Pod movement, the node can’t be killed and isn’t reduced.
cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation — CA doesn’t terminate nodes with Pods carrying this annotation. Often attached to system Pods or Pods depending on local disk.
Pending reason isn’t resource shortage — if Pending due to nodeSelector or affinity mismatch, or PV’s AZ mismatch (the part WaitForFirstConsumer from #2 resolves), even adding nodes leaves the Pod still Pending. Outside CA’s responsibility.

Looking at the events section of kubectl describe pod together with the CA Pod’s logs (for example kubectl logs -n kube-system -l app=cluster-autoscaler; the label depends on how CA was installed) shows which side the cause is on.

Three-dimensional collaboration — one cycle of load spike #

Following one scenario where the three autoscalers run together — right after a marketing campaign starts that brings five times the normal traffic.

One cycle of load spike

t=0s    Campaign starts. 5x traffic enters.
        Deployment 'web': replicas=4, requests.cpu=500m
        All Pods' avg CPU 130% (target 70%)

t=15s   HPA gathers and computes metrics.
        desired = ceil(4 * (130/70)) = 8
        Requests change replicas: 4 -> 8.

t=20s   K8s tries to make 4 new Pods.
        But available CPU on nodes is short.
        2 of the new Pods Running, 2 Pending.

t=30s   CA finds Pending Pods.
        Requests +1 desired capacity to ASG.
        New node starts booting in cloud.

t=120s  New node joins cluster as Ready.
        The 2 Pending Pods schedule onto that node.
        Running.

t=135s  HPA measures again. Avg 80%.
        desired = ceil(8 * (80/70)) = 10. replicas: 8 -> 10.
        Need 2 more — this time fits in the new node's headroom.
        ...

When the campaign ends and traffic returns to normal, it shrinks in reverse — after HPA’s scale down stabilization window (5 min) Pods slowly reduce, and CA finds emptying nodes and terminates them. Node termination usually starts after low-utilization state holds for a certain time (default about 10 min), so it’s more conservative.

The key point is that this entire cycle runs without human intervention. But the preconditions for it to work — requests set on workloads, metrics-server running, HPA’s behavior tuned reasonably, ASG tagged for CA, node instance types matching the workload — all rest on the model built up from #4 onward. Autoscaling is the final layer that drives that model dynamically.

Operational adoption pattern — where to start #

It might look good to turn all three autoscalers on at once, but the operational recommended flow is conservative.

Adopt HPA first — most familiar, lowest incident risk. Confirm requests is in the target workload, set minReplicas ≥ 2, start at standard values like 70% CPU. Observe behavior over days/weeks and adjust behavior.
VPA at updateMode: "Off" with recommendations only — don’t turn on the policy of recreating Pods at first. Collect the recommender’s recommendations for a few days and once judged reasonable, have a person reflect into the manifest. Move to Auto only when confident the workload’s eviction impact is small.
CA is nearly mandatory in cloud environments — it’s meaningless in learning environments like minikube/kind, but operating in cloud clusters without CA forces a person to follow the node desired capacity each time. In EKS, the standard pattern is to install CA (or Karpenter) together from the start.
Custom metrics / KEDA per workload characteristics — there’s no need to force in Prometheus Adapter for workloads sufficiently expressed by CPU/memory signals. Adopt only for workloads where the kind of signal differs, like queue consumers or event workers.

Reducing this flow to one line — HPA is default for almost all workloads, VPA starts as a recommendation tool, CA is mandatory in cloud, KEDA where needed.

Summary #

The flow held in this post:

Three dimensions of automatic adjustment — HPA (Pod count), VPA (Pod requests/limits), CA (node count). Complementary and run simultaneously.
The metrics-server precondition — for HPA/VPA to operate, metrics-server (or Prometheus + Adapter, KEDA) must be installed. minikube via one addon line, EKS needs separate install, GKE/AKS default-on.
HPA manifest — apiVersion: autoscaling/v2. scaleTargetRef (target Deployment), minReplicas/maxReplicas (safety net), metrics (signals). CPU Utilization is the ratio against requests, so requests from #4 is the precondition.
HPA algorithm — desired = ceil(current * (currentMetric / targetMetric)). One ratio formula. multi-metric adopts the largest desired across each metric.
scale up/down asymmetry — behavior field. scale up immediate, scale down 5-min stabilization window. Prevents the incident of shrinking on a briefly-dropped signal and then cold-starting.
Custom metrics and KEDA — Prometheus Adapter exposes PromQL results to HPA. KEDA has 50+ event sources built in + 0→N scaling. Suitable for queue/event workloads.
VPA’s three components — recommender (compute recommendations), updater (Pod evict), admission-controller (inject recommendations into new Pods). updateMode is Off (recommend only) / Initial (at creation) / Auto (recreate). Operational start is almost always Off.
HPA/VPA conflict — running both on the same metric (CPU) oscillates. Avoidance is two paths — separate metrics, or leave VPA as Off and have a person reflect.
Cluster Autoscaler — when Pending Pods are seen, adds nodes to node groups (ASG / MIG / VMSS); empty nodes are terminated after a certain time. Decisions based on requests. Karpenter as the faster alternative on EKS.
Three-dimensional collaboration — on load spike, HPA scales Pods → node resource shortage → new Pods Pending → CA adds nodes. Reverse on shrinking. VPA in a separate cycle.
Operational adoption flow — HPA first, VPA at Off with recommendations only, CA nearly mandatory in cloud, KEDA only for workloads needing it.

Once this model is in hand, when operational cluster load swings, there is a layer that handles it without manual intervention. At the same time, the preconditions for that automation to work — requests being set, metrics-server running, reasonable behavior values, CA tags — are visible as one coherent bundle.

Next — RBAC / NetworkPolicy / ResourceQuota #

The series through this post has followed one complete cycle of the model of how to run a workload. #1’s controllers, #2’s persistent data, #3’s external entry point, #4’s resource requests, #5’s health signals, and this post’s automatic adjustment. Together these form one complete bundle for bringing up a workload in an operational cluster and keeping it running.

The next post moves the viewpoint up one level — policies for environments where many users, many teams, and many workloads share one cluster. The permission model RBAC for who can do what to which objects, NetworkPolicy controlling Pod-to-Pod network communication via whitelist, and ResourceQuota and LimitRange capping how much cluster resource a namespace can use. These three are the standard safety net for multi-tenant operational clusters.

#7 RBAC / NetworkPolicy / ResourceQuota — security and resource policy follows the manifests, behavior, and recommended operational patterns of these three objects in one cycle, wrapping up the K8s Intermediate series.