K8s Intermediate #6: Autoscaling — HPA / VPA / Cluster Autoscaler
The sixth post in the K8s Intermediate series. The flow through #5 was the story of how a single Pod stands up. In #4 we set how much resource a Pod requests and how much it can use, and in #5 we expressed whether that Pod is alive, ready to take traffic, or still coming up — via three kinds of probes. The single-Pod model wraps up there. But load in an operational cluster swings well beyond the single-Pod model — lunchtime traffic doubles, in the early morning it drops to 1/10, and on the day a marketing campaign runs it can be five times normal all day long. Keeping up with those swings manually via kubectl scale deployment ... --replicas=... doesn’t last long. This post brings together in one piece the three dimensions of automatic adjustment that fill that gap — HPA / VPA / Cluster Autoscaler.
This series is K8s Intermediate, 7 posts.
- #1 StatefulSet / DaemonSet / Job / CronJob — Controllers beyond Deployment
- #2 PV / PVC / StorageClass — the persistent data model
- #3 Ingress and Ingress Controller — the external entry point
- #4 resources.requests / limits — Pod resource requests and limits
- #5 Health checks — liveness / readiness / startup probes
- #6 Autoscaling — HPA / VPA / Cluster Autoscaler ← this post
- #7 RBAC / NetworkPolicy / ResourceQuota — security and resource policy
What autoscaling resolves #
Patterns of load swing in operational clusters are usually one of three — time-of-day swings (day vs night), event-driven spikes (campaigns, sales, news), and accumulating increases as workloads are added. Manual operation usually goes through these stages before hitting a limit:
- At first, setting replicas generously is enough: always keep about double the normal count up.
- Over time, you learn that the “generous value” is short in some hours and wasteful in others. Capacity falls short at peak, and money is wasted off-peak.
- Someone writes a cron job that applies a different replicas value for three slots: weekday day, night, weekend. It runs for a while.
- When a campaign comes in or an external traffic spike hits, a person wakes up at dawn to type kubectl scale. Soon that becomes recurring.
The way K8s expresses this problem is three dimensions of autoscalers. Each adjusts a different axis automatically.
| Autoscaler | What it adjusts | Signal | Target |
|---|---|---|---|
| HPA (Horizontal Pod Autoscaler) | Pod count (replicas) | CPU/memory utilization, custom metrics | Deployment / StatefulSet |
| VPA (Vertical Pod Autoscaler) | Pod resource requests / limits | Past CPU/memory usage trends | Deployment / StatefulSet |
| Cluster Autoscaler (CA) | Node count | Pods in Pending state, empty nodes | Cloud node groups (ASG / MIG / VMSS) |
What matters is that the three axes are complementary. Even if HPA adds Pods, when nodes have no headroom, the new Pods stop in Pending. CA then adds more nodes. In a separate cycle, VPA surfaces a recommendation like “this workload actually needs about 1Gi of memory.” When all three run together, load swings are absorbed without human intervention.
The metrics-server precondition #
For autoscaling to run, there must be a component inside the cluster that reports current resource usage. K8s itself doesn’t hold those metrics directly. Instead it provides a standardized interface (metrics.k8s.io), and the component that fills that interface is installed separately into the cluster. The most common implementation is metrics-server.
metrics-server periodically scrapes the kubelet’s /metrics/resource endpoint on each node in the cluster and holds node and Pod CPU/memory usage in memory. kubectl top and the HPA controller query those values via API.
kubectl top nodes
kubectl top pods -A
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   450m         22%    1.8Gi           45%
node-2   320m         16%    1.5Gi           37%
If values appear, metrics-server is alive; messages like error: Metrics API not available mean it’s not installed or dead. Installation by environment:
| Environment | metrics-server status |
|---|---|
| minikube | Activated by minikube addons enable metrics-server |
| kind | Manual installation needed (kubectl apply -f or Helm) |
| EKS | Manual installation needed. Helm or official manifest |
| GKE | Enabled by default |
| AKS | Enabled by default |
EKS does not include metrics-server right after cluster creation. To use HPA/VPA, it is the first component to install. To run HPA on custom metrics like queue length or request count beyond CPU/memory, components like Prometheus + Prometheus Adapter or KEDA take that role instead of (or alongside) metrics-server — covered later.
HPA — auto-adjusting Pod count #
The autoscaler used most often and adopted first is HPA. Instead of a person writing the replicas field, K8s fills it in automatically from the average of a metric across Pods.
HPA manifest — CPU baseline #
The simplest shape is by CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
Walking the key fields one by one:
- apiVersion: autoscaling/v2 is HPA’s current stable version. v1 can only handle CPU and is tied to a single metric. From v2, multi-metric, custom metrics, and asymmetric scale up/down behavior (behavior) are all possible. New manifests almost always use v2.
- scaleTargetRef is the target whose replicas is to be adjusted. It can attach to a Deployment, StatefulSet, or ReplicaSet. A DaemonSet is tied to node count, so it’s not an HPA target.
- minReplicas / maxReplicas are the lower and upper bounds of automatic adjustment. A safety net to prevent unintentionally going to 0 or hundreds in operation. Setting the lower bound above 1 ensures availability; setting the upper bound reasonably protects cost and resources.
- metrics is the array of signals to decide on. The example above is type: Resource (standard resources like CPU/memory) with target.type: Utilization and averageUtilization: 70 (70% average). HPA adjusts replicas so the average CPU utilization across all Pods is 70%.
One reason for setting minReplicas to 2 or more rather than 1: availability requires that another Pod be available to take traffic even when one Pod dies or is terminated during an update. In #5 we controlled traffic entry via readiness probes, but that only governs a single Pod’s readiness. The absence of the Pod itself must be covered by another Pod.
HPA algorithm — one ratio formula #
The formula by which HPA decides a new replicas value is simple.
desiredReplicas = ceil( currentReplicas * (currentMetricValue / targetMetricValue) )
In words: see how many times the target the current average is, and scale the Pod count by that ratio. Examples make it clear.
| currentReplicas | currentMetric (avg CPU) | targetMetric | Calculation | New replicas |
|---|---|---|---|---|
| 5 | 70% | 70% | 5 × 1.0 = 5 | 5 (unchanged) |
| 5 | 140% | 70% | 5 × 2.0 = 10 | 10 |
| 5 | 35% | 70% | 5 × 0.5 = 2.5 → ceil | 3 |
| 10 | 105% | 70% | 10 × 1.5 = 15 | 15 |
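The rows above are easy to check by hand. A minimal Python sketch, assuming a hypothetical helper desired_replicas (not part of any Kubernetes client) and leaving out details the real controller adds, such as its tolerance band and readiness handling. The clamp uses the minReplicas / maxReplicas values from the manifest above:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=20):
    """Ratio formula from the text, clamped to the HPA bounds (illustrative helper)."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# Rows of the table: (currentReplicas, current avg CPU %, target %)
for current, metric, target in [(5, 70, 70), (5, 140, 70), (5, 35, 70), (10, 105, 70)]:
    print(current, metric, target, "->", desired_replicas(current, metric, target))
```

The clamp is why a very quiet workload still keeps 2 Pods and a runaway metric stops at 20.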
The definition of utilization (Utilization) in the numerator matters. CPU utilization is the ratio against the Pod’s requests. A Pod holding requests.cpu: 500m and actually using 700m has 140% utilization.
This definition creates one trap directly tied to #4: if a workload has no resources.requests, HPA’s Utilization metric doesn’t work, because the denominator is undefined. Before adopting HPA, verify that CPU/memory requests are set on the target Deployment. Skipping this check leaves the HPA’s current value stuck at <unknown>.
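The role of requests as the denominator can be sketched in a few lines of Python. The helper name cpu_utilization_percent is made up for illustration, and the per-Pod view is a simplification of what the controller aggregates across Pods:

```python
def cpu_utilization_percent(usage_millicores, request_millicores):
    """Utilization as the text defines it: usage relative to requests.

    Returns None when no request is set, mirroring the <unknown> state.
    """
    if not request_millicores:
        return None
    return 100 * usage_millicores / request_millicores

print(cpu_utilization_percent(700, 500))   # 700m used against a 500m request
print(cpu_utilization_percent(700, None))  # no request set, so nothing to divide by
```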
If you want to run without requests, there’s a path of setting target.type to AverageValue instead of Utilization and writing an absolute value (e.g., 200m). Compare by absolute value rather than utilization. But this shape isn’t common; the operational standard is requests + Utilization.
multi-metric — looking at multiple signals together #
When the metrics array holds multiple items, HPA computes desired replicas separately for each metric and adopts the largest of them. Looking at CPU and memory at the same time:
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 80
If by CPU 5 are enough but by memory 8 are needed, HPA adopts 8. A conservative choice: it matches the more burdened side.
This shape often becomes meaningful for workloads holding memory caches. When you see a pattern of CPU being idle but memory filling up, watching memory together prevents HPA from missing that signal.
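A small Python sketch of the "largest proposal wins" rule. The function names and the sample utilization values are invented for illustration; they are chosen so that CPU proposes 5 and memory proposes 8, as in the example above:

```python
import math

def desired_for(current_replicas, current, target):
    """Ratio formula for a single metric."""
    return math.ceil(current_replicas * current / target)

def desired_multi(current_replicas, metrics):
    """metrics maps a name to (current, target); the largest proposal is adopted."""
    proposals = {name: desired_for(current_replicas, cur, tgt)
                 for name, (cur, tgt) in metrics.items()}
    return max(proposals.values()), proposals

# Hypothetical readings: CPU at 60% of a 70% target, memory at 120% of an 80% target
replicas, proposals = desired_multi(5, {"cpu": (60, 70), "memory": (120, 80)})
print(proposals, "->", replicas)
```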
Asymmetry of scale up vs scale down — behavior #
HPA does not simply apply the ratio formula at every cycle. If it did, two operational problems would arise: scaling Pods down too quickly when load briefly drops, causing cold-start response spikes when load rises again; and scaling Pods up too aggressively when load briefly spikes, wasting resources and cost. The field that handles both is behavior.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
    - type: Pods
      value: 4
      periodSeconds: 15
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60
    selectPolicy: Max
Three points to flag:
- stabilizationWindowSeconds is the decision stabilization window. The default 300 seconds (5 min) for scale down is the core operational safety device: only really shrink when CPU stays low for 5 minutes, and don’t shrink on signals that briefly drop and come back. Scale up usually leaves it at 0 for immediate reaction.
- policies are the rules on how much to change in one round. Two kinds exist: Percent (ratio of current count) and Pods (absolute count), with periodSeconds as that policy’s cycle. The example above for scale up allows the larger of “100% of current (×2) or +4 Pods” every 15 seconds.
- selectPolicy: Max / Min decides which policy to adopt among multiple. Max is the most aggressive change, Min the most conservative.
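The scale-up policies above can be read as a cap on how far one period may go. A rough Python sketch, assuming a hypothetical helper scale_up_limit. It simplifies the controller’s bookkeeping by treating the current count as the count at the start of the period:

```python
import math

def scale_up_limit(current_replicas, policies, select_policy="Max"):
    """Highest replica count one period may reach under the given policies."""
    allowed = []
    for policy in policies:
        if policy["type"] == "Percent":
            # Percent is relative to the current count
            allowed.append(current_replicas
                           + math.ceil(current_replicas * policy["value"] / 100))
        elif policy["type"] == "Pods":
            # Pods is an absolute number of additional Pods
            allowed.append(current_replicas + policy["value"])
    return max(allowed) if select_policy == "Max" else min(allowed)

policies = [{"type": "Percent", "value": 100}, {"type": "Pods", "value": 4}]
print(scale_up_limit(2, policies))   # small Deployment: +4 Pods beats doubling
print(scale_up_limit(10, policies))  # larger Deployment: doubling beats +4 Pods
```

With selectPolicy: Max, the Pods policy helps small Deployments grow quickly and the Percent policy takes over once the count is large.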
The operational meaning of asymmetry comes down to one line — scale up fast, scale down slowly. Response time degradation on a load spike is immediately visible to users, but the cost of one or two extra Pods for a short period is negligible. Conversely, scaling down too quickly causes cold-start latency spikes that are equally visible to users. Encoding this asymmetry explicitly with behavior is the standard operational pattern.
If you don’t write behavior at all, K8s’s reasonable defaults (immediate scale up, 5-min stabilization for scale down) apply. Starting with defaults at first adoption and adjusting per workload characteristics is the usual flow.
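The scale-down stabilization window boils down to "use the highest recommendation seen inside the window". A minimal sketch with an invented history of (timestamp, desired replicas) pairs; the function name is hypothetical:

```python
def stabilized_scale_down(recommendations, now, window_seconds=300):
    """recommendations: list of (timestamp_seconds, desired_replicas).

    A scale down may only go as low as the highest value inside the window.
    """
    recent = [desired for ts, desired in recommendations
              if now - ts <= window_seconds]
    return max(recent)

# Hypothetical history: load dips at t=60, comes back at t=120, then stays low
history = [(0, 10), (60, 4), (120, 9), (180, 4), (240, 4)]
print(stabilized_scale_down(history, now=240))  # the t=0 value still holds replicas up
print(stabilized_scale_down(history, now=400))  # only the t=120 rebound remains
print(stabilized_scale_down(history, now=450))  # window holds only low values now
```

A brief dip therefore never shrinks the Deployment; only a low signal that persists for the whole window does.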
HPA apply and behavior check #
kubectl apply -f hpa-cpu.yaml
kubectl get hpa
kubectl describe hpa web
NAME   REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
web    Deployment/web   55%/70%   2         20        4          5m
55%/70% in the TARGETS column is the current avg / target. If this shows <unknown>/70%, either metrics-server isn’t alive or the target workload has no requests. Messages like FailedGetResourceMetric appear together in the events section of kubectl describe hpa.
A command to verify behavior with a load test:
kubectl run load-gen --rm -it --image=busybox -- /bin/sh
# inside the container
while true; do wget -q -O- http://web.default.svc.cluster.local; done
In another terminal with kubectl get hpa -w, you’ll see REPLICAS grow per the ratio formula from the moment TARGETS exceeds 70%. After the load stops, it slowly shrinks starting about 5 minutes later.
Custom metrics and KEDA — beyond CPU/memory #
There are workloads not sufficiently expressed by CPU/memory.
- Queue consumers — workers receiving and processing messages from SQS/Kafka/RabbitMQ. Queue length is the real signal, not CPU. Workers’ CPU can be idle even while the queue piles up.
- API gateways — RPS or concurrent connections are more direct signals than resource use.
- Event-driven workloads — function-style workloads that run only when there’s work.
Applying only HPA’s CPU baseline to these workloads, you’re a beat behind the real inflection point of load, or you miss the signal entirely.
Prometheus Adapter #
The first path to having HPA see metrics beyond CPU/memory is Prometheus Adapter. If Prometheus is installed in the cluster and workloads expose metrics, Prometheus Adapter exposes the PromQL results from that Prometheus to K8s’s custom.metrics.k8s.io API. HPA can then use those metrics like standard metrics.
In the manifest’s metrics array, you write type: Pods or type: External to express which PromQL result to look at. Going deep is deferred to the K8s advanced track, but just to show the shape:
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
Adjust replicas so each Pod’s average RPS is 100. The metric definition (http_requests_per_second) is written as PromQL in Prometheus Adapter’s config.
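With an AverageValue target the same ratio formula reduces to "total divided by the per-Pod target". A short Python sketch; the helper name and the per-Pod RPS readings are invented for illustration:

```python
import math

def desired_for_average_value(per_pod_values, target_average):
    """AverageValue target: enough Pods that the per-Pod average meets the target."""
    return math.ceil(sum(per_pod_values) / target_average)

# Hypothetical readings: 4 Pods serving 660 requests per second in total
print(desired_for_average_value([180, 150, 170, 160], 100))
```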
KEDA — event-driven 0→N #
KEDA (Kubernetes Event-Driven Autoscaling) is a step further. It resolves two things HPA can’t:
- 0 → N scaling: standard HPA’s minReplicas must be 1 or more, so Pods can’t be scaled fully to 0 when there’s no work. KEDA shrinks workloads to 0 during idle queue periods, and brings them up to 1 when a new message arrives. A big difference in cost.
- Direct connection to diverse event sources: over 50 sources like SQS, Kafka, RabbitMQ, Redis Streams, PostgreSQL, Prometheus are built in. Without writing PromQL as with Prometheus Adapter, queue-length-based scaling works with one KEDA ScaledObject manifest.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker
spec:
  scaleTargetRef:
    name: sqs-worker
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.ap-northeast-2.amazonaws.com/.../my-queue
      queueLength: "10"
      awsRegion: ap-northeast-2
KEDA internally creates a standard HPA: when a ScaledObject is applied, a corresponding HPA is auto-created and KEDA exposes external metrics to the K8s metrics API. Think of it as a convenience layer on top of standard HPA. In clusters with many queue consumers or event workers, it is becoming nearly the standard tool.
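How the numbers in that manifest interact can be approximated in Python. This is a rough model only: the function name is made up, and real KEDA also has activation thresholds, polling intervals and cooldown periods that are not modeled here:

```python
import math

def keda_like_replicas(queue_messages, queue_length=10, min_count=0, max_count=30):
    """Roughly one replica per queue_length messages, 0 when the queue is empty."""
    if queue_messages == 0:
        return min_count
    wanted = math.ceil(queue_messages / queue_length)
    # At least one worker while there is work, never more than maxReplicaCount
    return max(1, min_count, min(max_count, wanted))

for messages in (0, 3, 95, 1000):
    print(messages, "->", keda_like_replicas(messages))
```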
VPA — auto-adjusting Pod resource requests #
If HPA is the dimension of “how many Pods,” VPA is the dimension of “the size of one Pod.” In #4 we covered the process where a person looks at usage data and sets requests. VPA is an attempt to automate that work — it computes recommended values from the workload’s past CPU/memory usage trends and, depending on policy, applies those values by recreating Pods.
Three components — recommender / updater / admission-controller #
VPA is not a single controller but a bundle of three components.
| Component | Role |
|---|---|
| recommender | Gathers metrics and computes recommended requests values. Records them in the VPA object’s status.recommendation |
| updater | If the recommender’s recommendation and the current Pod’s value diverge significantly, evicts the Pod (causes recreation) |
| admission-controller | When a new Pod is created, injects the recommended values via mutating admission webhook |
The three components form a cycle — recommender computes recommendations, updater finds large discrepancies and kills Pods, and when new Pods are created, admission-controller starts them with manifests reflecting the recommendations. requests get refreshed to match actual workload usage without human intervention.
VPA manifest and updatePolicy #
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi
The updateMode value is the heart of operational decision-making. The three modes to know:
| updateMode | Behavior |
|---|---|
| "Off" | Compute recommendations only, don’t apply. A person reads them and reflects them in the manifest |
| "Initial" | Apply recommendations only at the moment a new Pod is created. Already-up Pods unchanged |
| "Auto" (= Recreate) | When large discrepancies are seen, evict the Pod and recreate with the new recommendations |
Auto looks like the end point of automation, but it calls for caution in operation. When VPA evicts a Pod, that Pod is terminated and a replacement has to come up. In StatefulSets, single-replica workloads, or workloads with long startup probe times from #5, eviction directly impacts availability.
When first adopting VPA, the standard pattern is almost always to start with "Off". After collecting recommendations for days or weeks and confirming they’re reasonable, then move to Initial or Auto, or have a person incorporate those recommendations into manifests and commit.
kubectl describe vpa web
Status:
  Recommendation:
    Container Recommendations:
      Container Name:  web
      Lower Bound:
        Cpu:     150m
        Memory:  256Mi
      Target:
        Cpu:     350m
        Memory:  512Mi
      Upper Bound:
        Cpu:     800m
        Memory:  1Gi
Target is the key recommended value. Lower Bound and Upper Bound can be seen as statistical confidence intervals. If this recommendation differs significantly from the current manifest’s requests, having a person review the difference and reflect it in the manifest is the conservative operational approach.
resourcePolicy’s minAllowed / maxAllowed #
In the manifest above, minAllowed and maxAllowed in resourcePolicy set upper and lower bounds on recommendations. Without this safety net, VPA can recommend requests that are too small based on off-peak values, or too large based on a transient memory leak pattern. In practice, always writing both values is recommended.
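The effect of minAllowed / maxAllowed is a plain clamp on the raw recommendation. A minimal Python sketch using the bounds from the manifest above; the parsers are simplified helpers written for this example and handle only the quantity forms used in this post:

```python
def parse_cpu(quantity):
    """'500m' -> 500 millicores, '2' or 2 -> 2000 millicores."""
    text = str(quantity)
    return int(text[:-1]) if text.endswith("m") else int(float(text) * 1000)

def parse_memory(quantity):
    """'64Mi' / '4Gi' -> bytes (only these two suffixes are handled)."""
    units = {"Mi": 2**20, "Gi": 2**30}
    return int(quantity[:-2]) * units[quantity[-2:]]

def clamp(value, low, high):
    return max(low, min(high, value))

# Hypothetical raw recommendations: CPU too small (off-peak), memory too large (leak)
cpu = clamp(parse_cpu("20m"), parse_cpu("50m"), parse_cpu(2))
memory = clamp(parse_memory("6Gi"), parse_memory("64Mi"), parse_memory("4Gi"))
print(cpu, "millicores,", memory // 2**20, "Mi")
```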
Clusters where VPA isn’t installed #
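Unlike the HPA controller, which ships inside K8s itself, VPA is a separate project and is not present in a cluster by default. Where it is not offered as a managed feature (EKS, for example), it is installed from the official GitHub manifests or a Helm chart. GKE provides it as a managed option that is switched on per cluster.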
HPA and VPA conflict — don’t put both on the same metric #
One frequently seen trap in operation: putting CPU-based HPA and CPU-based VPA on the same workload causes oscillation. The reason is simple.
- CPU load goes up. HPA scales Pod count up per the ratio formula.
- With more Pods, average CPU per Pod drops.
- VPA (Auto) sees the dropped usage and judges “we should reduce requests.” Lowers recommendation and recreates Pods.
- With requests lowered, dividing the same usage by the smaller denominator makes Utilization rise again. HPA scales up Pods again.
A non-stopping oscillation cycle. Avoidance patterns are two:
- Separate HPA and VPA metrics: for example HPA by CPU, VPA by memory. The two cycles don’t shake each other’s denominator/numerator.
- VPA at updateMode: "Off": compute recommendations only, no automatic application. A person reviews them and reflects them in the manifest. HPA operates as is.
Most operational clusters use the second pattern. HPA owns dynamic load adjustment, VPA stays as a recommendation tool, and someone incorporates those recommendations into manifests roughly once a quarter. This separation is the safest starting point.
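The feedback between the two controllers can be made visible with a toy model in Python. Everything here is an assumption made for illustration: total load is held constant, VPA is modeled as recommending per-Pod usage plus a fixed 15% margin (an invented number), and both controllers act once per round. Real VPA works from usage histograms over much longer periods:

```python
import math

def simulate(total_usage_m=2800, replicas=4, request_m=500,
             target_pct=70, vpa_margin=1.15, max_replicas=20, rounds=4):
    """Toy model of CPU-based HPA and VPA (Auto) acting on the same Deployment."""
    history = []
    for _ in range(rounds):
        per_pod = total_usage_m / replicas
        utilization = 100 * per_pod / request_m
        # HPA step: ratio formula, capped at maxReplicas
        replicas = min(max_replicas, math.ceil(replicas * utilization / target_pct))
        # VPA step (assumed model): per-Pod usage after scaling, plus a margin
        request_m = round(total_usage_m / replicas * vpa_margin)
        history.append((replicas, request_m))
    return history

for replicas, request_m in simulate():
    print(f"replicas={replicas} requests.cpu={request_m}m")
```

In this model the load never changes after the first round, yet every round HPA adds Pods and VPA shrinks requests. Neither controller ever sees a state it is satisfied with, which is the loop described in the steps above.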
Cluster Autoscaler — adjusting at the node dimension #
Even if HPA scales Pods up, when nodes have no resources for those Pods, the Pods stop in Pending state. The schedulability formula seen in #4 — where the room left in the node’s allocatable minus already-reserved requests must be greater than or equal to the new Pod’s requests — isn’t satisfied. What fills this gap is Cluster Autoscaler.
Behavior model #
CA’s behavior is simple in two directions.
- scale up: when Pending Pods are seen, CA calls the cloud API to add a node big enough to receive that Pod’s requests to the node group. On AWS, it scales the ASG’s desired capacity up; on GCP it’s MIG, on Azure it’s VMSS.
- scale down: when a node has had low utilization for a certain time, CA moves the Pods on that node to others and terminates the node. If there are Pods that can’t be moved (e.g., when a PV is attached to that node only, or a Pod with the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation), that node is left.
CA’s decisions are based not on metrics but on requests and schedulability. Even if actual usage is idle, when requests totals fill the node, new Pods become Pending and CA adds nodes. This model is at exactly the same layer as the expression “requests is the scheduler’s real currency” in #4.
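The requests-based decision can be sketched in Python for CPU only. The function names and node figures are hypothetical, and the real scheduler checks far more than one resource (memory, affinity, taints and so on):

```python
def fits(node_allocatable_m, reserved_requests_m, pod_request_m):
    """Schedulability check from #4, in CPU millicores."""
    return node_allocatable_m - sum(reserved_requests_m) >= pod_request_m

def is_pending(nodes, pod_request_m):
    """True when no node has room for the Pod's requests (CA's scale-up signal)."""
    return not any(fits(allocatable, reserved, pod_request_m)
                   for allocatable, reserved in nodes)

# Two hypothetical nodes: (allocatable CPU, requests already reserved on the node)
nodes = [(2000, [500, 500, 500]), (2000, [500, 500, 500, 400])]
print(is_pending(nodes, 500))  # the first node still has 500m unreserved
print(is_pending(nodes, 600))  # no node has 600m unreserved
```

Actual CPU usage never appears in this check, which is why a cluster can be mostly idle and still trigger a node scale up.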
Node groups by cloud environment #
CA runs paired with cloud providers. Mapping by environment:
| Cloud | Node group abstraction | Note |
|---|---|---|
| AWS EKS | Auto Scaling Group (ASG) or EKS managed node group | Separate ASG per AZ recommended |
| GCP GKE | Managed Instance Group (MIG) | Enabled per node pool with an autoscaling option. GKE Autopilot abstracts the node itself |
| Azure AKS | Virtual Machine Scale Set (VMSS) | Enabled via AKS cluster option |
| On-prem | Cluster API + provider | Varies by environment |
For EKS, CA is usually installed via Helm chart. There’s a one-time setup of attaching appropriate tags to the ASG so CA discovers and manages it. GKE turns it on with one option line at cluster creation.
Karpenter — EKS’s faster alternative #
CA’s design goes through the cycle of “request +1 desired capacity to ASG → node created from the ASG’s launch template → kubelet registers with cluster.” The fact that the node spec is pre-defined in the ASG is a constraint: when a Pending Pod requires a lot of memory and the ASG’s instance type can only produce small nodes, no node from that group can fit the Pod, so adding nodes does not help and the Pod remains Pending.
Karpenter is establishing itself as a faster alternative to CA in EKS. Karpenter’s differences are two:
- Decides node spec dynamically by looking at Pending Pods: instead of a pre-defined ASG, it picks instance types best matching the Pending Pods’ requests and tolerations on the fly and spins them up directly via the EC2 API.
- Fast provisioning: without going through the ASG step, the time from node up to cluster join is usually shorter.
In new EKS clusters, adopting Karpenter instead of CA is increasingly common. Equivalent tools on GKE/AKS are not yet as mature as Karpenter is on EKS.
Common reasons CA doesn’t work #
A few patterns where CA doesn’t run as intended:
- No cluster-autoscaler-related tags on the node group: for AWS, the ASG must have tags like k8s.io/cluster-autoscaler/enabled for CA to consider it managed.
- PodDisruptionBudget too strict: at scale down, if a PDB blocks Pod movement, the node can’t be drained and isn’t removed.
- cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation: CA doesn’t terminate nodes with Pods carrying this annotation. Often attached to system Pods or Pods depending on local disk.
- Pending reason isn’t resource shortage: if a Pod is Pending due to a nodeSelector or affinity mismatch, or a PV’s AZ mismatch (the part that WaitForFirstConsumer from #2 resolves), adding nodes leaves the Pod still Pending. That is outside CA’s responsibility.
Looking at the events section of kubectl describe pod and the CA Pod’s logs together (for example kubectl logs -n kube-system -l app=cluster-autoscaler; the label depends on how CA was installed) shows which side the cause is on.
Three-dimensional collaboration — one cycle of load spike #
Following one scenario where the three autoscalers run together — right after a marketing campaign starts that brings five times the normal traffic.
t=0s    Campaign starts. 5x traffic enters.
        Deployment 'web': replicas=4, requests.cpu=500m
        All Pods' avg CPU 130% (target 70%)
t=15s   HPA gathers and computes metrics.
        desired = ceil(4 * (130/70)) = 8
        Requests change replicas: 4 -> 8.
t=20s   K8s tries to make 4 new Pods.
        But available CPU on nodes is short.
        2 of the new Pods Running, 2 Pending.
t=30s   CA finds Pending Pods.
        Requests +1 desired capacity to ASG.
        New node starts booting in cloud.
t=120s  New node joins cluster as Ready.
        The 2 Pending Pods schedule onto that node.
        Running.
t=135s  HPA measures again. Avg 80%.
        desired = ceil(8 * (80/70)) = 10. replicas: 8 -> 10.
        Need 2 more; this time they fit in the new node's headroom.
...
When the campaign ends and traffic returns to normal, it shrinks in reverse: after HPA’s scale down stabilization window (5 min) Pods slowly reduce, and CA finds emptying nodes and terminates them. Node termination usually starts after a low-utilization state holds for a certain time (default about 10 min), so it’s more conservative.
The key point is that this entire cycle runs without human intervention. But the preconditions for it to work — requests set on workloads, metrics-server running, HPA’s behavior tuned reasonably, ASG tagged for CA, node instance types matching the workload — all rest on the model built up from #4 onward. Autoscaling is the final layer that drives that model dynamically.
Operational adoption pattern — where to start #
It might look good to turn all three autoscalers on at once, but the operational recommended flow is conservative.
- Adopt HPA first: most familiar, lowest incident risk. Confirm requests is set on the target workload, set minReplicas ≥ 2, and start at standard values like 70% CPU. Observe the scaling over days or weeks, then tune the behavior field.
- VPA at updateMode: "Off" with recommendations only: don’t turn on the policy of recreating Pods at first. Collect the recommender’s recommendations for a few days and, once they are judged reasonable, have a person reflect them in the manifest. Move to Auto only when confident the workload’s eviction impact is small.
- CA is nearly mandatory in cloud environments: it’s meaningless in learning environments like minikube/kind, but operating cloud clusters without CA forces a person to adjust the node desired capacity each time. In EKS, the standard pattern is to install CA (or Karpenter) from the start.
- Custom metrics / KEDA per workload characteristics: there’s no need to force in Prometheus Adapter for workloads sufficiently expressed by CPU/memory signals. Adopt it only for workloads where the kind of signal differs, like queue consumers or event workers.
Reducing this flow to one line — HPA is default for almost all workloads, VPA starts as a recommendation tool, CA is mandatory in cloud, KEDA where needed.
Summary #
The flow held in this post:
- Three dimensions of automatic adjustment: HPA (Pod count), VPA (Pod requests/limits), CA (node count). Complementary and run simultaneously.
- The metrics-server precondition: for HPA/VPA to operate, metrics-server (or Prometheus + Adapter, KEDA) must be installed. minikube via one addon line, EKS needs separate install, GKE/AKS default-on.
- HPA manifest: apiVersion: autoscaling/v2. scaleTargetRef (target Deployment), minReplicas / maxReplicas (safety net), metrics (signals). CPU Utilization is the ratio against requests, so requests from #4 is the precondition.
- HPA algorithm: desired = ceil(current * (currentMetric / targetMetric)). One ratio formula. multi-metric adopts the largest desired across each metric.
- scale up/down asymmetry: the behavior field. scale up immediate, scale down 5-min stabilization window. Prevents the incident of shrinking on a briefly-dropped signal and then cold-starting.
- Custom metrics and KEDA: Prometheus Adapter exposes PromQL results to HPA. KEDA has 50+ event sources built in + 0→N scaling. Suitable for queue/event workloads.
- VPA’s three components: recommender (compute recommendations), updater (Pod evict), admission-controller (inject recommendations into new Pods). updateMode is Off (recommend only) / Initial (at creation) / Auto (recreate). Operational start is almost always Off.
- HPA/VPA conflict: running both on the same metric (CPU) oscillates. Avoidance is two paths: separate metrics, or leave VPA as Off and have a person reflect the recommendations.
- Cluster Autoscaler: when Pending Pods are seen, adds nodes to node groups (ASG / MIG / VMSS); empty nodes are terminated after a certain time. Decisions based on requests. Karpenter as the faster alternative on EKS.
- Three-dimensional collaboration: on load spike, HPA scales Pods → node resource shortage → new Pods Pending → CA adds nodes. Reverse on shrinking. VPA in a separate cycle.
- Operational adoption flow: HPA first, VPA at Off with recommendations only, CA nearly mandatory in cloud, KEDA only for workloads needing it.
Once this model is in hand, when operational cluster load swings, there is a layer that handles it without manual intervention. At the same time, the preconditions for that automation to work — requests being set, metrics-server running, reasonable behavior values, CA tags — are visible as one coherent bundle.
Next — RBAC / NetworkPolicy / ResourceQuota #
The series through this post has followed one complete cycle of the model of how to run a workload. #1’s controllers, #2’s persistent data, #3’s external entry point, #4’s resource requests, #5’s health signals, and this post’s automatic adjustment. Together these form one complete bundle for bringing up a workload in an operational cluster and keeping it running.
The next post moves the viewpoint up one level — policies for environments where many users, many teams, and many workloads share one cluster. The permission model RBAC for who can do what to which objects, NetworkPolicy controlling Pod-to-Pod network communication via whitelist, and ResourceQuota and LimitRange capping how much cluster resource a namespace can use. These three are the standard safety net for multi-tenant operational clusters.
#7 RBAC / NetworkPolicy / ResourceQuota — security and resource policy follows the manifests, behavior, and recommended operational patterns of these three objects in one cycle, wrapping up the K8s Intermediate series.