K8s Intermediate #4: resources.requests / limits — Pod Resource Requests and Limits
The fourth post in the K8s Intermediate series. Through #3, the viewpoint was outside the cluster: how external traffic reaches workloads inside via Service, Ingress, and Ingress Controller. This post returns the viewpoint to inside the Pod and looks at how the container that receives that traffic and does the work requests CPU and memory and is given upper bounds, which is the story of resources.requests and resources.limits. The separation of these two fields is the foundation that holds up K8s scheduling and stability at the same time.
This series is K8s Intermediate, 7 posts.
- #1 StatefulSet / DaemonSet / Job / CronJob — Controllers beyond Deployment
- #2 PV / PVC / StorageClass — the persistent data model
- #3 Ingress and Ingress Controller — the external entry point
- #4 resources.requests / limits — Pod resource requests and limits ← this post
- #5 Health checks — liveness / readiness / startup probes
- #6 Autoscaling — HPA / VPA / Cluster Autoscaler
- #7 RBAC / NetworkPolicy / ResourceQuota — security and resource policy
requests and limits: the two values play different roles
Two fields in the K8s manifest express the resource model: each container’s resources.requests and resources.limits. The two look similar, but who reads them and when they are read are completely different.
| Field | Observer | When observed | Meaning |
|---|---|---|---|
| resources.requests | Scheduler (kube-scheduler) | When deciding which node to place the Pod on | The minimum resource that must be guaranteed for this container to run |
| resources.limits | kubelet (cgroup) | While the container is actually running | The hard cap this container can never exceed |
When deciding where to place a new Pod, the scheduler subtracts the sum of requests of Pods already running from the candidate node’s allocatable resources (total node resources minus the share reserved for system daemons and kubelet). Only nodes where the new Pod’s requests fit within the remainder become candidates. limits is not part of this decision. Even when the sum of limits on a node exceeds allocatable, K8s will still schedule a Pod on it. This is called overcommit, and it is the default behavior because it lets node resources be used efficiently in a statistical sense.
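The fit check described above can be sketched in a few lines of Python. This is a simplified illustration, not kube-scheduler's actual code: the `Node` class, the node names, and the numbers are hypothetical, and only CPU and memory are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    allocatable_cpu_m: int      # allocatable CPU in millicores
    allocatable_mem_mi: int     # allocatable memory in MiB
    # (cpu_m, mem_mi) requests of Pods already placed on this node
    pod_requests: list[tuple[int, int]] = field(default_factory=list)


def fits(node: Node, req_cpu_m: int, req_mem_mi: int) -> bool:
    """True if the new Pod's requests fit in allocatable minus existing requests."""
    used_cpu = sum(cpu for cpu, _ in node.pod_requests)
    used_mem = sum(mem for _, mem in node.pod_requests)
    return (req_cpu_m <= node.allocatable_cpu_m - used_cpu
            and req_mem_mi <= node.allocatable_mem_mi - used_mem)


def candidate_nodes(nodes: list[Node], req_cpu_m: int, req_mem_mi: int) -> list[str]:
    """Names of nodes that pass the requests-based filter. limits play no part."""
    return [n.name for n in nodes if fits(n, req_cpu_m, req_mem_mi)]


if __name__ == "__main__":
    nodes = [
        Node("node-a", 4000, 8192, [(3800, 2048)]),                # CPU nearly full
        Node("node-b", 4000, 8192, [(1000, 7900)]),                # memory nearly full
        Node("node-c", 4000, 8192, [(2000, 4096), (1500, 2048)]),  # has headroom
    ]
    print(candidate_nodes(nodes, 250, 512))  # ['node-c']
```

Note that actual usage never appears in the check: a node whose Pods are idle but have large requests is still treated as full.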
limits works in the next layer. Once the Pod is placed on a node and the container starts up, kubelet sets the limits values into the container’s cgroup. When the container tries to exceed that bound, the Linux kernel forcibly stops it. The behavior here splits by resource type — CPU is throttled (made to wait briefly without compute), memory is OOMKilled (the container is forcibly terminated). This difference is covered separately later.
Reduced to one mental line — requests is “the amount that must be guaranteed,” seen by scheduling; limits is “the amount that must never be exceeded,” enforced by runtime. That these two values can be set differently is the core of K8s’s resource model.
Units of CPU and memory
A frequently confusing part of the manifest is unit notation. CPU and memory each use different notation.
CPU: cores and millicores
CPU is in core units. 1 means 1 core, 2 means 2 cores. To slice smaller than a core, use millicore notation.
| Notation | Meaning |
|---|---|
| 1 | 1 core (1000 millicores) |
| 500m | 0.5 cores |
| 250m | 0.25 cores |
| 100m | 0.1 cores |
| 0.5 | Same as 500m |
Operational manifests often use millicore integers like 100m and 250m. Decimal notation like 0.1 leaves room for confusion at the YAML parse step, so it’s a pattern to avoid. A CPU limit maps directly to the container cgroup’s CPU quota: 100m means 10ms of CPU time per 100ms period. A CPU request maps to a relative scheduling weight instead of a quota.
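The millicore arithmetic can be checked with a short Python sketch. The function names are hypothetical, and the 100ms period is the usual default rather than something read from a real cgroup.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m", "1", or "0.5" to millicores."""
    q = quantity.strip()
    if q.endswith("m"):
        return int(q[:-1])
    return round(float(q) * 1000)


def cfs_quota_us(limit: str, period_us: int = 100_000) -> int:
    """CPU time in microseconds that a container with this limit may use per period."""
    return cpu_to_millicores(limit) * period_us // 1000


if __name__ == "__main__":
    print(cpu_to_millicores("0.5"))   # 500
    print(cfs_quota_us("100m"))       # 10000 microseconds, i.e. 10ms per 100ms
    print(cfs_quota_us("1"))          # 100000 microseconds, one full core
```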
Memory: binary vs decimal
Memory has two families of unit suffixes, and the difference between them is a regular cause of operational confusion.
| Notation | Value | Note |
|---|---|---|
| 1Ki | 1024 bytes | Binary |
| 1Mi | 1024 KiB = 1,048,576 bytes | Binary |
| 1Gi | 1024 MiB = 1,073,741,824 bytes | Binary |
| 1k | 1000 bytes | Decimal (the kilo suffix is a lowercase k) |
| 1M | 1,000,000 bytes | Decimal |
| 1G | 1,000,000,000 bytes | Decimal |
1Gi and 1G differ by about 7% (1GiB is bigger). The standard for operational manifests is the binary suffix (Mi, Gi). Container runtimes and the OS treat memory in binary units, and tools like kubectl top display values in binary as well. Writing 1G and seeing usage shown as 0.93Gi comes from this unit mismatch.
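A small Python sketch makes the gap between the two suffix families concrete. `memory_to_bytes` is a hypothetical helper that handles only the suffixes shown in the table, not the full Kubernetes quantity grammar.

```python
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,   # binary
    "k": 1000, "M": 1000**2, "G": 1000**3,      # decimal (kilo is lowercase k)
}


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "1G" to bytes."""
    q = quantity.strip()
    # Check two-letter binary suffixes before one-letter decimal ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * _SUFFIXES[suffix]
    return int(q)  # no suffix: plain bytes


if __name__ == "__main__":
    gi, g = memory_to_bytes("1Gi"), memory_to_bytes("1G")
    print(gi, g)                   # 1073741824 1000000000
    print(round(gi / g, 3))        # 1.074, so 1Gi is about 7% larger
    print(round(g / gi, 2))        # 0.93, which is 1G expressed in Gi
```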
A single manifest
The two units above go directly into a manifest. Putting a resources key into the container definition inside the Pod template of a Deployment is the standard shape.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: app
          image: myapp/web:1.4.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
```

This container will only be scheduled on a node that can still guarantee 0.25 cores and 512 MiB of memory. The scheduler accepts new Pods only on nodes with that much headroom. While running, it can use up to 1 core and 1 GiB of memory, and any attempt to exceed that is stopped by the cgroup (throttled for CPU, OOMKilled for memory).
The resources field is per-container — if a Pod has multiple containers, write it on each container separately. Pod-level requests / limits are computed as the sum of the containers. If you have a sidecar container (e.g., a log collector), don’t forget to put small requests / limits on that container too.
```bash
kubectl top pod
kubectl top pod -n <namespace> --containers
```

```
NAME                  CPU(cores)   MEMORY(bytes)
web-7d4f8b9c5-abc12   180m         420Mi
web-7d4f8b9c5-def34   220m         480Mi
```

kubectl top requires metrics-server to be installed in the cluster. The displayed values are the actual usage from the container cgroup, so the units are binary.
The four combinations of requests / limits
How you write the two in the manifest greatly changes behavior. Putting the four combinations in one table:
| Combination | Behavior | QoS | Operational fit |
|---|---|---|---|
| Both written | Safest. Both scheduling guarantee and runtime cap are clear | Guaranteed if requests = limits, otherwise Burstable | Recommended |
| Only requests | Has scheduling guarantee but no runtime cap. Container may potentially occupy the node’s entire resources | Burstable | Only when limits has to be omitted |
| Only limits | K8s treats it as requests = limits, resulting in the most conservative shape | Guaranteed | Fine, but explicit writing is recommended |
| Neither | No subtraction at scheduling. No runtime cap | BestEffort | Not recommended |
The most common trap is omitting both. That container becomes BestEffort QoS and is the first eviction target when the node faces resource pressure. The scheduler also treats this Pod’s resources as 0 when picking a node, so a single node can end up packed with BestEffort Pods. In production manifests, always writing requests / limits — even small ones — is the safer practice.
Writing only requests and omitting limits is done only with clear intent — because CPU throttling inflates response latency, some operators intentionally omit CPU limits (more on this below). Memory limits, however, are almost always set, because leaving them out lets buggy code consume the entire node’s memory.
QoS classes: Guaranteed / Burstable / BestEffort
K8s classifies Pods into three QoS classes based on their requests / limits shape. This classification decides who gets evicted first under node resource pressure.
| QoS | Condition | Eviction priority |
|---|---|---|
| Guaranteed | All containers have requests == limits for all resources, both specified | Last (safest) |
| Burstable | Only requests, requests / limits differ, or only some containers have them | Middle |
| BestEffort | No container has requests or limits | First (most at risk) |
Eviction is the action where kubelet forcibly terminates Pods to reclaim resources when signals like memory or disk pressure on the node exceed thresholds. Candidates are picked in the order BestEffort → Burstable → Guaranteed. Within the same tier, Pods using more resources become candidates first.
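The classification rules can be written down as a small Python function. This is a simplified sketch, not the kubelet's implementation: `qos_class` is a hypothetical name, only cpu and memory are considered, and quantities are compared as written strings, so "1" and "1000m" would count as different here even though they are the same amount.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a Pod from its containers' {"requests": {...}, "limits": {...}} dicts."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = dict(c.get("requests", {}))
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        # A missing request is treated as equal to the limit for that resource.
        for res, lim in limits.items():
            requests.setdefault(res, lim)
        for res in ("cpu", "memory"):
            if res not in limits or requests.get(res) != limits[res]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    web = {"requests": {"cpu": "250m", "memory": "512Mi"},
           "limits": {"cpu": "1", "memory": "1Gi"}}
    print(qos_class([web]))                                           # Burstable
    print(qos_class([{"limits": {"cpu": "500m", "memory": "1Gi"}}]))  # Guaranteed
    print(qos_class([{}]))                                            # BestEffort
```

One container without values is enough to pull an otherwise Guaranteed Pod down to Burstable, which is why sidecars need their own resources block.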
```bash
kubectl get pod web-7d4f8b9c5-abc12 -o jsonpath='{.status.qosClass}'
```

```
Burstable
```

Standard operational patterns:
- Stateful core workloads like DBs and message queues — set as Guaranteed. Set requests = limits to minimize eviction probability.
- General stateless web / API servers — Burstable. Usual usage as requests, burstable cap as limits.
- Batch / temporary workloads — Burstable or BestEffort. Workloads that may yield first when cluster resources run short.
There’s almost no case for using BestEffort in operation, but a temporary Pod for short-term debugging fits there.
CPU limit’s trap: throttling
This is where many operational incidents come from. CPU and memory behave completely differently when limits are exceeded.
CPU limit is enforced via throttling. The container cgroup’s CPU quota allocates only the limits-equivalent per cycle (usually 100ms), and once the container uses it all, it doesn’t get compute until the next cycle. The container doesn’t die — it just pauses briefly and wakes up at the next cycle.
For example, suppose a container has cpu: limits: 100m. It can use only 10ms of CPU time per 100ms period. If a single request needs 50ms of CPU, it uses the first 10ms, waits 90ms, uses 10ms again, waits 90ms, and so on. Work that would have taken 50ms on an unthrottled CPU takes around 410ms.
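The arithmetic in this example can be reproduced with a short Python sketch. It assumes a single-threaded burst that starts at the beginning of a period with the full quota available; `throttled_wall_time_ms` is a hypothetical helper, not a measurement of a real cgroup.

```python
def throttled_wall_time_ms(cpu_needed_ms: float, limit_millicores: int,
                           period_ms: float = 100.0) -> float:
    """Wall-clock time for a burst that needs `cpu_needed_ms` of CPU under a CPU limit."""
    quota_ms = period_ms * limit_millicores / 1000   # CPU time allowed per period
    if cpu_needed_ms <= quota_ms:
        return cpu_needed_ms                          # finishes inside the first period
    full_periods, remainder = divmod(cpu_needed_ms, quota_ms)
    if remainder == 0:
        # The last slice ends quota_ms into its period; nothing is left to wait for.
        return (full_periods - 1) * period_ms + quota_ms
    return full_periods * period_ms + remainder


if __name__ == "__main__":
    print(throttled_wall_time_ms(50, 100))    # 410.0
    print(throttled_wall_time_ms(50, 1000))   # 50 (no throttling with a 1-core limit)
```

With a 100m limit, 50ms of work occupies five periods: four full 100ms periods plus the final 10ms slice, which gives the 410ms figure.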
The most common operational symptom of this behavior is response latency spikes. Average CPU utilization looks well below the limit, but p99 response time suddenly jumps: a short burst hit the limit and throttling kicked in. You can check accumulated throttling via cAdvisor metrics (container_cpu_cfs_throttled_seconds_total, container_cpu_cfs_throttled_periods_total) or the nr_throttled counter in the container cgroup’s cpu.stat file.
Because of this cost, some teams adopt the operational pattern of intentionally omitting CPU limits. Lock down only the guaranteed amount via requests and let the workload burst freely when the node has headroom. This pattern is appropriate when both of the following hold:
- The node has plenty of resource headroom, and workloads bursting against each other moderately doesn’t shake the node.
- requests are reasonably set so that one workload’s runaway doesn’t invade another workload’s guaranteed amount.
In contrast, there’s almost no pattern of omitting memory limits — runaway memory threatens the entire node. Looking at that behavior next.
Memory limit’s trap: OOMKilled
Memory limit isn’t throttling but a hard cap. When the container tries to allocate beyond the limit, the Linux kernel’s OOM Killer immediately terminates the container’s process. K8s detects this termination and records the container’s reason as OOMKilled.
```bash
kubectl describe pod web-7d4f8b9c5-abc12
```

```
Containers:
  app:
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 18 May 2026 14:22:10 +0900
      Finished:     Mon, 18 May 2026 14:35:42 +0900
    Restart Count:  3
```

Reason: OOMKilled paired with Exit Code: 137 (SIGKILL) is the canonical shape. If Restart Count rises quickly with the same reason repeating, it signals that the memory limit is set lower than the workload’s actual usage.
What happens if you omit memory limits? The container cgroup has no cap, so buggy code (memory leaks, loading large files entirely into memory) can consume the entire node’s memory. That triggers node-level memory pressure, and other Pods on that node become eviction candidates in the order BestEffort → Burstable → Guaranteed. One workload’s bug destabilizes everything else on the same node. That’s why memory limits should always be set.
In short — exceeding CPU limits is throttling (no immediate termination); exceeding memory limits is OOMKilled (immediate termination).
JVM and Go runtime cgroup awareness
Some runtimes need more than just writing resource limits. JVM and Go are the representative cases.
JVM
Older JVMs read the host’s /proc/cpuinfo and /proc/meminfo directly to decide worker thread count, heap size, GC thread count, and so on. They were blind to the container’s cgroup limits — a JVM inside a container with cpu: limits: 500m would see the host’s 32 cores, spin up 32 GC threads, and promptly get throttled. This was a widespread source of incidents.
-XX:+UseContainerSupport was introduced in Java 10 and backported to Java 8u191, and it is enabled by default in both. With this option on, the JVM recognizes cgroup CPU and memory limits and decides thread count and heap size accordingly. JDK builds older than that do not have the flag (8u131 through 8u181 offer only the experimental -XX:+UseCGroupMemoryLimitForHeap for heap sizing), so if the operational container image is on such a JDK, upgrading the base image is the safer fix.
Go
The Go runtime’s GOMAXPROCS (the number of OS threads that can execute Go code in parallel) historically defaulted to runtime.NumCPU(), which reflects the host’s core count and ignores cgroup CPU limits. A Go process built that way, in a container with cpu: limits: 500m running on a 32-core host, starts with GOMAXPROCS=32 and immediately starts hitting throttling. Go 1.25 changed the default on Linux to take the cgroup CPU limit into account, so the remedies below matter mainly for binaries built with older Go versions.
Two standard remedies:
- automaxprocs library: importing the go.uber.org/automaxprocs package reads cgroup CPU limits at process start and matches GOMAXPROCS automatically. Close to the operational standard pattern.
- Manual environment variable: setting GOMAXPROCS directly in the Pod manifest’s env.

```yaml
env:
  - name: GOMAXPROCS
    value: "1"
```

Other language runtimes can have similar traps. It’s worth verifying whether things like Node.js’s libuv thread pool size or Python’s multiprocessing.cpu_count() are reading from the host or from the cgroup.
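For runtimes that do not look at the cgroup themselves, the check can be done by hand. The Python sketch below parses the cgroup v2 cpu.max file, whose content is a quota and a period in microseconds (or the word max when there is no limit). The function names are hypothetical, and the /sys/fs/cgroup/cpu.max path assumes a cgroup v2 container.

```python
import math
import os
from typing import Optional


def cpu_limit_from_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max content ("<quota> <period>"); None means no limit."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def effective_cpu_count(cpu_max_text: str, host_cpus: int) -> int:
    """Worker count that respects the cgroup limit, never below 1."""
    limit = cpu_limit_from_cpu_max(cpu_max_text)
    if limit is None:
        return host_cpus
    return max(1, min(host_cpus, math.ceil(limit)))


if __name__ == "__main__":
    path = "/sys/fs/cgroup/cpu.max"   # assumed cgroup v2 location inside the container
    host = os.cpu_count() or 1
    if os.path.exists(path):
        with open(path) as f:
            print(effective_cpu_count(f.read(), host))
    else:
        print(host)
```

A value computed this way could size a worker pool instead of a bare CPU count taken from the host.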
The subtlety of measuring memory usage vs memory limits
How you measure memory usage shapes your intuition about when OOMKilled triggers. The memory the cgroup charges against the limit is RSS (Resident Set Size) plus page cache, so heavy file I/O that fills the page cache counts toward the limit as well. The value kubectl top shows is usually the working set (total usage minus inactive file cache, which the kernel can reclaim easily), so it may understate usage relative to what the cgroup counts.
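The relationship between cgroup usage and working set can be sketched as follows. The sketch assumes cgroup v2, where total usage comes from memory.current and the inactive file cache is the inactive_file line of memory.stat; `working_set_bytes` is a hypothetical helper and the numbers are made up.

```python
def working_set_bytes(usage_bytes: int, memory_stat_text: str) -> int:
    """Total cgroup usage minus inactive file cache, floored at zero."""
    inactive_file = 0
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "inactive_file":      # cgroup v2 key name
            inactive_file = int(value)
    return max(0, usage_bytes - inactive_file)


if __name__ == "__main__":
    # Made-up numbers: 700 MB charged to the cgroup, 200 MB of it inactive file cache.
    stat = (
        "anon 400000000\n"
        "file 300000000\n"
        "inactive_file 200000000\n"
        "active_file 100000000\n"
    )
    print(working_set_bytes(700_000_000, stat))   # 500000000
```

In this made-up case a dashboard based on working set reports 500 MB while the cgroup is charged 700 MB, which is the gap the paragraph above describes.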
If OOMKilled keeps repeating in operation, it’s safe to look in this order:
- Check the OOMKilled fact and count via the Last State in kubectl describe pod.
- Check normal usage via kubectl top pod --containers.
- Check time series via cAdvisor metrics or Prometheus’s container_memory_working_set_bytes and container_memory_rss.
- Examine both an application-level memory leak possibility and raising limits.
LimitRange: namespace-level defaults
Writing requests / limits in every manifest is something people easily forget. K8s provides the LimitRange object to prevent this oversight. It’s an object that locks down defaults and allowed ranges per namespace.
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resource-limits
  namespace: dev
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "2Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
```

The meaning of each field:
| Field | Meaning |
|---|---|
| default | Default automatically given when the container has no limits |
| defaultRequest | Default automatically given when the container has no requests |
| max | Per-container resource cap. Manifests exceeding this value are rejected |
| min | Per-container resource floor. Manifests below this value are rejected |
With this LimitRange applied to the dev namespace, if someone applies a manifest missing requests / limits, K8s automatically fills in the default / defaultRequest values. Accidentally creating a BestEffort QoS Pod is blocked. Conversely, if a container demands more than max, the manifest apply itself is rejected, blocking the incident of one person’s mistake taking the entire node’s resources.
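The defaulting and validation described above can be sketched in Python. This is a simplified model, not the admission controller's code: `apply_limit_range` is a hypothetical function, values are plain integers (millicores for cpu, MiB for memory), and only the four fields from the table are modeled.

```python
# Values mirror the LimitRange above, in millicores (cpu) and MiB (memory).
LIMIT_RANGE = {
    "default": {"cpu": 500, "memory": 512},
    "defaultRequest": {"cpu": 100, "memory": 128},
    "max": {"cpu": 2000, "memory": 2048},
    "min": {"cpu": 50, "memory": 64},
}


def apply_limit_range(container: dict, lr: dict = LIMIT_RANGE) -> dict:
    """Fill in defaults, then validate against min / max. Raises ValueError on rejection."""
    requests = dict(container.get("requests", {}))
    limits = dict(container.get("limits", {}))
    for res in ("cpu", "memory"):
        had_limit = res in limits
        if not had_limit:
            limits[res] = lr["default"][res]
        if res not in requests:
            # An explicit limit with no request makes the request equal to that limit;
            # otherwise defaultRequest applies.
            requests[res] = limits[res] if had_limit else lr["defaultRequest"][res]
        if limits[res] > lr["max"][res]:
            raise ValueError(f"{res} limit {limits[res]} exceeds max {lr['max'][res]}")
        if requests[res] < lr["min"][res]:
            raise ValueError(f"{res} request {requests[res]} is below min {lr['min'][res]}")
        if requests[res] > limits[res]:
            raise ValueError(f"{res} request {requests[res]} exceeds limit {limits[res]}")
    return {"requests": requests, "limits": limits}


if __name__ == "__main__":
    # A container with no resources block receives both sets of defaults.
    print(apply_limit_range({}))
```

One consequence worth noticing: a container that sets requests above the namespace default limit but omits limits is rejected, because the filled-in limit ends up below its request.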
The operational pattern usually goes:
- dev namespace — small default and small max. So developers can spin things up lightly.
- stage / prod namespace — set generous defaults matching workload characteristics, but cap max so a single container can’t take the whole node.
ResourceQuota: namespace-level total caps
If LimitRange is per-container policy, ResourceQuota is the namespace-wide total policy. An object that prevents the sum of requests / limits across all Pods in a namespace from exceeding the value.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
```

With this ResourceQuota applied, the dev namespace can’t have its total requests.cpu exceed 10 cores or have more than 50 Pods. A new Pod manifest that breaks this cap is rejected at apply.
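The namespace-wide sum check can be sketched in Python. This is a simplified model of the idea, not the quota controller: `violations` is a hypothetical function, each Pod is a dict of already-totaled values with explicit requests and limits, and units are millicores for cpu and MiB for memory.

```python
# Values mirror the ResourceQuota above, in millicores (cpu) and MiB (memory).
HARD = {
    "requests.cpu": 10_000,
    "requests.memory": 20 * 1024,
    "limits.cpu": 20_000,
    "limits.memory": 40 * 1024,
    "pods": 50,
}


def violations(existing_pods: list[dict], new_pod: dict, hard: dict = HARD) -> list[str]:
    """Quota keys that would be exceeded if new_pod were admitted (empty list = admit)."""
    used = {key: 0 for key in hard}
    for pod in existing_pods + [new_pod]:
        used["pods"] += 1
        for key, value in pod.items():
            used[key] += value
    return [key for key in hard if used[key] > hard[key]]


if __name__ == "__main__":
    pod = {"requests.cpu": 1000, "requests.memory": 1024,
           "limits.cpu": 2000, "limits.memory": 2048}
    running = [pod] * 9
    print(violations(running, pod))                             # []
    print(violations(running, {**pod, "requests.cpu": 1500}))   # ['requests.cpu']
```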
The pattern of pairing ResourceQuota with LimitRange is common. Once a quota tracks requests.cpu, requests.memory, limits.cpu, or limits.memory, a Pod that leaves the corresponding value empty is rejected at creation, because the quota has nothing to count it against. A LimitRange in the same namespace fills in the missing requests / limits first, so such manifests are admitted with defaults instead of being rejected.
The full use of ResourceQuota is covered in #7 RBAC / NetworkPolicy / ResourceQuota — bundling it with security and resource policy reads more naturally.
Operational pattern: how to start and how to adjust
There’s no way to know the exact right requests / limits values when first deploying a new workload. The pattern that has emerged in practice follows this order:
- Start with conservative requests + generous limits: the first step is estimation. Reference past data of similar workloads, or get rough numbers from local load tests. requests at 70-80% of normal usage and limits at two to three times that is a common starting point.
- Collect operational data: gather a few days of per-container CPU and memory time series from Prometheus + metrics-server, or the time series APMs like Datadog / New Relic expose. Look at p95 / p99 usage during peak traffic windows.
- Use the VPA recommender: running Vertical Pod Autoscaler with updateMode: Off lets you receive only recommended values without actually changing resources. The recommender learns workload characteristics and proposes appropriate requests / limits. VPA’s behavior is covered deeply in #6.
- Adjust and redeploy: combine recommended values and monitoring data to update the manifest’s requests / limits, reflected in the next deployment. Raising requests requires new Pods to come up, so it usually flows naturally via rolling update.
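For the data collection step, the p95 / p99 figures can be computed from exported usage samples with a few lines of Python. This is a minimal sketch using the nearest-rank method; `percentile` and `peak_summary` are hypothetical helpers, and the sample values are made up rather than taken from a real cluster.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (0 < pct <= 100) of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def peak_summary(samples: list[float]) -> dict:
    """p95 and p99 of usage samples, e.g. CPU millicores during a peak window."""
    return {"p95": percentile(samples, 95), "p99": percentile(samples, 99)}


if __name__ == "__main__":
    cpu_millicores = [180, 220, 200, 400, 210]   # made-up samples
    print(percentile(cpu_millicores, 50))        # 210
    print(peak_summary(cpu_millicores))          # {'p95': 400, 'p99': 400}
```

Comparing these figures against the current requests and limits shows whether the first estimate was too tight or too generous before any value is changed.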
Running this cycle once per workload makes a substantial difference in cluster-wide resource efficiency and stability. Rather than trying to get it perfect from day one, launching with a reasonable estimate and adjusting from real data is the faster path.
The next post’s topic, liveness / readiness probes, also intersects directly with resource pressure — when a Pod’s response slows due to throttling, or it stalls in a GC pause just before OOM, how the probe detects that state determines how quickly the workload recovers.
Summary
The flow held in this post:
- requests and limits play different roles: requests is the guaranteed amount the scheduler sees when picking a node; limits is the runtime cap kubelet enforces via cgroup. They are policies at different layers.
- Units: CPU is core and millicore (1, 500m, 100m). Memory binary (Mi, Gi) is the operational standard. 1Gi and 1G differ by 7%.
- Four combinations: writing both is the standard. Writing only limits gets treated as requests = limits, becoming Guaranteed. Omitting both becomes BestEffort, top of the eviction list.
- QoS classes: Guaranteed (requests = limits) / Burstable (in between) / BestEffort (neither). Under resource pressure, eviction in the order BestEffort → Burstable → Guaranteed.
- Exceeding CPU limits is throttling: no immediate termination, common cause of response latency spikes. The operational pattern of intentionally omitting only CPU limits also exists.
- Exceeding memory limits is OOMKilled: immediate forced termination. Last State: Terminated + Reason: OOMKilled + Exit Code: 137 in kubectl describe pod is the signal.
- JVM recognizes cgroups via -XX:+UseContainerSupport (default-on from Java 10+). Go’s GOMAXPROCS did not recognize cgroups before Go 1.25, so older builds need the automaxprocs library or a manual env setting.
- LimitRange: namespace-level defaults (default / defaultRequest) and allowed range (min / max). Auto-applied to manifests missing requests / limits.
- ResourceQuota: namespace-wide total cap. Pattern of pairing with LimitRange. Detailed use in #7.
- Operational cycle: start with conservative requests + generous limits, adjust via monitoring and VPA recommender, reflect via redeploy.
Once this model is in hand, whenever you encounter a resources block in a manifest, you can read at a glance which QoS class that container belongs to and how it will behave under node resource pressure.
Next: Health checks (liveness / readiness / startup probes)
What this post covered was the model of how much resource a container receives. The next post moves the viewpoint from resources to whether the container is alive — the model of how K8s figures out whether a container is operating normally and how it detects abnormal states to start recovery actions.
#5 Health checks — liveness / readiness / startup probes walks through the three kinds of probes in one cycle. liveness probe is the signal that triggers container restart, readiness probe is the signal that adds and removes from Service endpoints, and startup probe is the signal that gives a grace period to slow-starting containers. How the three probes’ responsibilities differ, when to use which of HTTP / TCP / exec for the check, what tuning parameters like initialDelaySeconds / periodSeconds / failureThreshold mean, and how it intertwines with this post’s resource model — all in the shape of one manifest.