K8s Intermediate #4: resources.requests / limits — Pod Resource Requests and Limits


This is the fourth post in the K8s Intermediate series. Through #3, the viewpoint was outside the cluster: how external traffic reaches workloads inside via Service, Ingress, and Ingress Controller. This post moves the viewpoint back inside the Pod and covers how the container that receives that traffic requests CPU and memory and is given upper bounds, that is, resources.requests and resources.limits. Keeping these two fields separate is what lets K8s handle scheduling and runtime stability at the same time.

This series is K8s Intermediate, 7 posts.

requests and limits — the two values play different roles

Two fields in a K8s manifest express the resource model: a container’s resources.requests and resources.limits. The two look similar, but who reads them and when they take effect are completely different.

| Field | Observer | When observed | Meaning |
| --- | --- | --- | --- |
| resources.requests | Scheduler (kube-scheduler) | When deciding which node to place the Pod on | The minimum resource that must be guaranteed for this container to run |
| resources.limits | kubelet (cgroup) | While the container is actually running | The hard cap this container can never exceed |

When deciding where to place a new Pod, the scheduler subtracts the sum of the requests of Pods already placed on a candidate node from that node’s allocatable resources (total node resources minus the share reserved for system daemons and kubelet). Only nodes where the new Pod’s requests fit within the remainder become candidates. limits is not part of this decision. Even when the sum of limits on a node exceeds allocatable, K8s will still schedule a Pod on it. This is called overcommit, and it is the default behavior because it lets node resources be used statistically efficiently.
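
The fit check described above can be sketched in a few lines. This is a simplified illustration, not the kube-scheduler's actual code: the `Resources` type, the `fits` function, and the node and Pod numbers are all hypothetical, and real scheduling also weighs many other factors (affinity, taints, scoring).

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cpu_millicores: int
    memory_bytes: int


def fits(allocatable: Resources, running: list[Resources], new_pod: Resources) -> bool:
    """Return True if new_pod's requests fit in what is left after existing requests."""
    free_cpu = allocatable.cpu_millicores - sum(r.cpu_millicores for r in running)
    free_mem = allocatable.memory_bytes - sum(r.memory_bytes for r in running)
    return new_pod.cpu_millicores <= free_cpu and new_pod.memory_bytes <= free_mem


MI = 1024 ** 2
# Hypothetical node: 4 cores and 8 GiB allocatable, two Pods already placed.
node = Resources(cpu_millicores=4000, memory_bytes=8192 * MI)
running = [Resources(1500, 3072 * MI), Resources(2000, 2048 * MI)]

print(fits(node, running, Resources(250, 512 * MI)))  # 500m CPU is still free
print(fits(node, running, Resources(750, 512 * MI)))  # needs 750m, only 500m free
```

Note that only requests appear in the calculation. The limits of the running Pods could add up to far more than the node has, and the result would not change.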

limits works in the next layer. Once the Pod is placed on a node and the container starts, kubelet (through the container runtime) writes the limits values into the container’s cgroup. When the container tries to exceed that bound, the Linux kernel enforces it. How it does so depends on the resource type: CPU is throttled (the container waits without compute for a short time), while memory overuse gets the container OOMKilled (forcibly terminated). This difference is covered separately later.

Reduced to one mental line — requests is “the amount that must be guaranteed,” seen by scheduling; limits is “the amount that must never be exceeded,” enforced by runtime. That these two values can be set differently is the core of K8s’s resource model.

Units of CPU and memory

A frequently confusing part of the manifest is unit notation. CPU and memory each use different notation.

CPU — cores and millicores

CPU is in core units. 1 means 1 core, 2 means 2 cores. To slice smaller than a core, use millicore notation.

| Notation | Meaning |
| --- | --- |
| 1 | 1 core (1000 millicores) |
| 500m | 0.5 cores |
| 250m | 0.25 cores |
| 100m | 0.1 cores |
| 0.5 | Same as 500m |

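Operational manifests usually use millicore integers like 100m and 250m. Decimal notation like 0.1 is valid and means the same as 100m, but it is easier to misread or mistype, so the millicore form is preferred. A CPU limit maps directly to the container cgroup’s CPU quota: with the default 100ms period, 100m means 10ms of CPU time per period. A CPU request maps to a relative CPU weight (shares) instead, which only matters when CPUs are contended.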
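
The limit-to-quota mapping is plain arithmetic. The helpers below are a minimal sketch with hypothetical names (`cpu_to_millicores`, `cfs_quota_us`); they assume the common 100ms (100,000 microsecond) period and handle only the notations shown in the table above.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m", "1", or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def cfs_quota_us(limit: str, period_us: int = 100_000) -> int:
    """CPU time in microseconds the container may use per period under this limit."""
    return cpu_to_millicores(limit) * period_us // 1000


print(cpu_to_millicores("0.5"))  # 500
print(cfs_quota_us("100m"))      # 10000 microseconds, i.e. 10ms per 100ms
print(cfs_quota_us("1"))         # 100000 microseconds, one full core
```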

Memory — binary vs decimal

Memory has two families of unit suffixes, and the difference between them is a regular cause of operational confusion.

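| Notation | Value | Note |
| --- | --- | --- |
| 1Ki | 1024 bytes | Binary |
| 1Mi | 1024 KiB = 1,048,576 bytes | Binary |
| 1Gi | 1024 MiB = 1,073,741,824 bytes | Binary |
| 1k | 1000 bytes | Decimal (the kilo suffix is a lowercase k) |
| 1M | 1,000,000 bytes | Decimal |
| 1G | 1,000,000,000 bytes | Decimal |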

1Gi and 1G differ by about 7% (1GiB is bigger). The standard for operational manifests is the binary suffix (Mi, Gi). Container runtimes and the OS treat memory in binary units, and tools like kubectl top display values in binary as well. Writing 1G and seeing usage shown as 0.93Gi comes from this unit mismatch.
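
The gap between the two suffix families is easy to verify. The parser below is a hypothetical helper (`memory_to_bytes`) that covers only the six suffixes in the table above, not the full K8s quantity grammar.

```python
SUFFIXES = {
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3,  # binary
    "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3,     # decimal
}


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "1G" to bytes."""
    # Two-letter binary suffixes are checked before one-letter decimal ones.
    for suffix in ("Ki", "Mi", "Gi", "k", "M", "G"):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * SUFFIXES[suffix]
    return int(quantity)  # no suffix: plain bytes


gi = memory_to_bytes("1Gi")
g = memory_to_bytes("1G")
print(gi, g)               # 1073741824 1000000000
print(round(gi / g, 3))    # 1.074, so 1Gi is about 7% larger
print(round(g / gi, 2))    # 0.93, which is why 1G shows up as 0.93Gi
```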

A single manifest

The manifest below applies both units. The standard shape is a resources key on the container definition inside a Deployment’s Pod template.

deployment-with-resources.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: app
          image: myapp/web:1.4.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"

This container will only be scheduled on a node that still has 0.25 cores and 512 MiB of memory unreserved by other Pods’ requests. While running, it can use up to 1 core and 1 GiB of memory. CPU use beyond that is throttled, and memory use beyond that gets the container OOM-killed; both are enforced through the cgroup.

The resources field is per-container — if a Pod has multiple containers, write it on each container separately. Pod-level requests / limits are computed as the sum of the containers. If you have a sidecar container (e.g., a log collector), don’t forget to put small requests / limits on that container too.

Check resource usage
kubectl top pod
kubectl top pod -n <namespace> --containers
Output example
NAME                   CPU(cores)   MEMORY(bytes)
web-7d4f8b9c5-abc12    180m         420Mi
web-7d4f8b9c5-def34    220m         480Mi

kubectl top requires metrics-server to be installed in the cluster. The displayed values are the actual usage from the container cgroup, so the units are binary.

The four combinations of requests / limits

How you write the two in the manifest greatly changes behavior. Putting the four combinations in one table:

| Combination | Behavior | QoS | Operational fit |
| --- | --- | --- | --- |
| Both written | Safest. Both scheduling guarantee and runtime cap are clear | Guaranteed if requests = limits, otherwise Burstable | Recommended |
| Only requests | Has scheduling guarantee but no runtime cap. The container can potentially occupy the node’s entire resources | Burstable | Only when limits has to be omitted |
| Only limits | K8s treats it as requests = limits, resulting in the most conservative shape | Guaranteed (when both CPU and memory limits are set) | Fine, but explicit writing is recommended |
| Neither | Counted as 0 at scheduling. No runtime cap | BestEffort | Not recommended |

The most common trap is omitting both. That container becomes BestEffort QoS and is the first eviction target when the node faces resource pressure. The scheduler also treats this Pod’s resources as 0 when picking a node, so a single node can end up packed with BestEffort Pods. In production manifests, always writing requests / limits — even small ones — is the safer practice.

Writing only requests and omitting limits is done only with clear intent — because CPU throttling inflates response latency, some operators intentionally omit CPU limits (more on this below). Memory limits, however, are almost always set, because leaving them out lets buggy code consume the entire node’s memory.

QoS classes — Guaranteed / Burstable / BestEffort

K8s classifies Pods into three QoS classes based on their requests / limits shape. This classification decides who gets evicted first under node resource pressure.

| QoS | Condition | Eviction priority |
| --- | --- | --- |
| Guaranteed | Every container specifies both requests and limits for CPU and memory, with requests == limits | Last (safest) |
| Burstable | At least one container has a request or limit, but the Pod does not meet the Guaranteed condition | Middle |
| BestEffort | No container has requests or limits | First (most at risk) |

Eviction is the action where kubelet forcibly terminates Pods to reclaim resources when signals like memory or disk pressure on the node cross thresholds. kubelet ranks candidates by whether their usage exceeds their requests, then by Pod priority, then by how far usage exceeds requests. In practice this works out to roughly BestEffort → Burstable → Guaranteed, and within the same tier, Pods using the most beyond their requests go first.
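
The classification rules can be written down as a small function. This is a sketch of the rules as stated in this post, not the K8s implementation: `qos_class` is a hypothetical name, containers are plain dicts, and quantities are compared as literal strings (so "1" and "1000m" would count as different here, which the real API server does not do).

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a Pod from its containers' requests/limits for cpu and memory."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = dict(c.get("requests", {}))
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for res in ("cpu", "memory"):
            # A limit without a request makes the request default to the limit.
            requests.setdefault(res, limits.get(res))
            if limits.get(res) is None or requests[res] != limits[res]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


web = {"requests": {"cpu": "250m", "memory": "512Mi"},
       "limits": {"cpu": "1", "memory": "1Gi"}}
print(qos_class([web]))                                        # Burstable
print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{}]))                                         # BestEffort
```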

Check Pod's QoS class
kubectl get pod web-7d4f8b9c5-abc12 -o jsonpath='{.status.qosClass}'
Output example
Burstable

Standard operational patterns:

  • Stateful core workloads like DBs and message queues — set as Guaranteed. Set requests = limits to minimize eviction probability.
  • General stateless web / API servers — Burstable. Usual usage as requests, burstable cap as limits.
  • Batch / temporary workloads — Burstable or BestEffort. Workloads that may yield first when cluster resources run short.

There’s almost no case for using BestEffort in operation, but a temporary Pod for short-term debugging fits there.

CPU limit’s trap — throttling

This section and the next cover a common source of operational incidents: CPU and memory behave completely differently when limits are exceeded.

CPU limit is enforced via throttling. The container cgroup’s CPU quota allocates only the limits-equivalent per cycle (usually 100ms), and once the container uses it all, it doesn’t get compute until the next cycle. The container doesn’t die — it just pauses briefly and wakes up at the next cycle.

For example, suppose a container has limits.cpu: 100m. It can use only 10ms of CPU time per 100ms period. If a single request needs 50ms of CPU, it uses the first 10ms, waits 90ms, uses 10ms again, waits 90ms, and so on: four full 100ms periods plus a final 10ms slice. Work that would have taken 50ms on an unthrottled CPU takes around 410ms.
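
The same arithmetic as a function, for trying other limits. It is an idealized model under stated assumptions: a single-threaded task that starts exactly at a period boundary, with no other work in the container. The name `throttled_wall_time_ms` is hypothetical.

```python
import math


def throttled_wall_time_ms(cpu_needed_ms: float, limit_millicores: int,
                           period_ms: float = 100.0) -> float:
    """Wall-clock time for a task needing cpu_needed_ms of CPU under a CPU limit."""
    quota_ms = period_ms * limit_millicores / 1000          # CPU time allowed per period
    full_periods = math.ceil(cpu_needed_ms / quota_ms) - 1  # periods that end in a wait
    remainder_ms = cpu_needed_ms - full_periods * quota_ms  # work done in the last period
    return full_periods * period_ms + remainder_ms


print(throttled_wall_time_ms(50, 100))   # 410.0, the example above
print(throttled_wall_time_ms(50, 500))   # 50.0, fits inside one 50ms quota
print(throttled_wall_time_ms(50, 250))   # 125.0
```

Raising the limit from 100m to 250m cuts the latency from 410ms to 125ms in this model, which is why a small change to a CPU limit can have an outsized effect on tail latency.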

The most common operational symptom of this behavior is response latency spikes. Average CPU utilization looks well below the limit, but p99 response time suddenly jumps, because a short burst hit the limit and throttling kicked in. You can check accumulated throttling through cAdvisor metrics (container_cpu_cfs_throttled_seconds_total, container_cpu_cfs_throttled_periods_total) or the cpu.stat file in the container’s cgroup.

Because of this cost, some teams adopt the operational pattern of intentionally omitting CPU limits. Lock down only the guaranteed amount via requests and let the workload burst freely when the node has headroom. This pattern is appropriate when both of the following hold:

  • The node has plenty of resource headroom, and workloads bursting against each other moderately doesn’t shake the node.
  • requests are reasonably set so that one workload’s runaway doesn’t invade another workload’s guaranteed amount.

In contrast, there’s almost no pattern of omitting memory limits — runaway memory threatens the entire node. Looking at that behavior next.

Memory limit’s trap — OOMKilled

Memory limit isn’t throttling but a hard cap. When the container tries to allocate beyond the limit, the Linux kernel’s OOM Killer immediately terminates the container’s process. K8s detects this termination and records the container’s reason as OOMKilled.

Checking OOMKilled
kubectl describe pod web-7d4f8b9c5-abc12
Output example — Last State excerpt
Containers:
  app:
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 18 May 2026 14:22:10 +0900
      Finished:     Mon, 18 May 2026 14:35:42 +0900
    Restart Count:  3

Reason: OOMKilled paired with Exit Code: 137 (SIGKILL) is the canonical shape. If Restart Count rises quickly with the same reason repeating, it signals that the memory limit is set lower than the workload’s actual usage.

What happens if you omit memory limits? The container cgroup has no cap, so buggy code (memory leaks, loading large files entirely into memory) can consume the entire node’s memory. That triggers node-level memory pressure, and other Pods on that node become eviction candidates in the order BestEffort → Burstable → Guaranteed. One workload’s bug destabilizes everything else on the same node. That’s why memory limits should always be set.

In short — exceeding CPU limits is throttling (no immediate termination); exceeding memory limits is OOMKilled (immediate termination).

JVM and Go runtime cgroup awareness

Some runtimes need more than just writing resource limits. JVM and Go are the representative cases.

JVM

Older JVMs looked at the host’s CPU count and total memory to decide worker thread count, heap size, GC thread count, and so on. They were blind to the container’s cgroup limits: a JVM inside a container with limits.cpu: 500m would see the host’s 32 cores, size its GC and worker thread pools for 32 cores, and promptly get throttled. This was a widespread source of incidents.

-XX:+UseContainerSupport was introduced in Java 10 and backported to Java 8u191, and it is enabled by default in those versions. (Java 8u131 had only the experimental -XX:+UseCGroupMemoryLimitForHeap flag.) With container support on, the JVM recognizes cgroup CPU and memory limits and decides thread count and heap size accordingly. If the operational container image is on an older JDK, upgrading is the safer route; failing that, set the heap size explicitly with -Xmx.

Go

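The Go runtime’s GOMAXPROCS (the number of OS threads that can execute Go code simultaneously) has historically defaulted to runtime.NumCPU(), the number of logical CPUs visible to the process, which in a container is normally the host’s core count. Before Go 1.25 the runtime did not take cgroup CPU limits into account: a Go process in a container with limits.cpu: 500m on a 32-core host starts with GOMAXPROCS=32 and immediately starts hitting throttling. From Go 1.25 the default GOMAXPROCS on Linux also considers the cgroup CPU limit, so the remedies below mainly matter for binaries built with older toolchains.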

Two standard remedies:

  • automaxprocs library — importing the go.uber.org/automaxprocs package reads cgroup CPU limits at process start and matches GOMAXPROCS automatically. Close to the operational standard pattern.
  • Manual environment variable — setting GOMAXPROCS directly in the Pod manifest’s env.
Manual GOMAXPROCS for Go container
env:
  - name: GOMAXPROCS
    value: "1"

Other language runtimes can have similar traps. It’s worth verifying whether things like Node.js’s libuv thread pool size or Python’s multiprocessing.cpu_count() are reading from the host or from the cgroup.
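
To check what a runtime should be seeing, the cgroup value itself can be read. The sketch below assumes cgroup v2, where the container's CPU limit is exposed in the cpu.max file (typically /sys/fs/cgroup/cpu.max) as "<quota> <period>", or "max <period>" when no limit is set. The function names and the round-up policy are illustrative choices; real libraries pick their own rounding.

```python
import math
from typing import Optional


def cpu_limit_from_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max content into a CPU count, or None if unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None  # no CPU limit set on this cgroup
    return int(quota) / int(period)


def suggested_parallelism(cpu_max_text: str, host_cpus: int) -> int:
    """Thread count to use: the cgroup limit rounded up, capped by host CPUs."""
    limit = cpu_limit_from_cpu_max(cpu_max_text)
    if limit is None:
        return host_cpus
    return max(1, min(host_cpus, math.ceil(limit)))


print(suggested_parallelism("50000 100000", host_cpus=32))   # 1  (limit 500m)
print(suggested_parallelism("400000 100000", host_cpus=32))  # 4  (limit 4 cores)
print(suggested_parallelism("max 100000", host_cpus=32))     # 32 (no limit)
```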

The subtlety of measuring memory usage vs memory limits

How you measure memory usage shapes your intuition about when OOMKilled triggers. The cgroup charges both anonymous memory (roughly RSS) and page cache to the container, so heavy file I/O that fills the page cache counts toward the limit as well, although the kernel tries to reclaim cache before it resorts to the OOM Killer. The value kubectl top shows is the working set (total usage minus inactive file cache), so it can read lower than the total the cgroup is charging.
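
The relationship between the two numbers is a subtraction. The sketch below follows the way cAdvisor derives working set (usage minus inactive file cache, floored at zero); the function name and the sample figures are hypothetical.

```python
MI = 1024 ** 2


def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Working set: total charged memory minus inactive (easily reclaimable) file cache."""
    return max(0, usage_bytes - inactive_file_bytes)


# Hypothetical container: 900 MiB charged in total, 300 MiB of it inactive page cache.
usage = 900 * MI
inactive_file = 300 * MI

print(working_set_bytes(usage, inactive_file) // MI)  # 600
print(usage // MI)                                    # 900
```

A dashboard built on working set would show 600 MiB here while the cgroup is charging 900 MiB, so it is worth knowing which of the two a given graph is plotting.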

If OOMKilled keeps repeating in operation, it’s safe to look in this order:

  1. Check the OOMKilled fact and count via the Last State in kubectl describe pod.
  2. Check normal usage via kubectl top pod --containers.
  3. Check time series via cAdvisor metrics or Prometheus’s container_memory_working_set_bytes and container_memory_rss.
  4. Examine both an application-level memory leak possibility and raising limits.

LimitRange — namespace-level defaults

Writing requests / limits in every manifest is something people easily forget. K8s provides the LimitRange object to prevent this oversight. It’s an object that locks down defaults and allowed ranges per namespace.

limitrange-default.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resource-limits
  namespace: dev
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "2Gi"
      min:
        cpu: "50m"
        memory: "64Mi"

The meaning of each field:

| Field | Meaning |
| --- | --- |
| default | Default automatically given when the container has no limits |
| defaultRequest | Default automatically given when the container has no requests |
| max | Per-container resource cap. Manifests exceeding this value are rejected |
| min | Per-container resource floor. Manifests below this value are rejected |

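With this LimitRange applied to the dev namespace, if someone creates a Pod whose containers are missing requests / limits, K8s automatically fills in the default / defaultRequest values at admission, so accidentally creating a BestEffort QoS Pod is blocked. Conversely, if a container asks for more than max, the Pod is rejected at creation (for a Deployment, the rejection shows up in the ReplicaSet’s events rather than at kubectl apply), which blocks the incident of one person’s mistake taking the entire node’s resources. LimitRange only affects Pods created after it exists; running Pods are not changed.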
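
The defaulting and validation can be modeled roughly as follows. This is a simplified sketch, not the admission controller's code: `admit` and `LIMIT_RANGE` are hypothetical names, CPU is in millicores and memory in MiB so that values compare as plain integers, and the numbers mirror the manifest above.

```python
LIMIT_RANGE = {
    "default": {"cpu": 500, "memory": 512},         # filled in when limits are missing
    "defaultRequest": {"cpu": 100, "memory": 128},  # filled in when requests are missing
    "max": {"cpu": 2000, "memory": 2048},
    "min": {"cpu": 50, "memory": 64},
}


def admit(container: dict, lr: dict = LIMIT_RANGE) -> dict:
    """Apply defaults, then validate against min/max. Raises ValueError on rejection."""
    limits = dict(container.get("limits", {}))
    requests = dict(container.get("requests", {}))
    for res in ("cpu", "memory"):
        if res not in requests:
            # An explicit limit without a request makes the request equal the limit.
            requests[res] = limits[res] if res in limits else lr["defaultRequest"][res]
        if res not in limits:
            limits[res] = lr["default"][res]
        if limits[res] > lr["max"][res]:
            raise ValueError(f"{res} limit {limits[res]} exceeds max {lr['max'][res]}")
        if requests[res] < lr["min"][res]:
            raise ValueError(f"{res} request {requests[res]} is below min {lr['min'][res]}")
        if requests[res] > limits[res]:
            raise ValueError(f"{res} request {requests[res]} exceeds limit {limits[res]}")
    return {"requests": requests, "limits": limits}


print(admit({}))  # an empty resources block receives all four defaults
```

One consequence worth noticing in this model: a container that sets only a request larger than the default limit is rejected, because the defaulted limit ends up below the request.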

The operational pattern usually goes:

  • dev namespace — small default and small max. So developers can spin things up lightly.
  • stage / prod namespace — set generous defaults matching workload characteristics, but cap max so a single container can’t take the whole node.

ResourceQuota — namespace-level total caps

If LimitRange is per-container policy, ResourceQuota is the namespace-wide total policy: an object that prevents the sum of requests / limits across all Pods in a namespace from exceeding the configured values.

resourcequota-dev.yaml — short example
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"

With this ResourceQuota applied, the dev namespace can’t have its total requests.cpu exceed 10 cores or have more than 50 Pods. A new Pod that would break one of these caps is rejected at creation.
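
The quota check amounts to "current usage plus the new Pod must stay within each hard value". A minimal sketch, assuming CPU in millicores and memory in MiB; `quota_violations`, `HARD`, and the usage figures are hypothetical.

```python
HARD = {
    "requests.cpu": 10_000,        # 10 cores
    "requests.memory": 20 * 1024,  # 20Gi in MiB
    "limits.cpu": 20_000,
    "limits.memory": 40 * 1024,
    "pods": 50,
}


def quota_violations(used: dict, new_pod: dict, hard: dict = HARD) -> list[str]:
    """Return the quota keys that admitting new_pod would push over the hard cap."""
    return [key for key, cap in hard.items()
            if used.get(key, 0) + new_pod.get(key, 0) > cap]


# Hypothetical namespace state: 9.8 cores already requested across 12 Pods.
used = {"requests.cpu": 9_800, "requests.memory": 8 * 1024,
        "limits.cpu": 15_000, "limits.memory": 16 * 1024, "pods": 12}
new_pod = {"requests.cpu": 250, "requests.memory": 512,
           "limits.cpu": 1_000, "limits.memory": 1024, "pods": 1}

print(quota_violations(used, new_pod))  # ['requests.cpu']
```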

Pairing ResourceQuota with LimitRange is common. Once a quota covers requests.cpu, limits.memory, and so on, K8s rejects any new Pod that doesn’t specify those values, because it has nothing to count against the quota. A LimitRange that fills in defaults keeps manifests with missing requests / limits from being rejected for that reason, and also guarantees no BestEffort Pods slip into the namespace.

The full use of ResourceQuota is covered in #7 RBAC / NetworkPolicy / ResourceQuota — bundling it with security and resource policy reads more naturally.

Operational pattern — how to start and how to adjust

There’s no way to know the exact right requests / limits values when first deploying a new workload. The pattern that has emerged in practice follows this order:

  1. Start with conservative requests + generous limits — the first step is estimation. Reference past data of similar workloads, or get rough numbers from local load tests. requests at 70-80% of normal usage and limits at two to three times that is a common starting point.
  2. Collect operational data — gather a few days of per-container CPU and memory time series from Prometheus + metrics-server, or the time series APMs like Datadog / New Relic expose. Look at p95 / p99 usage during peak traffic windows.
  3. Use the VPA recommender — running Vertical Pod Autoscaler with updateMode: Off lets you receive only recommended values without actually changing resources. K8s learns workload characteristics and proposes appropriate requests / limits. VPA’s behavior is covered deeply in #6.
  4. Adjust and redeploy — combine recommended values and monitoring data to update the manifest’s requests / limits, reflected in the next deployment. Raising requests requires new Pods to come up, so it usually flows naturally via rolling update.
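
Steps 1 and 2 can be sketched numerically. The helpers below are illustrative only: `percentile` uses the simple nearest-rank method, `starting_values` reads "two to three times that" as a multiple of the request (an assumption), and the usage samples are made up.

```python
import math


def percentile(samples: list[int], pct: float) -> int:
    """Nearest-rank percentile of a list of usage samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def starting_values(typical: int, request_pct: int = 80, limit_multiplier: int = 2) -> tuple[int, int]:
    """First-guess (request, limit) from a typical usage figure."""
    request = typical * request_pct // 100
    return request, request * limit_multiplier


# Hypothetical CPU usage samples for one container, in millicores.
cpu_samples = [180, 220, 200, 450, 210, 190, 230, 205, 215, 600]

print(percentile(cpu_samples, 50))  # 210, the typical level
print(percentile(cpu_samples, 95))  # 600, the peak a limit has to leave room for
print(starting_values(200))         # (160, 320)
```

Comparing the p95 figure with the first-guess limit is the useful part: here the peak of 600m is well above a 320m limit, which is the kind of gap the adjustment step is meant to catch.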

Running this cycle once per workload makes a substantial difference in cluster-wide resource efficiency and stability. Rather than trying to get it perfect from day one, launching with a reasonable estimate and adjusting from real data is the faster path.

The next post’s topic, liveness / readiness probes, also intersects directly with resource pressure — when a Pod’s response slows due to throttling, or it stalls in a GC pause just before OOM, how the probe detects that state determines how quickly the workload recovers.

Summary

The flow held in this post:

  • requests and limits play different roles — requests is the guaranteed amount the scheduler sees when picking a node; limits is the runtime cap kubelet enforces via cgroup. They are policies at different layers.
  • Units — CPU is core and millicore (1, 500m, 100m). Memory binary (Mi, Gi) is the operational standard. 1Gi and 1G differ by 7%.
  • Four combinations — writing both is the standard. Writing only limits gets treated as requests = limits, becoming Guaranteed. Omitting both becomes BestEffort, top of the eviction list.
  • QoS classes — Guaranteed (requests = limits) / Burstable (in between) / BestEffort (neither). Under resource pressure, eviction in the order BestEffort → Burstable → Guaranteed.
  • Exceeding CPU limits is throttling — no immediate termination, common cause of response latency spikes. The operational pattern of intentionally omitting only CPU limits also exists.
  • Exceeding memory limits is OOMKilled — immediate forced termination. Last State: Terminated + Reason: OOMKilled + Exit Code: 137 in kubectl describe pod is the signal.
  • JVM recognizes cgroups via -XX:+UseContainerSupport (default-on in Java 10+ and 8u191+). Go before 1.25 doesn’t factor cgroup CPU limits into GOMAXPROCS, so the automaxprocs library or a manual env setting is needed there.
  • LimitRange — namespace-level defaults (default / defaultRequest) and allowed range (min / max). Auto-applied to manifests missing requests / limits.
  • ResourceQuota — namespace-wide total cap. Pattern of pairing with LimitRange. Detailed use in #7.
  • Operational cycle — start with conservative requests + generous limits, adjust via monitoring and VPA recommender, reflect via redeploy.

Once this model is in hand, whenever you encounter a resources block in a manifest, you can read at a glance which QoS class that container belongs to and how it will behave under node resource pressure.

Next — Health checks (liveness / readiness / startup probes)

What this post covered was the model of how much resource a container receives. The next post moves the viewpoint from resources to whether the container is alive — the model of how K8s figures out whether a container is operating normally and how it detects abnormal states to start recovery actions.

#5 Health checks — liveness / readiness / startup probes walks through the three kinds of probes in one cycle. liveness probe is the signal that triggers container restart, readiness probe is the signal that adds and removes from Service endpoints, and startup probe is the signal that gives a grace period to slow-starting containers. How the three probes’ responsibilities differ, when to use which of HTTP / TCP / exec for the check, what tuning parameters like initialDelaySeconds / periodSeconds / failureThreshold mean, and how it intertwines with this post’s resource model — all in the shape of one manifest.
