Chapter 11

resources.requests / limits

A walkthrough of the model of how a container requests CPU and memory and how it's given an upper bound. The separation of requests and limits, the QoS classes (Guaranteed · Burstable · BestEffort), the difference in behavior between CPU throttling and memory OOMKilled, the cgroup awareness of the JVM · Go runtimes, the namespace policies of LimitRange · ResourceQuota, and the operational cycle of setting initial values and adjusting them.

Up through Chapter 10, Ingress and the Ingress Controller, the viewpoint was outside the cluster: the model of how external traffic reaches the workloads inside the cluster through Service, Ingress, and the Ingress Controller. This chapter returns to the inside of the Pod. It covers how the container that receives that traffic and does the work requests CPU and memory and is given an upper bound, that is, the story of resources.requests and resources.limits. Separating these two fields is what lets K8s handle scheduling and stability at the same time.

By the end of this chapter, you’ll be able to look at the resources block of a manifest and tell at a glance which QoS class the container gets and how it behaves under node resource pressure.

requests and limits — the two values play different roles #

Two fields express the resource model in a K8s manifest: a container’s resources.requests and resources.limits. The two look similar, but who reads them and when they are read are completely different.

Field	Who looks at it	When it’s looked at	Meaning
`resources.requests`	The scheduler (kube-scheduler)	When deciding which node to put the Pod on	The minimum resource that must be guaranteed for this container to stay up
`resources.limits`	kubelet (cgroup)	While the container is actually running	The upper bound this container can never exceed

When the scheduler decides where to place a new Pod, it subtracts the sum of the requests of the already running Pods from each candidate node’s allocatable resources (the node’s total resources minus the share reserved for system daemons and kubelet). Only nodes where the new Pod’s requests fit within that remaining amount become candidates. limits do not enter this decision. Even if the total limits on a node exceed allocatable, Kubernetes still schedules the Pod there. This is called overcommit, and it uses node resources efficiently on the statistical assumption that not every container peaks at the same time.
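The fit check described above can be sketched in a few lines. This is a simplified model, not the kube-scheduler’s code: the `Resources` type, the `remaining` and `fits` helpers, and the node sizes are all hypothetical, and real scheduling also weighs affinity, taints, and other filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_millicores: int
    memory_mib: int


def remaining(allocatable: Resources, running_requests: list[Resources]) -> Resources:
    """Allocatable minus the summed requests of Pods already on the node."""
    used_cpu = sum(r.cpu_millicores for r in running_requests)
    used_mem = sum(r.memory_mib for r in running_requests)
    return Resources(allocatable.cpu_millicores - used_cpu,
                     allocatable.memory_mib - used_mem)


def fits(allocatable: Resources, running_requests: list[Resources],
         new_request: Resources) -> bool:
    """Only requests are compared; limits play no part in this decision."""
    free = remaining(allocatable, running_requests)
    return (new_request.cpu_millicores <= free.cpu_millicores
            and new_request.memory_mib <= free.memory_mib)


# Hypothetical node: 4 cores / 8 GiB allocatable, three Pods already placed.
node = Resources(4000, 8192)
running = [Resources(1000, 2048)] * 3
print(fits(node, running, Resources(250, 512)))   # True: 1000m and 2048Mi remain
print(fits(node, running, Resources(1500, 512)))  # False: only 1000m of CPU remains
```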

limits works at the next layer. Once the Pod is assigned to a node and the container comes up, kubelet sets the limits value on that container’s cgroup. If the container tries to exceed that bound, the Linux kernel forcibly blocks it. At this point the behavior splits by resource type — CPU is throttling (briefly denied compute), memory is OOMKilled (the container is force-terminated). This difference is covered separately later.

Put simply, requests is “the amount that must be guaranteed,” so scheduling looks at it, and limits is “the amount that must never be exceeded,” so the runtime enforces it. That you can write these two values differently is the core of the K8s resource model.

The units of CPU and memory #

A part that often confuses people in manifests is the unit notation. CPU and memory each use a different notation.

CPU — cores and millicores #

CPU is in core units. 1 means 1 core and 2 means 2 cores. To split smaller than one core, you use millicore notation.

Notation	Meaning
`1`	1 core (1000 millicore)
`500m`	0.5 core
`250m`	0.25 core
`100m`	0.1 core
`0.5`	Same as 500m

In operational manifests, writing CPU as a millicore integer like 100m or 250m is common. The pattern is to avoid decimal notation like 0.1 because it leaves room for confusion at the YAML parsing step. For limits, the CPU value maps to the container cgroup’s CPU quota: 100m means the container gets 10ms of CPU time per 100ms cycle. For requests, it maps to a relative CPU weight that decides how CPU time is shared when the node is contended.
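To make the millicore arithmetic concrete, here is a small sketch that converts a CPU quantity to millicores and then to the quota a CPU limit would get per scheduling period. The function names are hypothetical, the parser handles only the notations shown in the table above, and the 100ms period is the usual default rather than a guarantee.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert '250m', '1' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def cfs_quota_us(limit: str, period_us: int = 100_000) -> int:
    """CPU time (microseconds) a CPU limit allows per period."""
    return cpu_to_millicores(limit) * period_us // 1000


print(cpu_to_millicores("0.5"))  # 500
print(cfs_quota_us("100m"))      # 10000, i.e. 10ms per 100ms period
print(cfs_quota_us("1"))         # 100000, i.e. the whole period on one core
```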

Memory — binary vs decimal #

Memory has two families of unit suffixes, which is a frequent cause of operational accidents.

Notation	Value	Note
`1Ki`	1024 bytes	Binary
`1Mi`	1024 KiB = 1,048,576 bytes	Binary
`1Gi`	1024 MiB = 1,073,741,824 bytes	Binary
`1k`	1000 bytes	Decimal (the kilo suffix is a lowercase `k`)
`1M`	1,000,000 bytes	Decimal
`1G`	1,000,000,000 bytes	Decimal

1Gi and 1G differ by about 7% (1GiB is larger). The standard for operational manifests is the binary suffixes (Mi, Gi). It’s because the unit in which the container runtime and the OS handle memory is binary, and the values that tools like kubectl top display are binary too. The accident where you wrote 1G but the usage display reads 0.93Gi comes from a unit mismatch.
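The gap between the two suffix families is easy to verify with a few lines of arithmetic. The parser below is a minimal sketch with a hypothetical name; it handles only integer values with the six suffixes from the table, not the full Kubernetes quantity grammar.

```python
_FACTORS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,   # binary
    "k": 1000, "M": 1000**2, "G": 1000**3,      # decimal
}


def memory_to_bytes(quantity: str) -> int:
    """Convert '512Mi', '1G' or a plain byte count to bytes."""
    for suffix in ("Ki", "Mi", "Gi", "k", "M", "G"):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * _FACTORS[suffix]
    return int(quantity)


gi = memory_to_bytes("1Gi")
g = memory_to_bytes("1G")
print(gi, g)             # 1073741824 1000000000
print(round(g / gi, 2))  # 0.93: a limit written as 1G is only about 0.93Gi
```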

A single manifest #

Let’s apply the two units above directly to a manifest. The standard shape is to put the resources key in the container definition inside the Deployment’s Pod template.

deployment-with-resources.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: app
          image: myapp/web:1.4.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"

This container can stay up only if 0.25 core and 512 MiB of memory are guaranteed. When the scheduler receives a new Pod, it places it only on a node with that much headroom. While it’s up, it can use up to 1 core and 1 GiB of memory, and an attempt to exceed that bound is forcibly blocked by the cgroup.

The resources field is per container — if there are several containers in one Pod, you write it separately per container. The Pod’s overall requests / limits are computed as the sum of the containers’. If there’s a sidecar container (e.g., a log collector), don’t forget to write small requests / limits on that container too. The node agents · log collectors run as DaemonSets in Chapter 8 likewise need resource notation.

check resource usage

kubectl top pod
kubectl top pod -n <namespace> --containers

example output

NAME                   CPU(cores)   MEMORY(bytes)
web-7d4f8b9c5-abc12    180m         420Mi
web-7d4f8b9c5-def34    220m         480Mi

kubectl top works only if metrics-server is installed in the cluster. The displayed values are the actual usage of the container cgroup, so the units are binary.

The four requests / limits combinations #

The behavior splits greatly depending on how you write the two in the manifest. We organize the four combinations in one table.

Combination	Behavior	QoS	Operational suitability
Both written	The safest. Both the scheduling guarantee + the runtime upper bound are clear	Guaranteed if requests = limits, otherwise Burstable	Recommended
Only `requests` written	There’s a scheduling guarantee but no runtime upper bound. The container can potentially occupy the node’s entire resources	Burstable	Only when circumstances require dropping limits
Only `limits` written	K8s treats it as `requests = limits`. As a result the most conservative shape	Guaranteed	Fine, but writing it explicitly is recommended
Neither written	No subtraction at scheduling time. No runtime upper bound	BestEffort	Not recommended

The trap you hit most often is the case where neither is written. This container becomes BestEffort QoS, and when the node comes under resource pressure it’s the first target of eviction. The scheduler also sees this Pod’s resources as 0 when choosing a node, so it becomes easy for a bunch of BestEffort Pods to pile onto one node. In operational manifests it’s safer to always write requests / limits, however small.

The pattern of writing only requests and dropping limits is used only when the intent is clear — because CPU limits’ throttling behavior increases response latency, some operators intentionally drop limits on CPU only (more on this later). But for memory, if you drop limits, faulty code can eat up the node’s memory, so you almost always write it.

The QoS classes — Guaranteed / Burstable / BestEffort #

K8s looks at a Pod’s requests / limits shape and classifies it into three tiers of QoS class. This classification decides who gets evicted first under node resource pressure.

QoS	Condition	Eviction priority
`Guaranteed`	`requests == limits` for every resource of every container, and both specified	Last (safest)
`Burstable`	Only requests, or requests / limits differ, or written on only some containers	Middle
`BestEffort`	No requests / limits on any container	First (most at risk)

Eviction is the behavior where, when a signal like the node’s memory · disk pressure crosses a threshold, kubelet force-terminates Pods to reclaim resources. In practice the candidates come out roughly in the order BestEffort → Burstable → Guaranteed: kubelet first picks Pods whose usage exceeds their requests (a BestEffort Pod always does), then compares Pod priority, and then how far usage exceeds requests.
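The classification conditions in the table above can be written down as a short function. This is an illustrative sketch, not the kubelet’s implementation: `qos_class` is a hypothetical name, containers are plain dicts, and values are compared as literal strings, so "1" and "1000m" would count as different here.

```python
def qos_class(containers: list[dict]) -> str:
    """containers: [{"requests": {...}, "limits": {...}}, ...]"""
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits") or {}
        # A missing request defaults to the limit of the same resource.
        requests = {**limits, **(c.get("requests") or {})}
        for resource in ("cpu", "memory"):
            if resource not in limits or requests[resource] != limits[resource]:
                return "Burstable"
    return "Guaranteed"


web = {"requests": {"cpu": "250m", "memory": "512Mi"},
       "limits": {"cpu": "1", "memory": "1Gi"}}
print(qos_class([web]))                                        # Burstable
print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{}]))                                         # BestEffort
```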

check a Pod's QoS class

kubectl get pod web-7d4f8b9c5-abc12 -o jsonpath='{.status.qosClass}'

example output

Burstable

The standard operational patterns are as follows.

Stateful core workloads like a DB · message queue — set them to Guaranteed. Write requests = limits to minimize the chance of eviction.
General stateless web / API servers — Burstable. Put the amount you usually use as requests and the burstable upper bound as limits.
Batch / temporary workloads — Burstable or BestEffort. Workloads that may yield first when cluster resources run short.

There’s hardly an occasion to use BestEffort in operations, but a temporary Pod brought up for short-term debugging sits roughly in that spot.

The trap of CPU limit — throttling #

This is where many operational accidents originate. CPU and memory behave completely differently when they exceed limits.

A CPU limit is enforced by throttling. The container cgroup’s CPU quota is allocated only up to limits every cycle (usually 100ms), and once the container uses up that amount, it gets no compute until the next cycle comes. The container does not die — it just stalls briefly and wakes up again in the next cycle.

For example, assume there’s a container with cpu: limits: 100m. This container can get only 10ms of CPU time every 100ms cycle. But if a single request needs 50ms of CPU — that request is handled by using the first 10ms, waiting 90ms, using 10ms again, waiting 90ms, and so on. Work that would originally have finished in 50ms takes about 410ms.
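The 410ms figure follows from simple arithmetic, sketched here. The model is deliberately idealized and the function name is hypothetical: it assumes a single-threaded request that starts at the beginning of a period and has the container’s whole quota to itself.

```python
import math


def wall_time_ms(cpu_needed_ms: float, limit_millicores: int,
                 period_ms: float = 100.0) -> float:
    """Wall-clock time to finish a CPU-bound task under a CPU limit."""
    quota_ms = period_ms * limit_millicores / 1000
    # Every period except the last is fully waited out after the quota is spent.
    full_periods = math.ceil(cpu_needed_ms / quota_ms) - 1
    last_slice_ms = cpu_needed_ms - full_periods * quota_ms
    return full_periods * period_ms + last_slice_ms


print(wall_time_ms(50, 100))   # 410.0: four full periods plus a final 10ms slice
print(wall_time_ms(50, 1000))  # 50.0: the quota covers the whole request
```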

The accident this behavior most commonly causes in operations is a surge in response latency. The telltale pattern is average CPU utilization far below limits while p99 response time suddenly spikes: a short-term burst hit limits and throttling kicked in. You can check the accumulated throttling in the cAdvisor metric container_cpu_cfs_throttled_seconds_total or in the nr_throttled counter of the container cgroup’s cpu.stat file; kubectl describe node does not show it. The finished diagnostic flow is organized in Chapter 27, kubectl debugging patterns.

Because of this burden, an operational pattern of intentionally dropping CPU limits also exists. You pin only the guaranteed amount with requests, and let it burst freely above that when the node has headroom. This pattern is used when the following two conditions hold.

The node’s resources have plenty of headroom, and even when workloads burst against each other at a reasonable level the node isn’t shaken.
requests are set reasonably so that one workload’s runaway doesn’t encroach on another workload’s guaranteed amount.

Conversely, there’s hardly a pattern of dropping limits for memory — a memory runaway endangers the whole node. We look at that behavior next.

The trap of memory limit — OOMKilled #

A memory limit is not throttling but a hard cap. If a container tries to allocate memory beyond the limit, the Linux kernel’s OOM Killer immediately force-terminates that container’s process. K8s detects this termination and records the container’s termination reason as OOMKilled.

check OOMKilled

kubectl describe pod web-7d4f8b9c5-abc12

example output — Last State excerpt

Containers:
  app:
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 18 May 2026 14:22:10 +0900
      Finished:     Mon, 18 May 2026 14:35:42 +0900
    Restart Count:  3

The shape where Reason: OOMKilled and Exit Code: 137 (SIGKILL) appear as a pair is typical. If Restart Count climbs quickly and the same reason repeats, it’s a signal that the memory limit is set smaller than the workload’s actual usage. The finished version of the OOMKilled diagnostic tree is organized in Chapter 27, kubectl debugging patterns.
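The pairing of 137 with SIGKILL comes from the shell convention that a process killed by signal N reports exit code 128 + N. A small sketch, assuming a POSIX system and a hypothetical helper name:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a container exit code using the 128 + signal convention."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exited with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL
print(describe_exit_code(1))    # exited with status 1
```

Exit code 137 alone only says the process received SIGKILL; it is the Reason: OOMKilled field that attributes the kill to the memory limit.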

What happens if you drop memory limits? The container cgroup ends up with no memory bound, and faulty code (a memory leak, loading a large file wholesale into memory, etc.) can eat up the node’s entire memory. Then node-level memory pressure occurs, and the other Pods on that node become eviction targets in the order BestEffort → Burstable → Guaranteed. It’s the shape where one workload’s accident shakes even the other workloads on the same node. This is exactly why you always write memory limits.

In summary — exceeding CPU limits is throttling (not immediate termination), exceeding memory limits is OOMKilled (immediate force-termination).

The cgroup awareness of the JVM and Go runtimes #

There are runtimes for which writing resource limits alone isn’t enough. The JVM and Go are the representative cases.

JVM #

The old JVM looked at the host’s CPU count and physical memory to decide the number of worker threads, the heap size, the number of GC threads, and so on. It couldn’t see the container’s cgroup limits, so a common accident was a JVM inside a container with cpu: limits: 500m seeing the host’s 32 cores, sizing its GC and worker thread pools for 32 cores, and getting hit with throttling.

-XX:+UseContainerSupport was introduced in Java 10 and backported to Java 8u191, and it is enabled by default wherever it exists. (Java 8u131 offered only the experimental -XX:+UseCGroupMemoryLimitForHeap.) When this option is on, the JVM recognizes the cgroup’s CPU · memory limits and decides the thread count and heap size. If your production container image uses a JDK older than these versions, the flag is not available, so the safe path is to move to a newer base image.

Go #

The Go runtime’s GOMAXPROCS (the number of OS threads that can run Go code in parallel) has traditionally defaulted to runtime.NumCPU(), which reflects the CPUs visible to the process (in practice the host’s core count) and ignores cgroup CPU limits. Go 1.25 and later take the cgroup CPU limit into account on Linux when choosing the default, but on earlier versions the pattern arises where a Go process in a container with cpu: limits: 500m comes up with GOMAXPROCS=32 on a 32-core host and gets hit with throttling.

There are two standard solutions.

The automaxprocs library — importing the go.uber.org/automaxprocs package reads the cgroup CPU limits at process start and adjusts GOMAXPROCS automatically. It’s a pattern close to the operational standard.
Manual environment variable — set GOMAXPROCS directly in the Pod manifest’s env.

manually set a Go container's GOMAXPROCS

env:
  - name: GOMAXPROCS
    value: "1"

Other language runtimes may have similar traps. It’s safer to check once whether parts like Node.js’s libuv thread-pool size and Python’s multiprocessing.cpu_count() are set on a host basis or a cgroup basis.
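For runtimes that size themselves from the host, one way to do the check by hand is to read the cgroup CPU limit directly. The sketch below parses the cgroup v2 cpu.max format ("<quota> <period>", or "max <period>" when unlimited); the helper names are hypothetical, and it assumes cgroup v2, where the file is normally visible inside the container at /sys/fs/cgroup/cpu.max.

```python
import math
from typing import Optional


def cpu_limit_from_cpu_max(content: str) -> Optional[float]:
    """Return the CPU limit in cores, or None when no limit is set."""
    quota, period = content.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def suggested_workers(content: str, host_cpus: int) -> int:
    """Worker count capped by the cgroup limit, never below 1."""
    limit = cpu_limit_from_cpu_max(content)
    if limit is None:
        return host_cpus
    return max(1, min(host_cpus, math.ceil(limit)))


print(suggested_workers("50000 100000", host_cpus=32))   # 1 (limit of 500m)
print(suggested_workers("200000 100000", host_cpus=32))  # 2
print(suggested_workers("max 100000", host_cpus=32))     # 32 (no limit)
```

Inside a container, the first argument would come from reading the cpu.max file and `host_cpus` from `os.cpu_count()`.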

The subtlety of measuring memory usage vs memory limits #

Depending on how you measure memory usage, your intuition about the moment of OOMKilled comes out differently. The memory the cgroup charges against limits is roughly RSS (Resident Set Size) plus page cache, so if the file I/O the container handles fills the page cache, that counts toward limits too. The value kubectl top displays is the working set (total usage minus the inactive file cache, which the kernel can reclaim easily), so it may not match the usage the kernel sees right up to OOM.
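A worked example of the gap between the two numbers, as a minimal sketch. The function name and all byte values are made up for illustration; the formula is the usual working-set definition of total usage minus inactive file cache, floored at zero.

```python
MIB = 1024 * 1024


def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Working set: cgroup usage minus easily reclaimable file cache."""
    return max(0, usage_bytes - inactive_file_bytes)


# Hypothetical container with a 1Gi limit doing heavy file reads.
limit = 1024 * MIB
usage = 900 * MIB          # what the cgroup has charged in total
inactive_file = 300 * MIB  # page cache the kernel can drop under pressure

ws = working_set_bytes(usage, inactive_file)
print(ws // MIB)               # 600: roughly what kubectl top would report
print((limit - usage) // MIB)  # 124: headroom before the kernel must reclaim
```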

If OOMKilled repeats in operations, it’s safe to look at it in the following order.

Check the fact of OOMKilled and its count with kubectl describe pod’s Last State.
Check the usual usage with kubectl top pod --containers.
Check the time series with cAdvisor metrics or Prometheus’s container_memory_working_set_bytes, container_memory_rss.
Review both the possibility of an application-level memory leak and raising limits, together.

The time-series model of step 3 is covered in Chapter 19, Observability.

LimitRange — namespace-level defaults #

Writing requests / limits one by one in every manifest is easy for a human to forget. K8s provides LimitRange as the object that prevents this forgetting. As one axis of the namespace-level policy noted in Chapter 7, Namespace and labels, it’s an object that sets defaults and an allowed range.

limitrange-default.yaml

apiVersion: v1
kind: LimitRange
metadata:
  name: default-resource-limits
  namespace: dev
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "2Gi"
      min:
        cpu: "50m"
        memory: "64Mi"

The meaning of each field is as follows.

Field	Meaning
`default`	The default `limits` automatically assigned if the container has none
`defaultRequest`	The default `requests` automatically assigned if the container has none
`max`	The upper bound of resources for a single container. A manifest exceeding this value is rejected
`min`	The lower bound of resources for a single container. A manifest below this value is rejected

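With this LimitRange applied to the dev namespace, if someone applies a manifest that left out requests / limits, K8s automatically fills in the default / defaultRequest values when the Pod is created. The accident of mistakenly creating a BestEffort QoS Pod is cut off. Conversely, if a container asks for more than max, the Pod is rejected at creation time, which also prevents one person’s mistake from occupying the node’s entire resources. The check runs on Pods: a Deployment that exceeds max is itself accepted, and the rejection shows up in the events of its ReplicaSet. A LimitRange also affects only Pods created after it; Pods that already exist are not changed.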
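The fill-in and rejection behavior can be modeled roughly as follows. This is a simplified sketch, not the admission controller’s code: `apply_limit_range` and `DEV_LIMIT_RANGE` are hypothetical names, and quantities are plain integers (millicores for CPU, MiB for memory) mirroring the limitrange-default.yaml values.

```python
DEV_LIMIT_RANGE = {
    "default": {"cpu": 500, "memory": 512},
    "defaultRequest": {"cpu": 100, "memory": 128},
    "max": {"cpu": 2000, "memory": 2048},
    "min": {"cpu": 50, "memory": 64},
}


def apply_limit_range(container: dict, limit_range: dict) -> dict:
    """Fill in missing requests/limits, then enforce min and max."""
    requests = dict(container.get("requests", {}))
    limits = dict(container.get("limits", {}))
    for resource, default_request in limit_range["defaultRequest"].items():
        if resource not in requests:
            # A container that set only a limit gets requests = limits.
            requests[resource] = limits.get(resource, default_request)
    for resource, default_limit in limit_range["default"].items():
        limits.setdefault(resource, default_limit)
    for resource, maximum in limit_range["max"].items():
        if limits[resource] > maximum:
            raise ValueError(f"{resource} limit {limits[resource]} exceeds max {maximum}")
    for resource, minimum in limit_range["min"].items():
        if requests[resource] < minimum:
            raise ValueError(f"{resource} request {requests[resource]} is below min {minimum}")
    return {"requests": requests, "limits": limits}


print(apply_limit_range({}, DEV_LIMIT_RANGE))
# {'requests': {'cpu': 100, 'memory': 128}, 'limits': {'cpu': 500, 'memory': 512}}
```

Passing a container with a CPU limit of 4000 raises `ValueError`, which corresponds to the Pod being rejected.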

The operational pattern is usually this.

The dev namespace — set small defaults and a small max. Let developers bring things up lightly.
The stage · prod namespaces — set defaults generously to fit the workload characteristics, but limit max so that one container can’t occupy a whole node.

ResourceQuota — namespace-level aggregate cap #

If LimitRange is a policy per single container, then ResourceQuota is a policy for the aggregate of the whole namespace. It’s an object that prevents the sum of the requests / limits of all Pods inside one namespace from exceeding this value.

resourcequota-dev.yaml — short example

apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"

In the dev namespace with this ResourceQuota applied, the sum of all Pods’ requests.cpu can’t exceed 10 cores, and the number of Pods can’t exceed 50. If creating a new Pod would break one of these bounds, the API server rejects the Pod.
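The aggregate check amounts to adding the new Pod’s totals to what the namespace already uses and comparing against each hard value. A minimal sketch with hypothetical names and usage numbers, assuming the Pod specifies every tracked value and using integer millicores and MiB:

```python
HARD = {
    "requests.cpu": 10_000, "requests.memory": 20 * 1024,
    "limits.cpu": 20_000, "limits.memory": 40 * 1024,
    "pods": 50,
}


def admits(hard: dict, used: dict, pod: dict) -> bool:
    """True if current usage plus the new Pod stays within every hard cap."""
    return all(used.get(key, 0) + pod[key] <= cap for key, cap in hard.items())


# Hypothetical current usage of the dev namespace.
used = {"requests.cpu": 9_500, "requests.memory": 10 * 1024,
        "limits.cpu": 15_000, "limits.memory": 20 * 1024, "pods": 30}

small = {"requests.cpu": 250, "requests.memory": 512,
         "limits.cpu": 1000, "limits.memory": 1024, "pods": 1}
large = {**small, "requests.cpu": 750}

print(admits(HARD, used, small))  # True: 9750m of the 10000m cap
print(admits(HARD, used, large))  # False: 10250m would exceed requests.cpu
```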

ResourceQuota is often set paired with LimitRange. Once a quota constrains a value such as requests.cpu or limits.memory, every new Pod in the namespace must specify that value, and a Pod that leaves it out is rejected rather than counted as 0. With a LimitRange filling in the missing requests / limits, such manifests are still admitted and are counted against the quota.

The in-earnest use of ResourceQuota is covered in Chapter 14, RBAC / NetworkPolicy / ResourceQuota — it’s more natural to bundle it as one axis of the security and resource policy.

The operational pattern — how to start and how to adjust #

When first setting a new workload’s requests / limits, there’s no way to know the exact values. The pattern settled in operations follows this order.

Start with conservative requests + generous limits: the first step is an estimate. Reference the past data of similar workloads, or get a rough value with a local load test. requests at 70 ~ 80% of the usual usage and limits at about two to three times that is a common starting point.
Collect operational data: gather a few days of the per-container CPU · memory time series from Prometheus (scraping the kubelet’s cAdvisor metrics) or an APM like Datadog / New Relic. metrics-server keeps only the most recent values, so it cannot supply this history on its own. Look at the p95 / p99 usage during the busiest hours.
Use the VPA recommender: running the Vertical Pod Autoscaler with updateMode: "Off" lets you receive only recommended values without actually changing resources. The recommender learns the workload characteristics from usage history and proposes appropriate requests / limits. The VPA’s behavior is covered in depth in Chapter 13, Autoscaling.
Adjust and redeploy: combine the recommended values and the monitoring data to update the manifest’s requests / limits, reflected in the next deploy. Since changing requests in the Pod template means new Pods have to come up, it usually becomes a rolling update.
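The starting-point rule from the first step is simple enough to compute directly. This sketch is only one reading of it: the function name is hypothetical, the default fraction and multiplier are picked from the middle of the quoted ranges, and it assumes the multiplier applies to requests.

```python
def initial_resources(usual_usage: float,
                      request_fraction: float = 0.75,
                      limit_multiplier: float = 2.5) -> tuple[int, int]:
    """Return (requests, limits) in the same unit as usual_usage."""
    request = round(usual_usage * request_fraction)
    limit = round(request * limit_multiplier)
    return request, limit


# A workload that usually uses about 400m of CPU and 600Mi of memory.
print(initial_resources(400))  # (300, 750)
print(initial_resources(600))  # (450, 1125)
```

These numbers are only a first guess to be replaced once a few days of p95 / p99 data are available.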

Running this cycle just once per workload greatly changes the cluster’s overall resource efficiency and stability. Rather than spending time getting it right from the start, it’s faster to bring it up quickly with a reasonable value and adjust by data. The flow of refining this cycle from an operational-cost standpoint is organized in Chapter 28, Cost optimization.

The next chapter’s subject, liveness / readiness probes, ties directly into resource pressure too — when a Pod becomes slow to respond due to throttling, or stalls in memory GC right before OOM, the workload’s recovery behavior changes depending on how the probe detects that state.

Exercises #

After applying the main text’s deployment-with-resources.yaml unchanged, check the QoS class with kubectl get pod <name> -o jsonpath='{.status.qosClass}'. Next, organize in a table how the QoS changes when you keep only limits and drop requests, and when you drop both requests and limits. Note it in one paragraph, matching the model of §“The four requests / limits combinations.”
Deliberately create a scenario where CPU usage frequently hits limits — run a load that does a lot of compute in a short time (e.g., stress-ng --cpu 1 --timeout 60s) in a container with cpu: limits: 100m. Record in time order how the CPU value of kubectl top pod and the accumulation of the cAdvisor metric container_cpu_cfs_throttled_seconds_total move, and summarize in one paragraph, in your own words, the model of §“The trap of CPU limit,” that the container doesn’t die but only the response latency surges.
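After applying a LimitRange to the dev namespace, try applying two things: a Deployment that left out requests / limits, and a Deployment that exceeds max. Record how K8s responds in each case (auto-fill vs rejection): check the filled-in values with kubectl describe pod, and look for the rejection in the events of the Deployment’s ReplicaSet (kubectl describe replicaset), since the Deployment itself is accepted and only its Pods are refused. Then reason out in one paragraph how §“LimitRange” and Chapter 14’s ResourceQuota pair up.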

In one line: requests is the guaranteed amount the scheduler looks at when choosing a node, and limits is the runtime upper bound kubelet enforces with the cgroup. The combination of the two decides the QoS class (Guaranteed · Burstable · BestEffort), which becomes the eviction priority under node resource pressure. Exceeding CPU limits is throttling (a surge in response latency) and exceeding memory limits is OOMKilled (immediate termination), so the behavior differs, and runtimes like the JVM · Go need their cgroup awareness handled separately. The namespace-level policies pair LimitRange (per-container defaults and range) with ResourceQuota (aggregate cap).

Next chapter #

What we’ve covered up through this chapter was the model of the amount of resources a container receives. The next chapter’s subject moves the viewpoint from resources to the container’s aliveness — the model of how K8s figures out whether a container is operating normally, and how it detects an abnormal state and starts a recovery action.

Chapter 12, Health checks organizes the three kinds of probe together. The liveness probe is the signal that triggers a container restart, the readiness probe is the signal for removing from and adding to a Service’s endpoints, and the startup probe is the signal that gives a grace period to a container slow to start. We follow, in the shape of a single manifest, how the three probes’ responsibilities differ, which of the three check methods HTTP / TCP / exec to use and when, the meaning of tuning parameters like initialDelaySeconds / periodSeconds / failureThreshold, and how it ties into this chapter’s resource model.