Docker Advanced #5: Resource Limits and cgroups


Thursday, April 23, 2026


So far, container resource use has been treated as something the host figures out for you. That stops working in production — one container eats the host’s memory and kills other services, or pegs the CPU and adds latency. This post tackles resource limits in earnest.

Posts in the Docker Advanced series:

#1 BuildKit and buildx
#2 Multi-architecture images
#3 Image security — non-root, distroless, scan (Trivy)
#4 SBOM and signing (cosign)
#5 Resource limits and cgroups ← this post
#6 Production operations — restart policy, healthcheck, graceful shutdown

cgroups — one axis of container isolation #

As briefly noted in #1, cgroups (control groups) are the Linux kernel’s feature for accounting and limiting resource use. Namespaces isolate what a container can see; cgroups limit what it can consume, which is what makes running many containers on one host safe.

There are two generations.

	cgroups v1	cgroups v2
Released	2007	2016
Layout	One hierarchy per resource	Single unified hierarchy
Memory accounting	Partial	Accurate
Docker support	Long	Stable on 20.10+

Most modern Linux distros default to v2. Docker Desktop too. This post assumes v2.

Check:

cgroups version

stat -fc %T /sys/fs/cgroup
# cgroup2fs   ← v2
# tmpfs       ← v1 (older systems)

docker info | grep Cgroup
# Cgroup Driver: systemd
# Cgroup Version: 2
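
The same check can be done from a script. This is a minimal sketch, assuming a Linux host and relying on the `cgroup.controllers` file that the kernel exposes at the top of a v2 hierarchy; the function name `cgroup_version` is made up for illustration.

```python
from pathlib import Path


def cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 if `root` looks like a cgroup v2 unified hierarchy, else 1."""
    # v2 exposes cgroup.controllers at the top of the hierarchy; a v1 mount
    # has per-controller subdirectories (memory/, cpu/, ...) instead.
    return 2 if (Path(root) / "cgroup.controllers").is_file() else 1


if __name__ == "__main__":
    print(f"cgroup v{cgroup_version()}")
```

Anything that is not recognizably v2 is reported as 1, including a missing path, so treat the result as meaningful on Linux only.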

Memory limits — `--memory` #

The most-used knob.

docker run

docker run -d --memory 512m myapp
docker run -d -m 512m myapp        # short

compose.yaml

services:
  web:
    image: myapp
    mem_limit: 512m         # or deploy.resources.limits.memory (Swarm)
    mem_reservation: 256m   # soft limit

`mem_limit` vs. `mem_reservation` #

Option	Meaning
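`mem_limit`	Hard limit: when usage exceeds it and nothing can be reclaimed, the kernel OOM-kills a process in the container
`mem_reservation`	Soft limit: under host memory pressure, the kernel reclaims first from containers running above their reservation; not a guarantee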

In production, only mem_limit is usually set. mem_reservation matters in multi-tenant scenarios with many containers per host.

Unit syntax #

Units

512        # bytes (default)
512b       # bytes
512k       # kilobytes (1024 bytes)
512m       # megabytes
2g         # gigabytes

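m is megabytes here (binary: 512m is 512 × 1024 × 1024 bytes). Don’t confuse it with Kubernetes’ 500m, which means 500 millicores (0.5 CPU). The small values above only illustrate the syntax: Docker rejects memory limits below 6m.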
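
To make the unit arithmetic concrete, here is a small sketch that converts the suffixes listed above to bytes. It covers only the single-letter forms from this post, not every spelling Docker accepts, and `parse_size` is an illustrative name.

```python
import re

# Binary multipliers for the suffixes shown in this post.
_UNITS = {"": 1, "b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text: str) -> int:
    """Convert a Docker-style size such as '512m' or '2g' to bytes."""
    match = re.fullmatch(r"(\d+)([bkmg]?)", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]
```

For example, `parse_size("512m")` gives 536870912, the byte value that `docker inspect` reports for the limit.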

Swap #

--memory-swap

docker run -m 512m --memory-swap 1g myapp
# RAM 512m + swap (1g - 512m = 512m) = 1g total

docker run -m 512m --memory-swap -1 myapp
# Unlimited swap (until host limits)

docker run -m 512m --memory-swap 512m myapp
# No swap allowed (RAM limit is the total limit)

In production it’s typical to disable swap entirely on the host. Swap makes performance harder to predict.
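
The three `--memory-swap` cases reduce to one subtraction. The sketch below restates them in Python under the assumption that both values are already in bytes; `swap_allowance` is a hypothetical helper, not part of any Docker API.

```python
from typing import Optional

MIB = 1024**2


def swap_allowance(memory: int, memory_swap: int) -> Optional[int]:
    """Return the swap bytes allowed by --memory / --memory-swap (None = unlimited)."""
    if memory_swap == -1:
        return None  # -1 lifts the swap cap; the host is the only bound
    if memory_swap < memory:
        raise ValueError("--memory-swap must be at least --memory")
    # --memory-swap is the RAM + swap total, so swap is the difference.
    return memory_swap - memory
```

With `-m 512m --memory-swap 1g` this returns 512 MiB, and with both set to 512m it returns 0.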

OOMKilled — what happens past the limit #

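When the processes in a container exceed its memory limit and the kernel cannot reclaim enough, the kernel’s OOM killer kills a process in that cgroup with SIGKILL. If that was the main process, the container exits and Docker records it as OOMKilled.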

Diagnose OOMKilled

docker inspect myapp --format '{{.State.OOMKilled}}'
# true

docker inspect myapp --format '{{.State.ExitCode}}'
# 137   ← SIGKILL (128 + 9)

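Exit code 137 alone only means the process received SIGKILL, which docker kill or a docker stop timeout also produce; together with OOMKilled: true it is the OOM signature. The host’s dmesg confirms: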

dmesg log

sudo dmesg | grep -i 'killed process'
# Memory cgroup out of memory: Killed process 12345 (python) ...
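
The 128 + signal convention can be decoded mechanically. The following is a sketch of that decoding, assuming a Unix-like system where Python’s `signal` module knows the signal numbers; `describe_exit` and its wording are illustrative.

```python
import signal


def describe_exit(exit_code: int, oom_killed: bool) -> str:
    """Explain a container exit code, treating values above 128 as 128 + signal."""
    if exit_code <= 128:
        return f"exited with status {exit_code}"
    try:
        name = signal.Signals(exit_code - 128).name
    except ValueError:
        name = f"signal {exit_code - 128}"
    cause = "cgroup OOM kill" if oom_killed else "not flagged as OOM"
    return f"killed by {name} ({cause})"
```

Feeding it the two `docker inspect` values keeps a plain `docker kill` (137, not flagged) apart from a real OOM kill (137, flagged).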

If OOMKilled is frequent in production:

Limit too small — measure and raise it
App memory leak — track growth over time
Runtime doesn’t see the limit — next section

A container’s memory perception — runtime traps #

Inside a container, reading free or /proc/meminfo shows the host’s memory. The cgroup limit lives elsewhere.

Inside the container

docker run --rm -m 512m ubuntu free -m
#               total        used        free
# Mem:          15920         542       14253     ← host memory

Why this matters: some runtimes size their heap and worker pools from the host’s total memory (the same number free reports), then blow past the cgroup limit and get OOMKilled.
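
A process that wants the real limit has to read it from the cgroup filesystem. This sketch assumes cgroups v2 with the container’s own cgroup mounted at `/sys/fs/cgroup`, where the limit is in the `memory.max` file; the function name is illustrative.

```python
from pathlib import Path
from typing import Optional


def container_memory_limit(cgroup_dir: str = "/sys/fs/cgroup") -> Optional[int]:
    """Return the cgroup v2 memory limit in bytes, or None when unlimited or unreadable."""
    try:
        raw = (Path(cgroup_dir) / "memory.max").read_text().strip()
    except OSError:
        return None
    # The kernel writes the literal string "max" when no limit is set.
    return None if raw == "max" else int(raw)
```

Inside a container started with `-m 512m`, this should report 536870912 while `free` keeps showing the host total.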

Java (JVM) #

JVM limit perception

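# Old (JDK 8 before 8u191): default heap is sized from host memory → frequent OOMKills
# A fixed -Xmx avoids that, but is absolute and must be kept in sync with the limit by hand
java -Xmx2g -jar app.jar

# JDK 10+ (and 8u191+): -XX:+UseContainerSupport (default) → reads cgroup limit
java -XX:MaxRAMPercentage=75.0 -jar app.jar

JDK 10+ (and 8u191+) enables UseContainerSupport by default. Prefer MaxRAMPercentage over -Xmx in containers: it is a percentage of the container limit instead of an absolute size, so the heap follows the limit when you change it.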
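
As a rough worked example of percentage-of-limit sizing, the arithmetic looks like this. It is only an approximation of what the JVM does (the real value is subject to alignment and other flags), and `max_heap_bytes` is a made-up helper.

```python
MIB = 1024**2


def max_heap_bytes(limit_bytes: int, max_ram_percentage: float) -> int:
    """Approximate the heap ceiling implied by a container limit and MaxRAMPercentage."""
    return int(limit_bytes * max_ram_percentage / 100)


# A 512m container at 75% leaves about 128 MiB for metaspace, stacks and native memory.
heap = max_heap_bytes(512 * MIB, 75.0)
headroom = 512 * MIB - heap
```

Raising the container limit to 1g moves the heap ceiling to about 768 MiB without touching the Java command line.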

Node.js #

Node has the same problem. V8’s default old-space limit depends on the Node version and architecture (somewhere around 1.5–4GB) and may not match your container limit.

Node — set memory limit explicitly

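node --max-old-space-size=384 app.js

The value is in MiB. If the container limit is 512m, set Node’s old-space limit somewhat below it (about 75% here): buffers, stacks, and native allocations also count against the cgroup limit.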

Python #

CPython has no heap-size setting to tune, so there is nothing explicit to set for memory. However, places like multiprocessing decide worker counts via os.cpu_count(), which returns the host’s core count, not the container’s. Set the worker count explicitly via env vars.
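
One way to do that is sketched below. It assumes a hypothetical `WORKERS` environment variable and, as a fallback, the cgroup v2 `cpu.max` file content (`<quota> <period>`, or `max <period>` when unlimited); neither name comes from a specific framework.

```python
import math
import os
from typing import Mapping, Optional


def cpu_quota(cpu_max: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max content into a core count (None = unlimited)."""
    quota, period = cpu_max.split()
    return None if quota == "max" else int(quota) / int(period)


def worker_count(cpu_max: str, env: Optional[Mapping[str, str]] = None) -> int:
    """Pick a worker count: explicit WORKERS env var first, then the cgroup quota."""
    env = os.environ if env is None else env
    if "WORKERS" in env:  # hypothetical variable name
        return int(env["WORKERS"])
    quota = cpu_quota(cpu_max)
    if quota is None:
        return os.cpu_count() or 1  # no limit: the host count is the right answer
    return max(1, math.ceil(quota))
```

Inside the container, the first argument would come from reading `/sys/fs/cgroup/cpu.max`; passing it in as a string keeps the logic easy to test.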

CPU limits — `--cpus` / `--cpu-shares` #

CPU comes in two flavors.

CPU limits

# 1) Absolute — one core's worth
docker run --cpus 1.0 myapp

# 2) Absolute — 1.5 cores (one core full + half of another)
docker run --cpus 1.5 myapp

# 3) Relative weight — versus other containers
docker run --cpu-shares 512 myapp

Option	Meaning
`--cpus N`	Absolute CPU available (in cores)
`--cpu-shares`	Relative weight (default 1024). Distribution under contention.
`--cpuset-cpus 0-2`	Pin to specific cores (e.g., 0,1,2)

In production, --cpus for an absolute limit is more predictable. --cpu-shares only matters when you’re prioritizing among containers on one host.
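
To see what a relative weight means in practice, this sketch splits a host’s cores among containers that are all CPU-bound at the same time. It is an idealized model with a made-up function name; when a container is idle, its share is available to the others.

```python
from typing import Dict


def cpu_share_split(shares: Dict[str, int], cores: float) -> Dict[str, float]:
    """Split `cores` among busy containers in proportion to their --cpu-shares."""
    total = sum(shares.values())
    return {name: cores * weight / total for name, weight in shares.items()}


# Default weight (1024) versus --cpu-shares 512 on a fully loaded 3-core host.
split = cpu_share_split({"api": 1024, "batch": 512}, cores=3.0)
```

Here `api` ends up with 2.0 cores and `batch` with 1.0, but only while both are competing for CPU.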

How CFS quota actually behaves #

--cpus 1.0 translates to a CFS (Completely Fair Scheduler) quota: 100ms of CPU time per 100ms period. The quota is shared by all threads in the container, which sometimes causes unexpected throttling: four busy threads use up the 100ms in 25ms of wall time, and the container then sits idle for the remaining 75ms of the period.

There’s an active opinion in K8s circles that CFS quotas (cpu.cfs_quota_us / cpu.cfs_period_us on cgroups v1, cpu.max on v2) do more harm than good, and some setups intentionally skip CPU limits (memory limits stay). For Docker on a single host, setting --cpus is normal.
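
The quota arithmetic can be sketched in a few lines. This assumes the default 100ms period and enough host cores for every thread to run at once; both helper names are illustrative.

```python
def cpus_to_cpu_max(cpus: float, period_us: int = 100_000) -> str:
    """Render --cpus as cgroup v2 cpu.max content: '<quota_us> <period_us>'."""
    return f"{round(cpus * period_us)} {period_us}"


def throttled_ms(cpus: float, busy_threads: int, period_ms: float = 100.0) -> float:
    """Time per period spent throttled when `busy_threads` all run flat out."""
    quota_ms = cpus * period_ms
    runtime_ms = quota_ms / busy_threads  # wall time until the shared quota is used up
    return max(0.0, period_ms - runtime_ms)
```

`throttled_ms(1.0, 4)` is 75.0: with `--cpus 1.0`, four busy threads stall for three quarters of every period, which shows up as latency spikes rather than as high CPU usage.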

A container’s CPU perception #

Runtimes like JVM / Node / Go pick GC threads / worker counts based on the core count. If the runtime sees the host’s count rather than the container’s limit, it starts more threads than the quota can feed, which makes throttling worse.

Inside the container

docker run --rm --cpus 0.5 alpine nproc
# 8           ← host core count, ignores limit

Fixes:

JDK 10+ (and 8u191+): UseContainerSupport handles it automatically
Node: set UV_THREADPOOL_SIZE to control the thread pool
Go: Go 1.25+ sizes GOMAXPROCS from the cgroup CPU limit by default; on older versions, use automaxprocs to make runtime.GOMAXPROCS honor cgroup limits
Python: pass worker count via env

Go's automaxprocs

// go.mod
require go.uber.org/automaxprocs v1.5.3

// main.go
import _ "go.uber.org/automaxprocs"

A single import line and GOMAXPROCS auto-aligns to the cgroup limit.

Resources in `compose.yaml` #

Compose v2 accepts two forms.

Simple form

services:
  web:
    image: myapp
    mem_limit: 512m
    mem_reservation: 256m
    cpus: 1.5
    pids_limit: 100

deploy form (Swarm-compatible)

services:
  web:
    image: myapp
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '1.5'
        reservations:
          memory: 256M
          cpus: '0.5'

For plain docker compose up, the simple form works. deploy.resources comes from Swarm; Docker Compose v2 also applies its limits on a single host, while the legacy docker-compose v1 ignored deploy unless run with --compatibility. For single-host operation, the simple form is less confusing.

`pids_limit` — runaway process protection #

Effective against fork bombs and zombie accumulation.

PID limit

docker run --pids-limit 100 myapp

compose

services:
  web:
    pids_limit: 100

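A web app rarely needs to spawn 100 processes, but the limit counts threads as well, so measure thread-heavy runtimes (JVM, Go) before picking a number. Capping prevents accidental runaway at the container level.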

`ulimit` — file descriptors and friends #

Linux ulimit is settable per container. The most common is open file descriptors (nofile).

docker run

docker run --ulimit nofile=65536:65536 myapp

compose

services:
  web:
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
      nproc: 4096

High-traffic servers and long-lived connection workers find a soft limit of 1024 (a common default) too small. Raising it in production is common.
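
To confirm from inside the process that the setting arrived, the standard `resource` module (Unix only) reports the limits the process actually runs with. A minimal sketch; `has_fd_headroom` is an illustrative helper.

```python
import resource
from typing import Tuple


def nofile_limits() -> Tuple[int, int]:
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_fd_headroom(required: int) -> bool:
    """True if the soft nofile limit allows at least `required` descriptors."""
    soft, _hard = nofile_limits()
    return soft == resource.RLIM_INFINITY or soft >= required
```

Calling `has_fd_headroom(65536)` at startup and logging a warning when it fails catches a missing `--ulimit` before traffic does.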

IO limits — `--device-write-bps` and friends #

Block IO can be capped via cgroups too. Not used as often, but useful in multi-tenant hosts where one container’s disk IO would impact others.

IO limits

docker run --device-write-bps /dev/sda:10mb myapp
# Cap this container's /dev/sda write speed at 10MB/s

Not a typical knob in single-container resource definitions.

Measuring resources — `docker stats` again #

The command from Intermediate #6. Reach for it when you want to see the effect of your limits.

Real-time usage

docker stats myapp
# CONTAINER     CPU %    MEM USAGE / LIMIT     MEM %     NET I/O    BLOCK I/O
# myapp-web-1   24.5%    312MiB / 512MiB       60.93%    12kB / 8kB   ...

If MEM % regularly sits above 70–80%, the limit is too small or there’s a leak. OOMKills happen suddenly, so checking the margin via stats day-to-day is safer.
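
That margin check is easy to script. The sketch below assumes the output of `docker stats --no-stream --format '{{.Name}} {{.MemPerc}}'` is passed in as text; the 80% threshold and the function name are arbitrary choices for illustration.

```python
from typing import List


def over_threshold(stats_output: str, threshold: float = 80.0) -> List[str]:
    """Return container names whose MEM % is at or above `threshold`.

    Expects lines of the form '<name> <percent>%'.
    """
    flagged = []
    for line in stats_output.splitlines():
        if not line.strip():
            continue
        name, percent = line.split()
        if float(percent.rstrip("%")) >= threshold:
            flagged.append(name)
    return flagged
```

Given `"web 60.93%\nworker 91.20%\n"`, only `worker` is returned.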

`prometheus` / cAdvisor #

In production, you don’t watch stats by eye; you collect the data in a time-series DB. cAdvisor exposes Docker’s cgroup accounting as Prometheus metrics.

Add cadvisor to compose

services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"

Pair with Prometheus + Grafana for per-container CPU / memory / IO graphs. It is a reasonable first monitoring setup for Docker-only operation.

OOMKilled diagnostic flow #

When you see OOMKilled, run this:

Diagnostic sequence

# 1) Was it really OOMKilled?
docker inspect <c> --format '{{.State.OOMKilled}} {{.State.ExitCode}}'

# 2) Host dmesg record
sudo dmesg -T | grep -i oom

# 3) The limit
docker inspect <c> --format '{{.HostConfig.Memory}}'

# 4) Typical usage (live or from monitoring)
docker stats <c> --no-stream

# 5) Does the runtime see the limit (e.g., JVM)?
docker exec <c> java -XshowSettings:vm -version 2>&1 | grep -i 'max. heap size'

Getting this flow into muscle memory resolves most memory incidents quickly.
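
Steps 1 and 3 read fields from the same `docker inspect` document, so they can be pulled out together. This sketch assumes the full JSON output of `docker inspect <c>` is passed in as a string; the function name and the returned keys are illustrative.

```python
import json
from typing import Any, Dict


def summarize_inspect(inspect_json: str) -> Dict[str, Any]:
    """Extract the OOM-relevant fields from `docker inspect <container>` output."""
    container = json.loads(inspect_json)[0]  # inspect prints a JSON array
    limit = container["HostConfig"]["Memory"]  # bytes; 0 means no limit was set
    return {
        "oom_killed": container["State"]["OOMKilled"],
        "exit_code": container["State"]["ExitCode"],
        "limit_mib": limit / 1024**2 if limit else None,
    }
```

A result of `oom_killed` true with a `limit_mib` far below the typical usage from step 4 points at the limit; a steadily growing usage points at a leak.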

Wrap-up #

The picture from this post:

Container resource limits run on cgroups v2 — paired with namespaces, the two axes of isolation.
--memory / mem_limit matters most. Setting limits on production containers is essentially required.
The OOMKilled signature: exit code 137 + State.OOMKilled: true.
Verify the runtime actually honors the limit — JVM MaxRAMPercentage, Node --max-old-space-size, Go automaxprocs.
CPU: --cpus for absolute limits, --cpu-shares for relative weight.
pids_limit and ulimit nofile show up often in production stability work.
Measurement: docker stats → cAdvisor + Prometheus + Grafana.

In the next post (#6 Production operations) we wrap up Docker Advanced. PID 1 signal handling, SIGTERM graceful shutdown, restart policies in depth, healthcheck from an operations angle, liveness vs. readiness — the details that keep a container stable in production.
