Docker Advanced #5: Resource Limits and cgroups
So far, container resource use has been treated as something the host figures out for you. That stops working in production — one container eats the host’s memory and kills other services, or pegs the CPU and adds latency. This post tackles resource limits in earnest.
This post in the Docker Advanced series:
- #1 BuildKit and buildx
- #2 Multi-architecture images
- #3 Image security — non-root, distroless, scan (Trivy)
- #4 SBOM and signing (cosign)
- #5 Resource limits and cgroups ← this post
- #6 Production operations — restart policy, healthcheck, graceful shutdown
cgroups — one axis of container isolation #
Briefly noted in #1, cgroups (control groups) are the Linux kernel’s resource-accounting and limiting feature. Namespaces make containers light; cgroups make running them safely possible.
There are two generations.
| | cgroups v1 | cgroups v2 |
|---|---|---|
| Released | 2007 | 2016 |
| Layout | One hierarchy per resource | Single unified hierarchy |
| Memory accounting | Partial | Accurate |
| Docker support | Long-standing | Stable on 20.10+ |
Most modern Linux distros default to v2. Docker Desktop too. This post assumes v2.
Check:
stat -fc %T /sys/fs/cgroup
# cgroup2fs ← v2
# tmpfs ← v1 (older systems)
docker info | grep Cgroup
# Cgroup Driver: systemd
# Cgroup Version: 2
Memory limits — --memory #
The most-used knob.
docker run -d --memory 512m myapp
docker run -d -m 512m myapp   # short

services:
  web:
    image: myapp
    mem_limit: 512m        # or deploy.resources.limits.memory (Swarm)
    mem_reservation: 256m  # soft limit
mem_limit vs. mem_reservation #
| Option | Meaning |
|---|---|
| mem_limit | Hard limit. Exceed it and the container is OOMKilled |
| mem_reservation | Soft limit. When the host is under memory pressure, the kernel reclaims first from containers running above their reservation |
In production, only mem_limit is usually set. mem_reservation matters in multi-tenant scenarios with many containers per host.
Unit syntax #
512 # bytes (default)
512b # bytes
512k # kilobytes (1024 bytes)
512m # megabytes
2g # gigabytes
m is megabytes here. Don’t confuse with K8s’ 500m (0.5 cpu).
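To make the unit table concrete, here is a minimal Python sketch that converts these strings to bytes. It is a simplified illustration, not Docker's own parser: it assumes whole numbers and the single-letter suffixes listed above, and `parse_mem` is a name invented for this example.

```python
import re

# Binary multipliers, matching the unit table above (k = 1024 bytes).
_UNITS = {"": 1, "b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def parse_mem(value: str) -> int:
    """Convert a Docker-style size such as '512m' or '2g' to bytes."""
    match = re.fullmatch(r"(\d+)([bkmg]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]
```

For example, `parse_mem("512m")` gives 536870912, which is the byte value `docker inspect` reports in `HostConfig.Memory` for a 512m limit.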
Swap #
docker run -m 512m --memory-swap 1g myapp
# RAM 512m + swap (1g - 512m = 512m) = 1g total
docker run -m 512m --memory-swap -1 myapp
# Unlimited swap (until host limits)
docker run -m 512m --memory-swap 512m myapp
# No swap allowed (RAM limit is the total limit)
In production it’s typical to disable swap entirely on the host. Swap makes performance harder to predict.
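The arithmetic behind the three cases above can be sketched in a few lines of Python. This only models the cases shown in this post; `swap_allowance` is a hypothetical helper, and both arguments are assumed to be already converted to bytes.

```python
from typing import Optional

MIB = 1024 ** 2

def swap_allowance(memory: int, memory_swap: int) -> Optional[int]:
    """Return the swap bytes a container may use, or None for unlimited.

    memory      -- value of --memory in bytes
    memory_swap -- value of --memory-swap in bytes, or -1
    """
    if memory_swap == -1:
        return None                      # unlimited swap (host permitting)
    if memory_swap < memory:
        raise ValueError("--memory-swap must not be smaller than --memory")
    return memory_swap - memory          # equal values leave 0 bytes of swap
```

The point to remember is that --memory-swap is the total of RAM plus swap, not the swap amount on its own.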
OOMKilled — what happens past the limit #
When a container exceeds its memory limit, it ends as OOMKilled.
docker inspect myapp --format '{{.State.OOMKilled}}'
# true
docker inspect myapp --format '{{.State.ExitCode}}'
# 137 ← SIGKILL (128 + 9)
Exit code 137 only says the process received SIGKILL, so read it as the OOMKilled signature only together with State.OOMKilled: true. The host’s dmesg confirms:
sudo dmesg | grep -i 'killed process'
# Memory cgroup out of memory: Killed process 12345 (python) ...
If OOMKilled is frequent in production:
- Limit too small — measure and raise it
- App memory leak — track growth over time
- Runtime doesn’t see the limit — next section
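The 128 + 9 arithmetic generalizes to any signal. As a small sketch (assuming a Linux host, where Python's `signal` module knows the signal numbers; `describe_exit` is a name made up for this example):

```python
import signal

def describe_exit(code: int) -> str:
    """Explain a container exit code using the 128 + signal convention."""
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            return f"exit code {code} (no matching signal)"
        return f"killed by {name} (128 + {code - 128})"
    return f"exited on its own with status {code}"
```

A 143 decodes to SIGTERM the same way, which is what a clean `docker stop` usually produces.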
A container’s memory perception — runtime traps #
Inside a container, reading free or /proc/meminfo shows the host’s memory. The cgroup limit lives elsewhere.
docker run --rm -m 512m ubuntu free -m
# total used free
# Mem: 15920 542 14253 ← host memory
Why this matters: some runtimes read the host total (the number free shows, or what Runtime.maxMemory is derived from), size themselves to the host, then blow past the cgroup limit and get OOMKilled.
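On a cgroups v2 host, the limit a process is actually subject to can be read from the memory.max file. A minimal sketch, assuming the container sees its own cgroup at /sys/fs/cgroup; the function names are illustrative:

```python
from pathlib import Path
from typing import Optional

def parse_memory_max(raw: str) -> Optional[int]:
    """Parse the contents of memory.max: a byte count, or 'max' for no limit."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)

def read_memory_limit(cgroup_root: str = "/sys/fs/cgroup") -> Optional[int]:
    """Read the limit from inside a container on a cgroups v2 host."""
    return parse_memory_max((Path(cgroup_root) / "memory.max").read_text())
```

Run inside a container started with -m 512m, `read_memory_limit()` should report 536870912 while free still shows the host total.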
Java (JVM) #
# Old (early JVM 8): based on host memory → frequent OOMKills
java -Xmx2g app.jar
# JVM 10+: -XX:+UseContainerSupport (default) → reads cgroup limit
java -XX:MaxRAMPercentage=75.0 app.jar
JVM 10+ enables UseContainerSupport by default. Prefer MaxRAMPercentage over -Xmx in containers: it is a percentage of the limit instead of an absolute size.
Node.js #
Node has the same shape. V8’s default old-space limit (somewhere around 1.5–4GB, depending on the Node version) may not match your container limit.
node --max-old-space-size=384 app.js
If the container limit is 512m, keep Node’s old-space limit somewhat below it: buffers, thread stacks, and native memory count against the cgroup too.
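One way to keep the flag in step with the container limit is to derive it. The sketch below is only an illustration: the 75% headroom ratio is an assumption borrowed from the JVM example, not a Node.js default, and `old_space_mb` is a hypothetical helper.

```python
MIB = 1024 ** 2

def old_space_mb(limit_bytes: int, fraction: float = 0.75) -> int:
    """Suggest a --max-old-space-size value (in MiB) for a given memory limit.

    fraction is an assumed headroom ratio, not a Node.js default.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(limit_bytes * fraction) // MIB

print(f"node --max-old-space-size={old_space_mb(512 * MIB)} app.js")
```

The remaining quarter of the limit is left for memory outside the V8 heap; measure your own app before settling on a ratio.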
Python #
CPython has a simple GC and nothing explicit to set. However, places like multiprocessing decide worker counts via os.cpu_count(), which returns the host’s core count — not the container’s. Set worker count via env vars explicitly.
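A minimal sketch of the env-var approach follows. `WORKERS` is a variable name invented for this example, and the default of 2 is an arbitrary assumption.

```python
import os

def worker_count(env=os.environ, var: str = "WORKERS", default: int = 2) -> int:
    """Read the worker count from an env var instead of os.cpu_count().

    WORKERS is a hypothetical variable name chosen for this sketch.
    """
    raw = env.get(var)
    if raw is None:
        return default
    count = int(raw)
    if count < 1:
        raise ValueError(f"{var} must be >= 1, got {count}")
    return count
```

Started with `docker run -e WORKERS=2 --cpus 2 myapp`, the app can then call `multiprocessing.Pool(processes=worker_count())` and stay aligned with the CPU limit chosen at deploy time.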
CPU limits — --cpus / --cpu-shares #
CPU comes in two flavors.
# 1) Absolute — one core's worth
docker run --cpus 1.0 myapp
# 2) Absolute — 1.5 cores (one core full + half of another)
docker run --cpus 1.5 myapp
# 3) Relative weight — versus other containers
docker run --cpu-shares 512 myapp

| Option | Meaning |
|---|---|
| --cpus N | Absolute CPU available (in cores) |
| --cpu-shares | Relative weight (default 1024). Distribution under contention. |
| --cpuset-cpus 0-2 | Pin to specific cores (e.g., 0,1,2) |
In production, --cpus for an absolute limit is more predictable. --cpu-shares only matters when you’re prioritizing among containers on one host.
How CFS quota actually behaves #
--cpus 1.0 translates to a CFS (Completely Fair Scheduler) quota: 100ms of CPU time per 100ms window. The quota is shared by every thread in the container, so a multi-threaded burst can spend it early in the window and then sit throttled until the next window starts. That shows up as unexpected latency.
There’s an active opinion in K8s circles that CFS quotas (cpu.cfs_period_us / cpu.cfs_quota_us on cgroups v1, the single cpu.max file on v2) are impractical, and some setups intentionally skip CPU limits (memory limits stay). For Docker on a single host, setting --cpus is normal.
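The quota arithmetic is easy to work through by hand. The sketch below assumes the default 100ms period and a simplified model in which every thread is fully busy on its own core; both function names are illustrative.

```python
def cfs_quota_us(cpus: float, period_us: int = 100_000) -> int:
    """CPU time (microseconds) allowed per period for a given --cpus value."""
    return int(cpus * period_us)

def throttled_us(threads: int, cpus: float, period_us: int = 100_000) -> float:
    """Wall-clock time per period that `threads` busy threads spend throttled."""
    quota = cfs_quota_us(cpus, period_us)
    busy_wall = quota / threads          # the quota is shared by all running threads
    return max(0.0, period_us - busy_wall)
```

With --cpus 1.0 and four busy threads, the 100ms quota is gone after 25ms of wall-clock time, so `throttled_us(4, 1.0)` is 75000.0: the container is paused for the remaining 75ms of every window.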
A container’s CPU perception #
Runtimes like JVM / Node / Go pick GC threads / worker counts based on the core count. If os.cpu_count() returns the host count, you’re misaligned.
docker run --rm --cpus 0.5 alpine nproc
# 8 ← host core count, ignores limit
Fixes:
- JVM 10+: UseContainerSupport handles it automatically
- Node: set UV_THREADPOOL_SIZE to control the thread pool
- Go: use automaxprocs to make runtime.GOMAXPROCS honor cgroup limits
- Python: pass worker count via env
# go.mod
require go.uber.org/automaxprocs v1.5.3
# main.go
import _ "go.uber.org/automaxprocs"
A single import line and GOMAXPROCS auto-aligns to the cgroup limit.
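The same idea can be sketched for runtimes without such a library. The Python below reads the cgroups v2 cpu.max format ("quota period", or "max period" when unlimited) and derives a worker count. Rounding down with a floor of 1 is a choice made for this sketch, not a description of what automaxprocs does internally.

```python
import math
from typing import Optional

def parse_cpu_max(raw: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max ('<quota> <period>' or 'max <period>') into cores."""
    quota, period = raw.split()
    return None if quota == "max" else int(quota) / int(period)

def effective_cpus(cpu_max: str, host_cpus: int) -> int:
    """Worker count that respects the quota: round down, but never below 1."""
    limit = parse_cpu_max(cpu_max)
    if limit is None:
        return host_cpus
    return max(1, min(host_cpus, math.floor(limit)))
```

Inside a container this could be called as `effective_cpus(open("/sys/fs/cgroup/cpu.max").read(), os.cpu_count())`, assuming the cgroup files are visible at that path.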
Resources in compose.yaml #
Compose v2 accepts two forms.
services:
  web:
    image: myapp
    mem_limit: 512m
    mem_reservation: 256m
    cpus: 1.5
    pids_limit: 100

services:
  web:
    image: myapp
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '1.5'
        reservations:
          memory: 256M
          cpus: '0.5'
For plain docker compose up, the simple form works. deploy.resources only fully takes effect in Swarm, though recent Compose recognizes some of it on a single host. For single-host operation, the simple form is less confusing.
pids_limit — runaway process protection #
Effective against fork bombs and zombie accumulation.
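docker run --pids-limit 100 myapp

services:
  web:
    pids_limit: 100
A web app rarely needs to spawn 100 processes. Capping prevents accidental runaway at the container level. Note that the pids controller counts threads as well as processes, so leave room for thread-heavy runtimes.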
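To see how close a container is to its cap, cgroups v2 exposes pids.current and pids.max next to the other controller files. A minimal sketch that works on the raw file contents (`pids_headroom` is a name invented for this example):

```python
from typing import Optional

def pids_headroom(current_raw: str, max_raw: str) -> Optional[int]:
    """Tasks still available before hitting pids.max, or None if unlimited.

    Arguments are the raw contents of the cgroup v2 files pids.current and pids.max.
    """
    limit = max_raw.strip()
    if limit == "max":
        return None
    return int(limit) - int(current_raw)
```

Inside the container, the two arguments would come from /sys/fs/cgroup/pids.current and /sys/fs/cgroup/pids.max, assuming the cgroup is mounted there.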
ulimit — file descriptors and friends #
Linux ulimit is settable per container. The most common is open file descriptors (nofile).
docker run --ulimit nofile=65536:65536 myapp

services:
  web:
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
      nproc: 4096
High-traffic servers and long-lived connection workers find the default 1024 too small. Raising it in production is common.
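A process can confirm which limits it actually received. In Python the standard `resource` module (Unix only) reports them; the helper name below is illustrative.

```python
import resource
from typing import Tuple

def nofile_limits() -> Tuple[int, int]:
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

soft, hard = nofile_limits()
print(f"nofile soft={soft} hard={hard}")
```

Logging these two numbers at startup is a cheap way to catch a container that was deployed without the intended --ulimit setting.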
IO limits — --device-write-bps and friends #
Block IO can be capped via cgroups too. Not used as often, but useful in multi-tenant hosts where one container’s disk IO would impact others.
docker run --device-write-bps /dev/sda:10mb myapp
# Cap this container's /dev/sda write speed at 10MB/s
Not a typical knob in single-container resource definitions.
Measuring resources — docker stats again #
The command from Intermediate #6. Reach for it when you want to see the effect of your limits.
docker stats myapp
# CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
# myapp-web-1 24.5% 312MiB / 512MiB 60.93% 12kB / 8kB ...
If MEM % regularly sits above 70–80%, the limit is too small or there’s a leak. OOMKills happen suddenly, so checking the margin with stats day to day is safer.
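The margin check is simple enough to script. The sketch below parses the MEM USAGE / LIMIT column as printed by docker stats; the 80% threshold is an assumed alert level taken from the rule of thumb above, only the B/KiB/MiB/GiB units are handled, and the function names are illustrative.

```python
import re

_UNIT_BYTES = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}

def to_bytes(text: str) -> float:
    """Convert a docker stats size such as '312MiB' to bytes."""
    match = re.fullmatch(r"([\d.]+)(B|KiB|MiB|GiB)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _UNIT_BYTES[match.group(2)]

def mem_percent(usage_limit: str) -> float:
    """Turn the 'MEM USAGE / LIMIT' column into a percentage of the limit."""
    usage, limit = (to_bytes(part) for part in usage_limit.split("/"))
    return 100.0 * usage / limit

def needs_attention(usage_limit: str, threshold: float = 80.0) -> bool:
    """True when usage is at or above the threshold (an assumed 80% here)."""
    return mem_percent(usage_limit) >= threshold
```

Fed with the output of `docker stats --no-stream`, a check like this can run from cron until proper monitoring is in place.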
Prometheus / cAdvisor #
In production, you don’t watch stats with your eyes — you push the data to a time-series DB. cAdvisor exposes Docker’s cgroup accounting as Prometheus metrics.
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
Pair with Prometheus + Grafana for per-container CPU / memory / IO graphs. It makes a reasonable first monitoring setup for Docker-only operation.
OOMKilled diagnostic flow #
When you see OOMKilled, run this:
# 1) Was it really OOMKilled?
docker inspect <c> --format '{{.State.OOMKilled}} {{.State.ExitCode}}'
# 2) Host dmesg record
sudo dmesg -T | grep -i oom
# 3) The limit
docker inspect <c> --format '{{.HostConfig.Memory}}'
# 4) Typical usage (live or from monitoring)
docker stats <c> --no-stream
# 5) Does the runtime see the limit (e.g., JVM)?
docker exec <c> java -XshowSettings:vm -version 2>&1 | grep -i 'heap size'
Getting this flow into muscle memory resolves 90% of memory incidents quickly.
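Steps 1 and 3 can be folded into one check by parsing the JSON that plain `docker inspect <c>` prints. This is a sketch rather than a complete tool: it relies only on the State.OOMKilled, State.ExitCode, and HostConfig.Memory fields used above, and `classify_exit` is a hypothetical helper.

```python
import json

def classify_exit(inspect_output: str) -> str:
    """Classify a stopped container from `docker inspect <c>` JSON output."""
    container = json.loads(inspect_output)[0]     # inspect returns a JSON array
    state = container["State"]
    limit = container["HostConfig"]["Memory"]     # bytes; 0 means no limit
    if state["OOMKilled"]:
        return f"OOMKilled (limit {limit} bytes)"
    if state["ExitCode"] == 137:
        return "SIGKILL without the OOM flag; check who sent the signal"
    return f"exit code {state['ExitCode']}"
```

The second branch is there because a 137 on its own is not proof of an OOM kill.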
Wrap-up #
The picture from this post:
- Container resource limits run on cgroups v2 — paired with namespaces, the two axes of isolation.
- --memory / mem_limit matters most. Setting limits on production containers is essentially required.
- The OOMKilled signature: exit code 137 + State.OOMKilled: true.
- Verify the runtime actually honors the limit: JVM MaxRAMPercentage, Node --max-old-space-size, Go automaxprocs.
- CPU: --cpus for absolute limits, --cpu-shares for relative weight.
- pids_limit and ulimit nofile show up often in production stability work.
- Measurement: docker stats → cAdvisor + Prometheus + Grafana.
In the next post (#6 Production operations) we wrap up Docker Advanced. PID 1 signal handling, SIGTERM graceful shutdown, restart policies in depth, healthcheck from an operations angle, liveness vs. readiness — the details that keep a container stable in production.