Docker Advanced #5: Resource Limits and cgroups


So far, container resource use has been treated as something the host figures out for you. That stops working in production — one container eats the host’s memory and kills other services, or pegs the CPU and adds latency. This post tackles resource limits in earnest.

cgroups: one axis of container isolation

As briefly noted in #1, cgroups (control groups) are the Linux kernel’s feature for accounting and limiting resource use. Namespaces isolate what a container can see; cgroups limit what it can use, which is what makes running many containers on one host safe.

There are two generations.

| | cgroups v1 | cgroups v2 |
|---|---|---|
| Released | 2007 | 2016 |
| Layout | One hierarchy per resource | Single unified hierarchy |
| Memory accounting | Partial | Accurate |
| Docker support | Long-standing | Stable on 20.10+ |

Most modern Linux distros default to v2. Docker Desktop too. This post assumes v2.

Check:

cgroups version
stat -fc %T /sys/fs/cgroup
# cgroup2fs   ← v2
# tmpfs       ← v1 (older systems)

docker info | grep Cgroup
# Cgroup Driver: systemd
# Cgroup Version: 2
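
The same check can be done from code, for example in a startup script. The following is a minimal sketch, assuming the conventional /sys/fs/cgroup mount point; the function name cgroup_version is illustrative and not part of any Docker tooling.

```python
from pathlib import Path
from typing import Optional


def cgroup_version(root: str = "/sys/fs/cgroup") -> Optional[int]:
    """Return 2 for a unified (v2) mount, 1 for a legacy layout, None if absent."""
    base = Path(root)
    if not base.is_dir():
        return None
    # cgroup.controllers is a cgroup v2 interface file;
    # a v1 mount does not have it at this path.
    if (base / "cgroup.controllers").is_file():
        return 2
    return 1
```

Calling cgroup_version() with no argument inspects the running system; passing a different root is mainly useful for testing.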

Memory limits: --memory

The most-used knob.

docker run
docker run -d --memory 512m myapp
docker run -d -m 512m myapp        # short

compose.yaml
services:
  web:
    image: myapp
    mem_limit: 512m         # or deploy.resources.limits.memory (Swarm)
    mem_reservation: 256m   # soft limit

mem_limit vs. mem_reservation

| Option | Meaning |
|---|---|
| mem_limit | Hard limit. Exceed it and the container is OOMKilled. |
| mem_reservation | Soft limit. When the host is under memory pressure, the kernel reclaims first from containers that are above their reservation. |

In production, only mem_limit is usually set. mem_reservation matters in multi-tenant scenarios with many containers per host.

Unit syntax

Units
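512        # bytes (no suffix)
512b       # bytes
512k       # kilobytes (512 × 1024 bytes)
512m       # megabytes (512 × 1024 × 1024 bytes)
2g         # gigabytes (2 × 1024³ bytes)

Here m means megabytes, and the units are 1024-based. Don’t confuse it with Kubernetes’ 500m, which means 500 millicores (0.5 CPU).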
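
To make the unit rules concrete, here is a small sketch that converts the suffix forms listed above into bytes. It is an illustration of the rules, not Docker’s own parser, and it only handles the forms shown in this post; parse_mem is a hypothetical helper name.

```python
import re

# 1024-based multipliers for the suffixes listed above.
_UNITS = {"": 1, "b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_mem(value: str) -> int:
    """Convert a string such as '512m' or '2g' into a byte count."""
    match = re.fullmatch(r"(\d+)([bkmg]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]
```

For example, parse_mem("512m") gives 536870912, which is the number docker inspect reports in HostConfig.Memory for a 512m limit.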

Swap

--memory-swap
docker run -m 512m --memory-swap 1g myapp
# RAM 512m + swap (1g - 512m = 512m) = 1g total

docker run -m 512m --memory-swap -1 myapp
# Unlimited swap (until host limits)

docker run -m 512m --memory-swap 512m myapp
# No swap allowed (RAM limit is the total limit)

In production it’s typical to disable swap entirely on the host. Swap makes performance harder to predict.
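
The arithmetic in the three cases above is easy to get backwards, because --memory-swap is the total of RAM plus swap, not the swap amount itself. A minimal sketch of that rule, with swap_allowance as a hypothetical helper name:

```python
from typing import Optional

MIB = 1024 ** 2


def swap_allowance(memory: int, memory_swap: int) -> Optional[int]:
    """Bytes of swap the container may use; None means unlimited."""
    if memory_swap == -1:
        return None  # swap is bounded only by the host
    if memory_swap < memory:
        # The total cannot be smaller than the RAM part.
        raise ValueError("--memory-swap must be at least --memory")
    return memory_swap - memory
```

With -m 512m --memory-swap 1g this yields 512 MiB of swap; with both set to 512m it yields 0.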

OOMKilled: what happens past the limit

When a container exceeds its memory limit, it ends as OOMKilled.

Diagnose OOMKilled
docker inspect myapp --format '{{.State.OOMKilled}}'
# true

docker inspect myapp --format '{{.State.ExitCode}}'
# 137   ← SIGKILL (128 + 9)

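Exit code 137 on its own only says the process received SIGKILL, which also happens after docker kill or a docker stop timeout. Paired with State.OOMKilled: true, it is the OOMKilled signature. The host’s dmesg confirms: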

dmesg log
sudo dmesg | grep -i 'killed process'
# Memory cgroup out of memory: Killed process 12345 (python) ...
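
The 128 + 9 arithmetic behind exit code 137 generalizes to any signal. A short sketch that decodes an exit code, assuming the usual shell convention of 128 plus the signal number (describe_exit is an illustrative name):

```python
import signal


def describe_exit(code: int) -> str:
    """Name the signal behind an exit code above 128, else echo the code."""
    if code > 128:
        signum = code - 128
        try:
            return signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"signal {signum}"
    return f"exit {code}"
```

A result of SIGKILL tells you how the process died, not who sent the signal, which is why State.OOMKilled is still worth checking.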

If OOMKilled is frequent in production:

  1. Limit too small — measure and raise it
  2. App memory leak — track growth over time
  3. Runtime doesn’t see the limit — next section

A container’s memory perception: runtime traps

Inside a container, reading free or /proc/meminfo shows the host’s memory. The cgroup limit lives elsewhere.

Inside the container
docker run --rm -m 512m ubuntu free -m
#               total        used        free
# Mem:          15920         542       14253     ← host memory

Why this matters: some runtimes size their heaps and pools from the host’s total memory (the same numbers free reports), then blow past the cgroup limit and get OOMKilled.
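
On a cgroups v2 host, the limit the kernel actually enforces is exposed inside the container as the file memory.max. The sketch below reads it, assuming the default /sys/fs/cgroup/memory.max path; the function names are illustrative.

```python
from typing import Optional


def parse_memory_max(text: str) -> Optional[int]:
    """Parse the contents of memory.max; the literal 'max' means no limit."""
    value = text.strip()
    return None if value == "max" else int(value)


def container_memory_limit(path: str = "/sys/fs/cgroup/memory.max") -> Optional[int]:
    """Return the cgroup v2 memory limit in bytes, or None if unset or unreadable."""
    try:
        with open(path) as handle:
            return parse_memory_max(handle.read())
    except OSError:
        return None
```

A runtime or startup script that sizes itself from this value, rather than from /proc/meminfo, stays inside the limit.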

Java (JVM)

JVM limit perception
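# Old (JVM 8 before 8u191): default heap is sized from host memory → frequent OOMKills,
# so the heap had to be pinned by hand
java -Xmx2g app.jar

# JVM 10+ (and 8u191+): -XX:+UseContainerSupport (default) → reads the cgroup limit
java -XX:MaxRAMPercentage=75.0 app.jar

JVM 10+ enables UseContainerSupport by default. On cgroups v2 hosts, use a JVM release that understands v2 (JDK 15+, or a recent 8 / 11 update). Prefer MaxRAMPercentage over -Xmx in containers: it is a percentage of the limit instead of an absolute size, so it follows the limit when you change it.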
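
As a rough worked example of what MaxRAMPercentage means for a given limit, the sketch below does the percentage arithmetic. It ignores the JVM’s own alignment and small-container heuristics, so treat the result as an estimate; max_heap_bytes is a hypothetical name.

```python
MIB = 1024 ** 2


def max_heap_bytes(limit_bytes: int, max_ram_percentage: float) -> int:
    """Approximate max heap for a container limit and a MaxRAMPercentage value."""
    return int(limit_bytes * max_ram_percentage / 100)
```

With a 512m container and MaxRAMPercentage=75.0 this gives 384 MiB of heap, leaving about 128 MiB for metaspace, thread stacks, and other non-heap memory.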

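Node.js

Node has the same shape. V8’s default old-space limit depends on the Node version and the machine (roughly 1.5–4GB) and may not match your container limit.

Node: set the memory limit explicitly
node --max-old-space-size=384 app.js

--max-old-space-size is given in MiB and covers only the V8 heap. If the container limit is 512m, set it somewhat below that (384 here) to leave headroom for buffers, stacks, and native memory.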

Python

CPython has a simple GC and nothing explicit to set. However, places like multiprocessing decide worker counts via os.cpu_count(), which returns the host’s core count — not the container’s. Set worker count via env vars explicitly.
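
One way to do that is to read the worker count from the environment and fall back to a small fixed default instead of os.cpu_count(). This is a minimal sketch; the variable name WEB_WORKERS and the default of 2 are assumptions chosen for illustration.

```python
import os
from typing import Mapping, Optional


def worker_count(env: Optional[Mapping[str, str]] = None, default: int = 2) -> int:
    """Read the worker count from WEB_WORKERS (hypothetical name), else use default."""
    env = os.environ if env is None else env
    raw = env.get("WEB_WORKERS", "")
    try:
        value = int(raw)
    except ValueError:
        return default  # unset or not a number
    return value if value > 0 else default
```

The value can then be passed to whatever creates the pool, for example multiprocessing.Pool(processes=worker_count()), and set per deployment with docker run -e WEB_WORKERS=4.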

CPU limits: --cpus / --cpu-shares

CPU limits come in two flavors: an absolute cap and a relative weight.

CPU limits
# 1) Absolute: one core's worth of CPU time
docker run --cpus 1.0 myapp

# 2) Absolute: 1.5 cores' worth of CPU time (spread across any cores, not pinned)
docker run --cpus 1.5 myapp

# 3) Relative weight versus other containers
docker run --cpu-shares 512 myapp

| Option | Meaning |
|---|---|
| --cpus N | Absolute CPU available (in cores) |
| --cpu-shares | Relative weight (default 1024). Decides distribution under contention. |
| --cpuset-cpus 0-2 | Pin to specific cores (e.g., 0,1,2) |

In production, --cpus for an absolute limit is more predictable. --cpu-shares only matters when you’re prioritizing among containers on one host.
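
To see what a relative weight means in practice, the sketch below computes each container’s share of CPU when every container is busy at once. It models only the fully contended case; when other containers are idle, a container can use more than its share. The container names are made up.

```python
from typing import Dict


def cpu_split(shares: Dict[str, int]) -> Dict[str, float]:
    """Fraction of contended CPU each container gets, given its --cpu-shares."""
    total = sum(shares.values())
    return {name: weight / total for name, weight in shares.items()}
```

For example, cpu_split({"api": 1024, "batch": 512}) gives api two thirds of the CPU and batch one third while both are competing.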

How CFS quota actually behaves

--cpus 1.0 translates to a CFS (Completely Fair Scheduler) quota: 100ms of CPU time per 100ms window. This sometimes causes unexpected throttling: a multi-threaded process can use up the whole quota early in the window and then sits throttled until the next one, so a momentary burst gets cut short.

There’s an active opinion in K8s circles that CFS quotas (cpu.cfs_period_us / cpu.cfs_quota_us on v1, cpu.max on v2) are impractical, and some setups intentionally skip CPU limits (memory limits stay). For Docker on a single host, setting --cpus is normal.
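
The quota arithmetic in this section can be sketched in a few lines. The first function renders a --cpus value in the "quota period" form that cgroups v2 uses in cpu.max, assuming the default 100ms period; the second estimates how far into each window a fully busy process runs before it is throttled. Both names are illustrative.

```python
def cpu_max(cpus: float, period_us: int = 100_000) -> str:
    """Render a --cpus value as a cgroup v2 cpu.max line: '<quota> <period>'."""
    quota_us = round(cpus * period_us)
    return f"{quota_us} {period_us}"


def ms_until_throttled(cpus: float, busy_threads: int, period_ms: float = 100.0) -> float:
    """Wall-clock ms into each period before the quota is used up."""
    quota_ms = cpus * period_ms
    # Each busy thread burns quota in parallel, so more threads exhaust it sooner.
    return min(period_ms, quota_ms / busy_threads)
```

With --cpus 1.0 and four busy threads, the quota is gone after 25ms and the container waits out the remaining 75ms of the window, which is the burst-then-stall pattern described above.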

A container’s CPU perception

Runtimes like JVM / Node / Go pick GC threads / worker counts based on the core count. If os.cpu_count() returns the host count, you’re misaligned.

Inside the container
docker run --rm --cpus 0.5 alpine nproc
# 8           ← host core count, ignores limit

Fixes:

  • JVM 10+: UseContainerSupport handles it automatically
  • Node: set UV_THREADPOOL_SIZE to control the thread pool
  • Go: use automaxprocs to make runtime.GOMAXPROCS honor cgroup limits (the Go 1.25+ runtime does this by default)
  • Python: pass worker count via env

Go's automaxprocs
# go.mod
require go.uber.org/automaxprocs v1.5.3

# main.go
import _ "go.uber.org/automaxprocs"

A single import line and GOMAXPROCS auto-aligns to the cgroup limit.
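
The same idea can be approximated in Python by reading cpu.max directly. This is a sketch under the assumption of a cgroups v2 host; rounding up to at least one worker is a choice made here for illustration, not a description of what automaxprocs does.

```python
import math
from typing import Optional


def cpus_from_cpu_max(text: str) -> Optional[int]:
    """Turn the contents of cpu.max ('<quota> <period>') into a whole CPU count."""
    quota, period = text.split()
    if quota == "max":
        return None  # no CPU limit set
    # Round up so that a fractional limit still yields at least one worker.
    return max(1, math.ceil(int(quota) / int(period)))
```

Inside a container started with --cpus 1.5, /sys/fs/cgroup/cpu.max contains "150000 100000", which this maps to 2.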

Resources in compose.yaml

Compose accepts two forms.

Simple form
services:
  web:
    image: myapp
    mem_limit: 512m
    mem_reservation: 256m
    cpus: 1.5
    pids_limit: 100

deploy form (Swarm-compatible)
services:
  web:
    image: myapp
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '1.5'
        reservations:
          memory: 256M
          cpus: '0.5'

For plain docker compose up, the simple form works. deploy.resources only fully takes effect in Swarm, though recent Compose recognizes some of it on a single host. For single-host operation, the simple form is less confusing.

pids_limit: runaway process protection

Effective against fork bombs and zombie accumulation.

PID limit
docker run --pids-limit 100 myapp

compose
services:
  web:
    pids_limit: 100

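A web app rarely needs to spawn 100 processes. Note that the limit counts threads as well as processes, so give thread-heavy runtimes (JVM, Go) more room. Capping prevents accidental runaway at the container level.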

ulimit: file descriptors and friends

Linux ulimit is settable per container. The most common is open file descriptors (nofile).

docker run
docker run --ulimit nofile=65536:65536 myapp

compose
services:
  web:
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
      nproc: 4096

High-traffic servers and workers that hold long-lived connections find a soft limit of 1024 too small (the actual default depends on how the Docker daemon is configured). Raising it in production is common.
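
To confirm which limit the process actually received, it can check at startup using the standard library’s resource module. A minimal sketch; has_fd_headroom is a hypothetical helper, and the threshold you pass is up to the application.

```python
import resource
from typing import Tuple


def nofile_limits() -> Tuple[int, int]:
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_fd_headroom(needed: int) -> bool:
    """True if the soft nofile limit allows at least `needed` descriptors."""
    soft, _hard = nofile_limits()
    return soft == resource.RLIM_INFINITY or soft >= needed
```

A server that expects around 10,000 concurrent connections could log a warning at startup when has_fd_headroom(10_000) is False.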

IO limits: --device-write-bps and friends

Block IO can be capped via cgroups too. Not used as often, but useful in multi-tenant hosts where one container’s disk IO would impact others.

IO limits
docker run --device-write-bps /dev/sda:10mb myapp
# Cap this container's /dev/sda write speed at 10MB/s

Not a typical knob in single-container resource definitions.

Measuring resources: docker stats again

The command from Intermediate #6. Reach for it when you want to see the effect of your limits.

Real-time usage
docker stats myapp
# CONTAINER     CPU %    MEM USAGE / LIMIT     MEM %     NET I/O    BLOCK I/O
# myapp-web-1   24.5%    312MiB / 512MiB       60.93%    12kB / 8kB   ...

If MEM % regularly sits above 70–80%, the limit is too small or there’s a leak. OOMKills happen suddenly, so checking your margin via stats day-to-day is safer.
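
That margin check is simple to script. The sketch below parses the MEM USAGE / LIMIT column as shown in the output above and compares the ratio against a threshold; the 0.75 default and the function names are assumptions for illustration.

```python
import re

_FACTORS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}


def _to_bytes(text: str) -> float:
    """Convert a value such as '312MiB' into bytes."""
    match = re.fullmatch(r"([\d.]+)\s*(B|KiB|MiB|GiB)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * _FACTORS[match.group(2)]


def mem_ratio(column: str) -> float:
    """Usage divided by limit for a 'USED / LIMIT' column value."""
    used, limit = column.split("/")
    return _to_bytes(used) / _to_bytes(limit)


def needs_attention(column: str, threshold: float = 0.75) -> bool:
    """True when usage is at or above the given fraction of the limit."""
    return mem_ratio(column) >= threshold
```

The column value can be obtained without the table layout via docker stats --no-stream --format '{{.MemUsage}}' myapp.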

Prometheus / cAdvisor

In production, you don’t watch stats with your eyes — you push the data to a time-series DB. cAdvisor exposes Docker’s cgroup accounting as Prometheus metrics.

Add cadvisor to compose
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"

Pair it with Prometheus + Grafana for per-container CPU / memory / IO graphs. This is a reasonable first monitoring setup for Docker-only operation.

OOMKilled diagnostic flow

When you see OOMKilled, run this:

Diagnostic sequence
# 1) Was it really OOMKilled?
docker inspect <c> --format '{{.State.OOMKilled}} {{.State.ExitCode}}'

# 2) Host dmesg record
sudo dmesg -T | grep -i oom

# 3) The limit (in bytes; 0 means no limit is set)
docker inspect <c> --format '{{.HostConfig.Memory}}'

# 4) Typical usage (live or from monitoring)
docker stats <c> --no-stream

# 5) Does the runtime see the limit (e.g., JVM)?
#    -XshowSettings:vm prints a "Max. Heap Size (Estimated)" line
docker exec <c> java -XshowSettings:vm -version 2>&1 | grep -i 'heap size'

Getting this flow into muscle memory resolves most memory incidents quickly.
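
Steps 1 and 3 can be combined into one small check on the JSON that docker inspect prints. This is a sketch that only reads the State.OOMKilled, State.ExitCode, and HostConfig.Memory fields used above; the classify function and its return strings are made up for illustration.

```python
import json


def classify(inspect_output: str) -> str:
    """Classify one container from the JSON printed by `docker inspect <c>`."""
    container = json.loads(inspect_output)[0]  # inspect prints a JSON array
    state = container["State"]
    limit = container["HostConfig"]["Memory"]  # bytes; 0 means no limit
    if state["OOMKilled"]:
        if limit:
            return f"oom-killed at limit {limit}"
        return "oom-killed without a container limit"
    if state["ExitCode"] == 137:
        # SIGKILL from something else, e.g. docker kill or a stop timeout.
        return "sigkill, not flagged as oom"
    return "no oom evidence"
```

Feeding it the output of docker inspect <c> (for example via subprocess.run with capture_output=True) gives a one-line answer to "was it really OOMKilled, and at what limit?".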

Wrap-up

The picture from this post:

  • Container resource limits run on cgroups v2 — paired with namespaces, the two axes of isolation.
  • --memory / mem_limit matters most. Setting limits on production containers is essentially required.
  • The OOMKilled signature: exit code 137 + State.OOMKilled: true.
  • Verify the runtime actually honors the limit — JVM MaxRAMPercentage, Node --max-old-space-size, Go automaxprocs.
  • CPU: --cpus for absolute limits, --cpu-shares for relative weight.
  • pids_limit and ulimit nofile show up often in production stability work.
  • Measurement: docker stats → cAdvisor + Prometheus + Grafana.

In the next post (#6 Production operations) we wrap up Docker Advanced. PID 1 signal handling, SIGTERM graceful shutdown, restart policies in depth, healthcheck from an operations angle, liveness vs. readiness — the details that keep a container stable in production.
