Docker Advanced #6: Production Operations — graceful shutdown, healthcheck, restart

Friday, April 24, 2026

This is the final post of Docker Advanced. The previous posts (build, multi-arch, security, resource limits) all dealt with the shape of a single container. This one collects the operational details that let a container shut down cleanly and recover reliably in production.

This post in the Docker Advanced series:

#1 BuildKit and buildx
#2 Multi-architecture images
#3 Image security
#4 SBOM and signing
#5 Resource limits and cgroups
#6 Production operations — restart policy, healthcheck, graceful shutdown ← this post

What `docker stop` actually does — once more, deeper #

Basics #3 touched on this briefly. Here is the same flow from an operations angle:

docker stop flow

docker stop myapp
   │
   ▼
SIGTERM is sent to the container's PID 1
   │
   ▼
Wait 10 seconds by default (adjust with --time)
   │
   ├─ PID 1 exits in time → the container stops with PID 1’s exit code
   │
   └─ Timeout → SIGKILL (exit code 137)

The whole weight of this flow rides on PID 1. PID 1 must catch SIGTERM, propagate it to its children, and finish its own cleanup — only then does the container shut down gracefully.

The PID 1 problem — common breakage inside containers #

PID 1 is special on Linux:

Adopts orphaned children — must reap zombies
Different signal-delivery rules: a signal reaches PID 1 only if PID 1 has installed a handler for it, so an unhandled SIGTERM is silently dropped

Most apps (Python, Node, Java) were never designed to run as PID 1. Two problems result inside containers:

Problem 1 — SIGTERM is ignored #

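A process running as PID 1 only receives SIGTERM if it has registered a handler for it; the default “terminate” action does not apply. Unless the app or its runtime installs one, Docker’s SIGTERM goes unhandled, and 10 seconds later SIGKILL takes the container out forcefully.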

Diagnosis:

Does the app receive SIGTERM?

docker run --rm -d --name test myapp
docker stop test
# If shutdown takes ~10 seconds, SIGTERM is likely being ignored.
# If it ends in 1–2 seconds, you're good.
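
When the container is started without --rm, its exit code after docker stop (docker inspect --format '{{.State.ExitCode}}' test) tells the same story more precisely. A minimal Python sketch of how to read it, assuming the usual "128 + signal number" convention; describe_exit is a name made up for this example:

```python
import signal


def describe_exit(code: int) -> str:
    """Translate a container exit code into a short human-readable cause."""
    if code == 0:
        return "clean exit"
    if code > 128:
        # By convention, 128 + N means the process was ended by signal N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            return f"exit code {code} (unknown signal {signum})"
        return f"terminated by {name}"
    return f"application error (exit code {code})"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, "->", describe_exit(code))
```

A 137 after a stop that took the full grace window means the SIGKILL path was taken. Keep in mind that an OOM kill also shows up as 137, so the timing matters as much as the number.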

Problem 2 — Zombie process accumulation #

If the app spawns child processes (Node’s child_process.spawn, Python’s subprocess.Popen) and never wait()s for them, each exited child stays in the process table as a zombie. Children orphaned when their parent exits are re-parented to PID 1, which on a normal system is an init that reaps them. If the container’s PID 1 is your app and it doesn’t play that role, zombies pile up.
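
What “reaping” means in practice can be sketched in a few lines of Python. This is only an illustration of the kind of wait loop an init runs, not the code of any real init; reap_children and demo are names invented here, and the snippet assumes a Unix-like system and Python 3.9+:

```python
import os
import subprocess
import sys
import time


def reap_children():
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) pairs.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, os.waitstatus_to_exitcode(status)))
    return reaped


def demo():
    """Start a short-lived child, never wait() on it directly, then reap it."""
    child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(7)"])
    deadline = time.monotonic() + 10
    while time.monotonic() < deadline:
        for pid, code in reap_children():
            if pid == child.pid:
                return code
        time.sleep(0.05)
    return None  # the child was not reaped within the deadline


if __name__ == "__main__":
    print("reaped child with exit code", demo())
```

Until reap_children() picks it up, the exited child is exactly the kind of zombie described above.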

Solution — a small init at PID 1 #

The fix is to put a small init at PID 1 with the app underneath. Docker provides this with one option.

--init

docker run --init -d myapp

compose

services:
  web:
    image: myapp
    init: true

--init runs tini as PID 1 and your Dockerfile’s CMD becomes its child. tini:

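Forwards the signals it receives (SIGTERM included) to its child
Automatically reaps zombies

That one line solves both problems: the app is no longer PID 1, so even without a handler the default SIGTERM action terminates it, and tini reaps whatever is left over. Set init: true on production containers unless the image already ships its own init.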

`dumb-init` — baked into the Dockerfile #

The other path: run dumb-init (Yelp) as ENTRYPOINT.

Dockerfile

FROM python:3.14-slim
RUN apt-get update && apt-get install -y --no-install-recommends dumb-init && \
    rm -rf /var/lib/apt/lists/*
COPY app.py .
ENTRYPOINT ["dumb-init", "--"]
CMD ["python", "app.py"]

dumb-init becomes PID 1 with Python as its child, the same idea as tini. Docker’s --init needs no change to the image and is the lighter choice, but when you don’t know where the image will run (so there is no guarantee --init will be set), baking dumb-init into the image is safer.

The app’s own SIGTERM handler #

Even with init handling signal delivery, the app must respond to the signal for the shutdown to actually be graceful. Quick patterns per language.

Node.js #

server.js

const server = app.listen(3000);

const shutdown = () => {
  console.log('Received SIGTERM, draining connections...');
  server.close(() => {
    console.log('All connections drained, exiting.');
    process.exit(0);
  });

  // Force exit if not finished cleanly within 30 seconds
  setTimeout(() => {
    console.error('Force exit after 30s');
    process.exit(1);
  }, 30000).unref();
};

process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);

server.close() stops accepting new connections and invokes its callback once the existing connections have ended. The 30-second force-exit timer only helps if it fires before Docker’s SIGKILL, so keep it shorter than the grace window (10s by default, or your stop_grace_period).

Python (FastAPI / Django) #

Production servers like uvicorn / gunicorn handle SIGTERM automatically — you rarely write this yourself. Just ensure workers have enough time to finish in-flight requests.

gunicorn options

gunicorn app:app \
  --workers 4 \
  --graceful-timeout 30 \
  --timeout 60 \
  --bind 0.0.0.0:8000

--graceful-timeout 30: after SIGTERM, workers get up to 30s to finish in-flight requests before they are killed. Set Docker’s stop timeout slightly above it:

compose

services:
  web:
    stop_grace_period: 35s   # slightly longer than gunicorn's graceful-timeout
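
For a plain Python process with no server framework in front of it (a worker loop, or a script like the app.py in the dumb-init Dockerfile), nothing handles SIGTERM for you. A minimal sketch using only the standard library; process_one_item is a hypothetical callable standing in for real work, and the handlers must be installed from the main thread:

```python
import signal
import threading

# Set by the signal handler, polled by the main loop.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only record the request; the main loop does the cleanup."""
    stop_requested.set()


def install_handlers():
    # SIGTERM comes from `docker stop`; SIGINT covers Ctrl+C in a foreground run.
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)


def run(process_one_item, poll_seconds=1.0):
    """Process items until a stop is requested; return how many were handled.

    The item in progress always finishes before the loop exits, which is
    the "graceful" part of the shutdown.
    """
    handled = 0
    while not stop_requested.is_set():
        process_one_item()
        handled += 1
        # Event.wait doubles as a sleep that ends early when a stop arrives.
        stop_requested.wait(poll_seconds)
    return handled
```

In the entry point, call install_handlers() once at startup and then run(...) with the real work function. A single item must take less time than the stop grace window, otherwise SIGKILL still arrives mid-item.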

Go #

main.go

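srv := &http.Server{Addr: ":8000", Handler: mux}
go func() {
	// ErrServerClosed is the expected result once Shutdown has been called.
	if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
		log.Fatalf("listen: %v", err)
	}
}()

// Block until Docker (SIGTERM) or Ctrl+C (SIGINT) asks us to stop.
stop := make(chan os.Signal, 1)
signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)
<-stop

// Give in-flight requests up to 30 seconds to finish.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
if err := srv.Shutdown(ctx); err != nil {
	log.Printf("graceful shutdown did not complete: %v", err)
}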

http.Server.Shutdown is the canonical pattern — finish in-flight requests within a context timeout, then exit.

`stop_grace_period` — extending the window #

For apps that can’t finish cleanly in 10 seconds (e.g., processing large file uploads):

compose

services:
  web:
    stop_grace_period: 60s

docker stop

docker stop --time 60 myapp

This is a common production tuning, but there is a ceiling: a long window slows every deployment, and a load balancer such as ELB may already have marked the backend unhealthy and stopped routing to it, at which point the extra time buys nothing.

Restart policies — deeper #

The table from Intermediate #4, now from an operations angle.

Policy	Restarts the container
`no`	Never; for one-shot containers (migrations, seeds, builds)
`always`	Always, including after a daemon restart or host boot
`on-failure[:N]`	Only on a non-zero exit code, optionally at most N times
`unless-stopped`	Always, unless it was explicitly stopped

Safe production default — `unless-stopped` #

The difference between always and unless-stopped confuses people. The distinction is what docker stop means:

With always, stopping with docker stop and restarting the daemon brings the container back.
With unless-stopped, a container stopped via docker stop stays stopped across daemon restarts.

It’s more reasonable to honor an operator’s explicit stop, so unless-stopped is the production default.

Restart loop — backoff #

If an app dies on startup every time, restart: always becomes an endless restart loop. Docker doesn’t stop the loop, but it slows it down with a backoff: the delay before each restart doubles, starting at 100ms and capped at 1 minute. The delay resets once the container has stayed up for 10 seconds.

restart backoff

1st failure → wait 100ms
2nd failure → wait 200ms
3rd failure → wait 400ms
...
Nth failure → up to 1 minute

For containers with many consecutive failures, start with the logs (docker logs --tail 200 <c>) and check whether the container was OOM-killed (docker inspect --format '{{.State.OOMKilled}}' <c>, see #5).
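
To get a feel for how quickly the restart backoff grows, here is a small Python sketch. It assumes the behaviour described in the Docker documentation (first delay 100 ms, doubling each time, capped at 1 minute); restart_delays is a helper invented for this example, not a Docker API:

```python
def restart_delays(failures, first_ms=100, cap_ms=60_000):
    """Return the delay in milliseconds before each of the first `failures` restarts."""
    delays = []
    delay = first_ms
    for _ in range(failures):
        delays.append(delay)
        # Double for the next attempt, but never exceed the cap.
        delay = min(delay * 2, cap_ms)
    return delays


if __name__ == "__main__":
    print(restart_delays(5))   # → [100, 200, 400, 800, 1600]
    print(restart_delays(12))  # the last two entries are both 60000
```

The cap is reached on the 11th consecutive failure, so a container that never stays up settles into roughly one restart per minute.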

Healthcheck — from operations #

The healthcheck from Intermediate #4, now from an operations angle.

Liveness vs. Readiness — two different questions #

Concepts from K8s, but equally useful as a thinking tool for Docker.

	Liveness	Readiness
Question	Is it alive?	Is it ready to receive traffic?
On failure	Restart the container	Block traffic (don’t restart)
Example	Stuck in deadlock → needs restart	DB temporarily disconnected → just pause traffic

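Docker has only one healthcheck and makes no such distinction, so in Docker-only setups one check has to answer both questions. Note also that plain Docker Engine only marks the container unhealthy; acting on that status (restarting it, pulling it out of rotation) is left to an orchestrator or external tooling.

Workaround: define the healthcheck closer to liveness, and handle readiness inside the app. For example, return 503 from /health while the app is still starting up, then 200 once ready.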
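
The workaround can be sketched with Python’s standard library alone. This is an illustration, not a production server: the ready flag, HealthHandler, start_health_server and health_status are names assumed for this example, and a real app would expose the same logic through its own framework:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once warm-up (migrations, cache fill, ...) has finished.
ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        # 503 while starting, 200 once ready; no DB or downstream calls here.
        status = 200 if ready.is_set() else 503
        body = b"ok\n" if status == 200 else b"starting\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the application log


def start_health_server(port=8000, host="0.0.0.0"):
    """Serve /health from a background thread and return the server object."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def health_status(port, host="127.0.0.1"):
    """Return the HTTP status code that /health currently answers with."""
    conn = http.client.HTTPConnection(host, port, timeout=2)
    try:
        conn.request("GET", "/health")
        return conn.getresponse().status
    finally:
        conn.close()
```

The app calls ready.set() at the end of its startup sequence. Until then the healthcheck fails fast with 503 instead of hanging, which keeps the probe inside the 1-second budget.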

Properties of a good healthcheck #

Good healthcheck

□ Responds quickly (within 1s)
□ Doesn't recursively check downstream services
□ No side effects
□ Dedicated endpoint (/health) — separated from regular traffic
□ No auth (don't add an attack surface; access from inside only)

Bad healthcheck

✗ Runs DB queries — DB load
✗ Runs business logic — dependencies / load
✗ Calls external APIs — when an external dep goes down, your container goes unhealthy
✗ Requires auth — health checkers also need credentials

A healthcheck only needs to confirm the app is alive and able to serve a request. The health of dependencies belongs to separate monitoring.

Startup grace — `start_period` #

For apps with migrations or warm-up, a brief unhealthy period right after startup is normal.

With start_period

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
  interval: 10s
  timeout: 5s
  retries: 3
  start_period: 60s   # failures during the first 60 seconds aren't counted

The closest K8s equivalent is startupProbe.
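
The curl-based test assumes curl is installed in the image, which is often not the case for slim or distroless bases. For a Python image, one alternative is a small standard-library probe. A sketch under that assumption; probe and its defaults are hypothetical, chosen to match the port and path used above:

```python
import http.client


def probe(host="127.0.0.1", port=8000, path="/health", timeout=3.0):
    """Return 0 if the endpoint answers 2xx within the timeout, else 1.

    The return value is meant to be used directly as the process exit code,
    which is what Docker's healthcheck looks at.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
        return 0 if 200 <= status < 300 else 1
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, malformed response: all count as unhealthy.
        return 1
    finally:
        conn.close()
```

To use it as the healthcheck command, end the file with raise SystemExit(probe()) and point test at it, for example test: ["CMD", "python", "/app/healthcheck.py"] (the path is a placeholder).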

Logging — operational details #

The stdout principle and log drivers from Intermediate #6, now from an operations angle.

Log rotation #

Required options

services:
  web:
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"

Without these, json-file logs grow without bound, and a full disk is a classic Docker incident. Set them daemon-wide for safety.

/etc/docker/daemon.json

{
  "log-driver": "local",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

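The local driver stores logs in a more compact internal format than json-file, with rotation and compression enabled by default, and Docker’s documentation recommends it for preventing disk exhaustion. Changes to daemon.json take effect after a daemon restart and apply only to containers created afterwards.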

To external collectors #

As production scales, logs don’t stop at stdout — they flow to external collectors.

To fluentd

services:
  web:
    logging:
      driver: fluentd
      options:
        fluentd-address: localhost:24224
        tag: web.{{.Name}}

From there, fluentd can route to Loki, Elasticsearch, CloudWatch, or anywhere else. This post only mentions it in passing.

Monitoring — a one-line extension #

cAdvisor + Prometheus + Grafana from #5 is a reasonable first monitoring setup for Docker-only operations. Common panels:

CPU / memory / network per container
Restart count (alarm on containers restarting often)
OOMKill events
Healthcheck failure rate
Disk IO

A first alarm rule: “the same container restarts 3+ times in 5 minutes.” Catch frequent OOMKills / crashes early.
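
In a Prometheus setup this would normally live in an alerting rule, but the logic itself is just a sliding window. A minimal Python sketch of that rule; RestartAlarm and its parameters are names assumed for this example, and timestamps are plain seconds supplied by the caller:

```python
from collections import deque


class RestartAlarm:
    """Fire when one container restarts `threshold` or more times within the window."""

    def __init__(self, threshold=3, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = {}  # container name -> deque of restart timestamps

    def record(self, container, timestamp):
        """Record one restart; return True if the alarm should fire now."""
        events = self._events.setdefault(container, deque())
        events.append(timestamp)
        # Forget restarts that have fallen out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


if __name__ == "__main__":
    alarm = RestartAlarm()
    for t in (0, 100, 200):
        print(t, alarm.record("web", t))  # the third restart fires the alarm
```

Each container has its own window, so one noisy container does not mask or trigger alarms for another.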

Operations checklist #

The checklist for one container, gathering everything across the series:

Image / build

□ Multi-stage — separate build tools ([Intermediate #1])
□ Base: slim or distroless ([Intermediate #1], [#3])
□ Multi-arch — linux/amd64 + linux/arm64 ([#2])
□ Dockerfile: hadolint clean ([#3])
□ Image: Trivy HIGH/CRITICAL clean ([#3])
□ SBOM attached + cosign signed ([#4])
□ Build with buildx + external cache ([Intermediate #2], [#1])

Runtime / compose.yaml

□ image: digest or semver (no latest)
□ restart: unless-stopped
□ init: true (PID 1 handling)
□ stop_grace_period set (longer than the app's graceful time)
□ healthcheck — fast, light, no auth
□ Resources: mem_limit + cpus + pids_limit ([#5])
□ Security: read_only + tmpfs + cap_drop ALL + no-new-privileges ([#3])
□ Secrets: secrets: or external manager — never in ENV ([Intermediate #5], [#4])
□ Logs: max-size + max-file
□ DB / internal services: bind -p only to 127.0.0.1
□ Per-environment values: .env / override files ([Intermediate #4])

Deploy / CI

□ Build → multi-arch → SBOM → sign → push in one workflow ([#4])
□ Make verification a gate — Trivy / cosign verify
□ Tagging: publish semver + Git SHA + latest together, but deploy by digest or semver ([Basics #5])
□ External cache: type=gha or type=registry ([Intermediate #2])

What’s next — Docker in Practice #

This series went deep on Docker itself. The next series — Docker in Practice — puts everything we’ve built into real app deploys:

FastAPI containerization — a production-grade Dockerfile
Django + PostgreSQL compose — admin / static / migration too
React/Next.js build container — standalone, multi-stage
Building images in CI — full GitHub Actions workflow
Registry push and tag strategy — operational details
Cloud deploy — one of Fly.io / Railway / ECS

A series where every tool we’ve built across Basics / Intermediate / Advanced finally comes together.

Wrap-up #

The picture from this post:

docker stop = SIGTERM → grace window → SIGKILL. PID 1’s signal handling is the core.
The PID 1 problem — apps weren’t designed for PID 1. Use init: true or dumb-init.
The app handles SIGTERM to finish in-flight work — Node’s server.close, gunicorn’s --graceful-timeout, Go’s srv.Shutdown.
stop_grace_period ensures enough time to clean up.
restart: unless-stopped is the production default; backoff keeps a crash loop from hammering the host.
Healthcheck: fast, light, closer to liveness. Dependency checks belong elsewhere.
Operations checklist split across image / runtime / deploy.
