K8s Intermediate #1: StatefulSet / DaemonSet / Job / CronJob — Controllers Beyond Deployment
The first post of the K8s Intermediate series. The Deployment we covered in Basics is a controller built around one pattern — “keep N identical Pods up.” But production clusters always have workloads Deployment can’t handle. This post covers the four controllers that fill those gaps — StatefulSet, DaemonSet, Job, CronJob — in one pass. For each, we start with “why Deployment doesn’t work,” then walk through the manifest and operational caveats in one cycle.
The K8s Intermediate series runs seven posts:
- #1 StatefulSet / DaemonSet / Job / CronJob — Controllers beyond Deployment ← this post
- #2 PV / PVC / StorageClass — the persistent data model
- #3 Ingress and Ingress Controller — the external entry point
- #4 resources.requests / limits — Pod resource requests and limits
- #5 Health checks — liveness / readiness / startup probes
- #6 Autoscaling — HPA / VPA / Cluster Autoscaler
- #7 RBAC / NetworkPolicy / ResourceQuota — security and resource policy
Workloads Deployment can’t express #
The mental model of Deployment from Basics #4, in one line — keep N copies of the same Pod template up at all times, and replace them gradually when a new version arrives. This works well for stateless workloads — web servers, API servers, worker queue consumers — where Pods don’t have to be distinguished from each other. Whether it’s web-abc123-aa11 or web-abc123-bb22, the same code runs, and if one Pod dies another Pod takes over.
There are four patterns this model doesn’t fit:
- Workloads where Pods must be treated as distinct from each other: primary and replicas in a database cluster, broker-0 / broker-1 / broker-2 in Kafka. Each Pod needs its own identity and its own disk. Deployment Pods get random names, and every replica is stamped from the same volume definition, so no Pod gets a disk of its own.
- Workloads that must run exactly one per node: log shippers, node monitoring agents, CNI (Container Network Interface) agents. What you need is “match the node count automatically,” not “a replicas count,” and Deployment’s replicas field can’t express that intent.
- Workloads that should run once and finish: DB migrations, one-shot data reports, cluster setup scripts. Deployment tries to restart a Pod when it terminates, but for these jobs finishing is the goal.
- Workloads that should run periodically: nightly backups, hourly cleanups, weekly reports. Cron-style scheduling has to live at the controller layer.
K8s splits these four into separate controllers — StatefulSet, DaemonSet, Job, CronJob. We’ll walk through each.
StatefulSet — for workloads that need identity and disks #
Try to run a database on K8s with Deployment and you hit a wall immediately. When a PostgreSQL primary dies and a new Pod comes up, that new Pod has to inherit the previous Pod’s data directory. A randomized name won’t do, and how the replicas address the primary needs to stay stable. Deployment guarantees none of these three.
StatefulSet solves three things:
- Stable Pod names: Pods get indexed names like <name>-0, <name>-1, <name>-2. They keep the same index across restarts. If web-0 dies and comes back, it’s web-0 again.
- A 1:1 persistent volume per Pod: PVCs declared via volumeClaimTemplates are created automatically per Pod. web-0 gets a data-web-0 PVC, web-1 gets data-web-1, and that mapping survives the Pod’s lifecycle. The PV / PVC model itself is covered in depth in #2.
- Sequential lifecycle: by default, Pods are created in order from index 0, and termination runs in reverse (from N-1). Rolling updates also proceed in reverse, from the highest index down to 0. The ordered startup fits topologies where the primary has to be up before replicas can attach.
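The ordering in the last bullet is the default behavior (podManagementPolicy: OrderedReady). For workloads that want stable names and per-Pod disks but don’t care about startup order, the StatefulSet spec also accepts Parallel. A minimal sketch, assuming a hypothetical StatefulSet named cache backed by a headless Service of the same name; the names and image are placeholders, not part of this post’s example:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache                     # hypothetical name, for illustration only
spec:
  serviceName: cache              # assumes a headless Service named "cache" exists
  replicas: 3
  podManagementPolicy: Parallel   # default is OrderedReady (0, then 1, then 2)
  selector:
    matchLabels:
      app: cache
  template:
    metadata:
      labels:
        app: cache
    spec:
      containers:
        - name: redis
          image: redis:7.2        # placeholder image
```

Parallel only changes how Pods are created and deleted during scaling; rolling updates keep their ordered behavior.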
Pairs with a Headless Service #
A StatefulSet is usually created together with a headless Service, because each Pod needs a stable DNS name.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  clusterIP: None
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80
The key is the one line clusterIP: None. This Service doesn’t take a virtual IP; instead, it creates an individual DNS record per Pod. Inside the cluster you can call each Pod by name:
web-0.web.default.svc.cluster.local
web-1.web.default.svc.cluster.local
web-2.web.default.svc.cluster.local
The form is <pod>.<headless-service>.<namespace>.svc.cluster.local. If a regular ClusterIP Service is “a virtual IP in front of multiple Pods,” a headless Service is “a stable name tag per Pod.”
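One way to see a per-Pod record in action is a throwaway client Pod that calls a single replica by name. This is only a sketch: it assumes the web StatefulSet from the next section is already running in the default namespace, and the Pod name and curl image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-check                  # hypothetical name
spec:
  restartPolicy: Never             # run once, then stay in Completed / Error
  containers:
    - name: curl
      image: curlimages/curl:8.8.0 # placeholder; any image that ships curl works
      # Address web-0 specifically through its per-Pod DNS record,
      # instead of letting a ClusterIP Service pick a backend.
      command: ["curl", "-s", "http://web-0.web.default.svc.cluster.local"]
```

From inside the same namespace, the short form web-0.web should resolve as well.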
The StatefulSet manifest #
The StatefulSet manifest that pairs with the headless Service above:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          ports:
            - containerPort: 80
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
Three things are different from Deployment:
- spec.serviceName: web points to the headless Service we made above. This is where StatefulSet registers its Pods’ DNS records.
- spec.volumeClaimTemplates is a template that auto-creates one PVC per Pod. The manifest above creates three PVCs (data-web-0, data-web-1, data-web-2) and mounts each to /usr/share/nginx/html on the corresponding Pod. Which actual disk a PVC lands on is decided by StorageClass’s dynamic provisioning, the topic of #2.
- replicas and Pod names: same replicas: 3 as Deployment, but the Pod names are pinned: web-0, web-1, web-2. There’s no intermediate ReplicaSet object either.
kubectl get pods,pvc -l app=web
NAME        READY   STATUS    RESTARTS   AGE
pod/web-0   1/1     Running   0          1m
pod/web-1   1/1     Running   0          50s
pod/web-2   1/1     Running   0          40s
NAME                               STATUS   VOLUME    CAPACITY   AGE
persistentvolumeclaim/data-web-0   Bound    pvc-...   1Gi        1m
persistentvolumeclaim/data-web-1   Bound    pvc-...   1Gi        50s
persistentvolumeclaim/data-web-2   Bound    pvc-...   1Gi        40s
Pods come up staggered in the order 0, 1, 2, and you can see one PVC per Pod.
One operational caveat — PVCs survive scale-down #
Scale a StatefulSet from replicas: 3 down to replicas: 1 and Pods web-1, web-2 terminate, but the PVCs data-web-1, data-web-2 stick around. This is intentional — a safety net so you don’t accidentally lose data. Scale back up to replicas: 3 and the new web-1, web-2 re-mount those PVCs and see the previous data intact.
To clean up the PVCs you have to delete them explicitly:
kubectl delete pvc data-web-1 data-web-2
That safety net means the data survives even when somebody accidentally scales a StatefulSet down. K8s 1.27+ lets you change this behavior with spec.persistentVolumeClaimRetentionPolicy, but for data preservation, leaving the default in place is safer.
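For reference, the retention policy is a small block under the StatefulSet’s spec. The fragment below is a sketch of one non-default combination, not a recommendation; both fields accept Retain or Delete, and Retain is the default for each.

```yaml
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain   # keep PVCs when the StatefulSet itself is deleted (default)
    whenScaled: Delete    # delete a Pod's PVC when scale-down removes that Pod
```

With whenScaled: Delete, scaling from 3 to 1 would remove data-web-1 and data-web-2 along with their Pods, so this setting only suits data you can rebuild.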
DaemonSet — exactly one per node #
Production clusters always have workloads where you need to “look at each node’s state from inside that node.” Fluent Bit collecting container logs and shipping them centrally; Node Exporter measuring CPU / memory / disk and exposing it to Prometheus; CNI agents (Calico, Cilium) wiring up Pod networking. What these have in common — they should run as many copies as there are nodes.
Deployment’s replicas: N can’t express that intent. Every time the node count changes, someone has to update N by hand, and replicas by itself does nothing to stop two copies of the same Pod landing on one node, or none on another.
DaemonSet solves it cleanly — run exactly one Pod on each node in the cluster. When a new node joins, one starts on it automatically; when a node leaves, its Pod goes with it.
The DaemonSet manifest #
The biggest difference is no replicas field.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.8.2
          args:
            - --path.rootfs=/host
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: rootfs
              mountPath: /host
              readOnly: true
      volumes:
        - name: rootfs
          hostPath:
            path: /
Same selector + template shape as Deployment, but no replicas. The count is decided by the node count. hostNetwork: true and hostPath volumes are common patterns in DaemonSet workloads: many of these workloads have to expose Pods on the node’s network interface directly, or look directly at the node’s filesystem.
kubectl get ds -n monitoring
kubectl get pods -n monitoring -o wide
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
node-exporter   3         3         3       3            3           <none>          2m
NAME                  READY   STATUS    RESTARTS   AGE   IP          NODE
node-exporter-7xk2p   1/1     Running   0          2m    10.0.0.11   node-1
node-exporter-9mn4v   1/1     Running   0          2m    10.0.0.12   node-2
node-exporter-bc8qr   1/1     Running   0          2m    10.0.0.13   node-3
The point is that DESIRED 3 is auto-determined by the node count. Add another node and DESIRED flips to 4 and a new Pod starts on that node automatically.
Targeting only some nodes — nodeSelector / tolerations #
A DaemonSet by default puts a Pod on every node its Pods are allowed to schedule on, which usually means every worker node. In practice you often want only some: GPU monitors only on GPU nodes, or nothing on control plane nodes.
Use nodeSelector to limit by node labels:
spec:
  template:
    spec:
      nodeSelector:
        hardware: gpu
Conversely, to also land on tainted nodes (e.g., the control plane), use tolerations:
spec:
  template:
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
A real example: on most clusters, kube-proxy in the kube-system namespace runs as a DaemonSet. It carries tolerations so that it lands on every node, including the control plane. Worth checking with kubectl get ds -n kube-system.
When a node is cordoned / drained #
Common commands during node maintenance are kubectl cordon and kubectl drain. cordon blocks new scheduling; drain evicts the Pods on the node so that their controllers recreate them elsewhere. DaemonSet Pods are the exception: being one-per-node is their job, so moving one to another node has no meaning. By default drain refuses to proceed when it finds DaemonSet-managed Pods, and the standard pattern is to add --ignore-daemonsets.
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
Job — work that runs once and finishes #
DB schema migrations, a one-shot data integrity check, a cluster initialization script. When this kind of work finishes, it’s done. What happens if you run a migration container as a Deployment? Deployment Pods always run with restartPolicy: Always, so the moment the container exits cleanly (exit 0), it is treated like any other stopped container and restarted. The migration runs in an infinite loop, which is an incident waiting to happen.
Job is the controller for this scenario. The model is the opposite of Deployment — a successful Pod termination is the normal outcome.
The Job manifest #
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 4
  activeDeadlineSeconds: 600
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: migrator
          image: myapp/migrator:1.4.0
          command: ["./migrate.sh"]
          env:
            - name: DB_HOST
              value: postgres.default.svc.cluster.local
apiVersion: batch/v1 is new. The Deployment family is apps/v1, but Job / CronJob are in a separate group. Key fields, one line each:
- completions: 1 is the number of Pods that must terminate successfully. The example above is 1 and done. Set it to N to process a large dataset split into N pieces.
- parallelism: 1 is the number of Pods up at once. With completions: 10 and parallelism: 3, 10 items are processed 3 at a time in parallel.
- backoffLimit: 4 is the max number of retries on failure. Default is 6. If exceeded, the Job itself ends up Failed.
- activeDeadlineSeconds: 600 is a wall-clock cap for the entire Job. If it doesn’t finish within 600 seconds, its running Pods are terminated and the Job is marked Failed. A safety net for migrations stuck in an infinite loop.
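The completions: 10 / parallelism: 3 case can be sketched as below. To let each Pod know which piece it owns, this sketch also sets completionMode: Indexed, which the manifest above does not use; the Job name, image, and command are placeholders.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: shard-processor        # hypothetical name
spec:
  completions: 10              # 10 successful Pods in total
  parallelism: 3               # at most 3 Pods running at once
  completionMode: Indexed      # each Pod is assigned an index from 0 to 9
  backoffLimit: 4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36  # placeholder image
          # In Indexed mode Kubernetes sets JOB_COMPLETION_INDEX in the container.
          command: ["sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]
```

Without Indexed mode the Pods are interchangeable, and the work usually has to be handed out through an external queue instead.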
restartPolicy is restricted #
Pod’s restartPolicy usually has three values — Always, OnFailure, Never — but Job’s Pod template doesn’t allow Always. apiserver rejects the manifest if you write Always.
The reason is simple. Always means “restart the container no matter how it ended (success or failure),” but Job is a workload that expects to terminate. Allowing Always would mean restarting even on success and would erase the meaning of Job. So only OnFailure (restart in place on failure only) or Never (never restart in place; the Job controller creates a fresh Pod instead) are allowed.
The two differ subtly — OnFailure restarts the container inside the same Pod, while Never marks that Pod as failed and creates a new Pod from scratch. If you want logs preserved for debugging, Never is usually the pick; if you want fast retries, OnFailure.
Watching the Job run #
kubectl apply -f db-migration-job.yaml
kubectl get jobs
kubectl get pods --selector=job-name=db-migration
NAME           COMPLETIONS   DURATION   AGE
db-migration   0/1           20s        20s
NAME                 READY   STATUS    RESTARTS   AGE
db-migration-xkz2p   1/1     Running   0          20s

NAME           COMPLETIONS   DURATION   AGE
db-migration   1/1           45s        2m
NAME                 READY   STATUS      RESTARTS   AGE
db-migration-xkz2p   0/1     Completed   0          2m
COMPLETIONS 1/1 and the Pod ending in Completed is the shape of normal termination. Logs come straight back via kubectl logs db-migration-xkz2p. Job objects remain in the cluster until you explicitly run kubectl delete job db-migration, which is useful for auditing history; alternatively, set ttlSecondsAfterFinished to auto-clean them.
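As a sketch of that auto-cleanup, here is the same migration Job with a TTL and with restartPolicy: Never, so that failed Pods stay around for log inspection until the TTL removes everything. The one-hour value is an arbitrary assumption.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  ttlSecondsAfterFinished: 3600   # delete the Job and its Pods 1 hour after it finishes
  backoffLimit: 4
  template:
    spec:
      restartPolicy: Never        # each failure leaves its own Pod (and logs) behind
      containers:
        - name: migrator
          image: myapp/migrator:1.4.0
          command: ["./migrate.sh"]
```

The TTL applies to both Complete and Failed Jobs, so pick a window long enough to investigate a failure.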
CronJob — periodic execution #
DB backup at 3 AM every day, hourly cleanup of temp files, weekly stats report on Monday morning. That pattern is CronJob. The model is straightforward — create a Job object on a cron schedule. Think of it as a cron scheduler stacked on top of Job.
The CronJob manifest #
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
spec:
  schedule: "0 3 * * *"
  timeZone: "Asia/Seoul"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  startingDeadlineSeconds: 300
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 1800
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: myapp/backup:2.1.0
              command: ["/usr/local/bin/backup.sh"]
              env:
                - name: S3_BUCKET
                  value: my-backups
The CronJob manifest has two layers: the outer spec’s scheduling fields and the inner jobTemplate’s Job definition. The jobTemplate’s spec is shaped exactly like the spec of the Job manifest above.
The outer fields, one line each:
- schedule: "0 3 * * *" is a standard 5-field cron expression. In order: minute hour day month dow. The example above is daily at 3:00 AM. Standard cron syntax like */15 * * * * (every 15 minutes) or 0 9 * * 1-5 (weekdays at 9 AM) works as-is.
- timeZone: "Asia/Seoul" has been stable since 1.27. Without it, the schedule is interpreted in the time zone of the control plane components, typically UTC, leading to incidents like “why is my 3 AM backup running at noon?” Specifying this field removes that ambiguity.
- concurrencyPolicy decides what to do when a previous run is still going at the next scheduled time. Default is Allow.
- successfulJobsHistoryLimit / failedJobsHistoryLimit set how many successful / failed Job objects to keep around in the cluster. Defaults are 3 and 1. Setting them too high accumulates Jobs in etcd.
- startingDeadlineSeconds: 300 means that if a run hasn’t started within this many seconds after its scheduled time, it is skipped. A safety net against a stale run firing long after its slot when a paused control plane recovers.
concurrencyPolicy — three choices #
Leaving the default Allow makes ops mistakes easy. The three options behave clearly differently.
| Policy | Behavior |
|---|---|
| Allow (default) | Even if the previous run isn’t done, create a new Job for this run. Multiple can run at once. |
| Forbid | If the previous run isn’t done, skip this run. |
| Replace | Kill the previous run’s Job and replace it with this run. |
For workloads like DB backups where two of the same shouldn’t touch the same data, Forbid is the right answer. If a backup ever takes longer than the gap between runs (say, 90 minutes on an hourly schedule), Allow means overlapping backups pile up. For “only the latest run needs to be alive” workloads (e.g., cache warming), Replace is right.
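The cache-warming case might look like the following sketch. Everything specific here (the CronJob name, the 15-minute schedule, the image, and the script path) is hypothetical; only the concurrencyPolicy line is the point.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cache-warmer               # hypothetical name
spec:
  schedule: "*/15 * * * *"         # every 15 minutes
  concurrencyPolicy: Replace       # a new run cancels a still-running previous one
  startingDeadlineSeconds: 300
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: warm
              image: myapp/cache-warmer:1.0.0            # hypothetical image
              command: ["/usr/local/bin/warm-cache.sh"]  # hypothetical script
```

Replace only makes sense when an interrupted run leaves nothing half-written behind; a backup cut off mid-run is exactly what Forbid is there to prevent.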
The risk without startingDeadlineSeconds #
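A subtle CronJob trap is startingDeadlineSeconds. Without it, a missed run never expires: if the control plane is paused for a while and then recovers, the controller can still start the most recent missed run, however late it is. A backup scheduled for 3 AM might then fire in the middle of the working day. The upstream documentation also describes the controller counting the schedules missed since the last run and treating more than 100 of them as an error, a threshold that a CronJob running every minute crosses after less than two hours of downtime.
In production, almost always set startingDeadlineSeconds to a sensible value (e.g., 300 seconds). Skipping a run that didn’t start in time is in almost every case better than having it fire at an arbitrary later moment.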
Watching CronJob #
kubectl get cronjob,jobs,pods
NAME                      SCHEDULE    TIMEZONE     LAST SCHEDULE   AGE
cronjob.batch/db-backup   0 3 * * *   Asia/Seoul   8h              2d
NAME                           COMPLETIONS   DURATION   AGE
job.batch/db-backup-29345400   1/1           14m        8h
job.batch/db-backup-29346840   1/1           13m        20m
NAME                           READY   STATUS      RESTARTS   AGE
pod/db-backup-29346840-7kxqr   0/1     Completed   0          20m
You can see the three layers: one CronJob, a Job per run underneath, and one Pod per Job that ran once and ended Completed. Completed Jobs are retained up to the successfulJobsHistoryLimit count for postmortem debugging.
When to use which controller #
Adding Deployment from Basics to the four here, the five controllers in one table.
| Controller | Best for | Pod identifier | Termination model |
|---|---|---|---|
| Deployment | Stateless web/API servers, worker consumers | Random (web-abc-aa11) | Restart on death |
| StatefulSet | DBs, message queue brokers, distributed caches | web-0, web-1 (fixed) | Restart with the same index |
| DaemonSet | Node agents, log shippers, CNI | One per node | Restart on death |
| Job | DB migrations, one-shot batches | Random | Done when it succeeds |
| CronJob | Periodic backups, cleanups, reports | A Job per run | Each run follows the Job model |
The mental decision tree is simple:
- Can Pods be identical? — if not, StatefulSet; if yes, next question.
- Must there be exactly one per node? — if yes, DaemonSet; if no, next question.
- Should it run once and finish? — if yes, CronJob for periodic, Job for one-shot. Otherwise, Deployment.
Once you know these four controllers, you can read the intent behind any kind: line in a cluster’s manifest directory at a glance.
Summary #
What this post pinned down:
- Deployment sits on the stateless assumption: Pods are interchangeable, restart on death, simple model. Identity, per-node, one-shot, periodic workloads need other controllers.
- StatefulSet: serviceName (headless Service) and volumeClaimTemplates are the core. Pod names like <name>-0, <name>-1 are stable; each Pod owns its PVC. PVCs survive scale-down.
- DaemonSet: no replicas. Auto-matches the node count, one per node. nodeSelector / tolerations control which nodes it lands on; kube-proxy is the canonical example.
- Job: apiVersion: batch/v1. completions, parallelism, backoffLimit, activeDeadlineSeconds shape behavior. restartPolicy is restricted to OnFailure / Never.
- CronJob: a cron scheduler on top of Job. The 5-field schedule, timeZone (1.27+), concurrencyPolicy (Allow / Forbid / Replace), and startingDeadlineSeconds to skip runs that missed their start window.
- The decision tree across the five: three questions about Pod identity, per-node need, and termination expectation.
Next — PV / PVC / StorageClass #
This post mentioned in one line that StatefulSet’s volumeClaimTemplates auto-creates PVCs, but didn’t cover what disks they actually map to or how. In production, behind that one line is the triangle of PV (PersistentVolume), PVC (PersistentVolumeClaim), StorageClass — how a Pod’s lifecycle separates from a disk’s lifecycle, how disks get dynamically provisioned, what the differences between accessModes (ReadWriteOnce, ReadOnlyMany, ReadWriteMany) actually mean, and how reclaimPolicy decides what happens to the disk when a PVC disappears.
#2 PV / PVC / StorageClass — the persistent data model sorts out the relationships between these three objects and walks through what StatefulSet’s volumeClaimTemplates actually produces on top of them, in one cycle.