K8s Intermediate #1: StatefulSet / DaemonSet / Job / CronJob — Controllers Beyond Deployment


Wednesday, April 22, 2026


The first post of the K8s Intermediate series. The Deployment we covered in Basics is a controller built around one pattern: “keep N identical Pods up.” Production clusters routinely run workloads that don’t fit that pattern. This post covers the four controllers that fill those gaps: StatefulSet, DaemonSet, Job, and CronJob. For each, we start with why Deployment doesn’t work, then walk through the manifest and the operational caveats.

This series is K8s Intermediate, 7 posts.

#1 StatefulSet / DaemonSet / Job / CronJob — Controllers beyond Deployment ← this post
#2 PV / PVC / StorageClass — the persistent data model
#3 Ingress and Ingress Controller — the external entry point
#4 resources.requests / limits — Pod resource requests and limits
#5 Health checks — liveness / readiness / startup probes
#6 Autoscaling — HPA / VPA / Cluster Autoscaler
#7 RBAC / NetworkPolicy / ResourceQuota — security and resource policy

Tip

The hands-on posts in this series have you write YAML manifests by hand. One misplaced indent or quote sends kubectl apply into an error that points away from the real cause, leaving you to trace it back from the cluster side. Pasting the manifest into utilrepo’s YAML validator before applying surfaces syntax errors with line and column numbers. utilrepo is a collection of lightweight web utilities that run in your browser, so secrets never leave your machine, and it also catches multi-document manifests joined by --- and tab-space mixes you’d otherwise miss.

Workloads Deployment can’t express

The mental model of Deployment from Basics #4, in one line — keep N copies of the same Pod template up at all times, and replace them gradually when a new version arrives. This works well for stateless workloads — web servers, API servers, worker queue consumers — where Pods don’t have to be distinguished from each other. Whether it’s web-abc123-aa11 or web-abc123-bb22, the same code runs, and if one Pod dies another Pod takes over.

There are four patterns this model doesn’t fit:

Workloads where Pods are not interchangeable: primary and replicas in a database cluster, or broker-0 / broker-1 / broker-2 in Kafka. Each Pod needs its own identity and its own disk. Deployment Pods get random names, and every replica that mounts a PVC references the same claim instead of getting its own.
Workloads that must run exactly one per node — log shippers, node monitoring agents, CNI (Container Network Interface) agents. What you need is “match the node count automatically,” not “a replicas count” — and Deployment’s replicas field can’t express that intent.
Workloads that should run once and finish — DB migrations, one-shot data reports, cluster setup scripts. Deployment tries to restart a Pod when it terminates, but for these jobs finishing is the goal.
Workloads that should run periodically — nightly backups, hourly cleanups, weekly reports. Cron-style scheduling has to live at the controller layer.

K8s splits these four into separate controllers — StatefulSet, DaemonSet, Job, CronJob. We’ll walk through each.

StatefulSet — for workloads that need identity and disks

Try to run a database on K8s with Deployment and you hit a wall immediately. When a PostgreSQL primary dies and a new Pod comes up, that new Pod has to inherit the previous Pod’s data directory. A randomized name won’t do, and how the replicas address the primary needs to stay stable. Deployment guarantees none of these three.

StatefulSet solves three things:

Stable Pod names — Pods get indexed names like <name>-0, <name>-1, <name>-2. They keep the same index across restarts. If web-0 dies and comes back, it’s web-0 again.
A 1:1 persistent volume per Pod — PVCs declared via volumeClaimTemplates are created automatically per Pod. web-0 gets a data-web-0 PVC, web-1 gets data-web-1, and that mapping survives the Pod’s lifecycle. The PV / PVC model itself is covered in depth in #2.
Sequential lifecycle: by default, Pods are created in order from index 0, each waiting until the previous one is Running and Ready, and terminated in reverse (from N-1). Rolling updates also run in reverse, from the highest index down to 0. The model fits topologies where the primary has to be up before replicas can attach.
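
The ordering described above is a default that can be tuned in the spec. As a minimal sketch, assuming a workload whose Pods don’t depend on each other’s startup order, the excerpt below relaxes the ordering for scaling and stages a rolling update with a partition. The values are illustrative, not a recommendation.

```yaml
# Excerpt of a StatefulSet spec; illustrative values only
spec:
  # Default is OrderedReady (0, 1, 2 in sequence). Parallel launches and
  # terminates Pods without waiting for the previous index to be Ready.
  # It applies to scaling operations, not to rolling updates.
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Only Pods with an index >= 2 are updated; web-0 and web-1 keep
      # the old template until the partition is lowered.
      partition: 2
```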

Pairs with a Headless Service

A StatefulSet is usually created together with a headless Service, because each Pod needs a stable DNS name.

web-headless.yaml

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  clusterIP: None
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80

The key is the one line clusterIP: None. This Service doesn’t take a virtual IP; instead, it creates an individual DNS record per Pod. Inside the cluster you can call each Pod by name:

DNS for StatefulSet Pods

web-0.web.default.svc.cluster.local
web-1.web.default.svc.cluster.local
web-2.web.default.svc.cluster.local

The form is <pod>.<headless-service>.<namespace>.svc.cluster.local. If a regular ClusterIP Service is “a virtual IP in front of multiple Pods,” a headless Service is “a stable name tag per Pod.”
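
To make the naming scheme concrete, here is a sketch of how a replica’s container might be pointed at the primary by its stable DNS name. The PRIMARY_HOST variable and the image name are hypothetical; the host value follows the pattern above, assuming a StatefulSet and headless Service both named web in the default namespace.

```yaml
# Excerpt of a Pod template; PRIMARY_HOST and the image are hypothetical names
spec:
  containers:
    - name: replica
      image: example/replica:1.0
      env:
        - name: PRIMARY_HOST
          # <pod>.<headless-service>.<namespace>.svc.cluster.local
          value: web-0.web.default.svc.cluster.local
```

Because the name stays the same when web-0 is rescheduled, the replica’s configuration doesn’t need to change when the Pod’s IP does.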

The StatefulSet manifest

The StatefulSet manifest that pairs with the headless Service above:

web-statefulset.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          ports:
            - containerPort: 80
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi

Three things are different from Deployment:

spec.serviceName: web — points to the headless Service we made above. This is where StatefulSet registers its Pods’ DNS records.
spec.volumeClaimTemplates — a template that auto-creates one PVC per Pod. The manifest above creates three PVCs (data-web-0, data-web-1, data-web-2) and mounts each to /usr/share/nginx/html on the corresponding Pod. Which actual disk a PVC lands on is decided by StorageClass’s dynamic provisioning — the topic of #2.
replicas and Pod names — same replicas: 3 as Deployment, but the Pod names are pinned: web-0, web-1, web-2. There’s no intermediate ReplicaSet object either.

After applying the StatefulSet

kubectl get pods,pvc -l app=web

Example output

NAME        READY   STATUS    RESTARTS   AGE
pod/web-0   1/1     Running   0          1m
pod/web-1   1/1     Running   0          50s
pod/web-2   1/1     Running   0          40s

NAME                               STATUS   VOLUME   CAPACITY   AGE
persistentvolumeclaim/data-web-0   Bound    pvc-...  1Gi        1m
persistentvolumeclaim/data-web-1   Bound    pvc-...  1Gi        50s
persistentvolumeclaim/data-web-2   Bound    pvc-...  1Gi        40s

Pods come up staggered in the order 0, 1, 2, and you can see one PVC per Pod.

One operational caveat — PVCs survive scale-down

Scale a StatefulSet from replicas: 3 down to replicas: 1 and Pods web-1, web-2 terminate, but the PVCs data-web-1, data-web-2 stick around. This is intentional — a safety net so you don’t accidentally lose data. Scale back up to replicas: 3 and the new web-1, web-2 re-mount those PVCs and see the previous data intact.

To clean up the PVCs you have to delete them explicitly:

Clean up the PVCs too

kubectl delete pvc data-web-1 data-web-2

That safety net means the data survives even when somebody accidentally scales a StatefulSet down. The spec.persistentVolumeClaimRetentionPolicy field (beta and enabled by default since K8s 1.27) lets you change this behavior, but for data preservation, leaving the default (Retain) in place is safer.
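
For reference, a sketch of what opting out of the default looks like. Both keys default to Retain; the excerpt below keeps PVCs when the StatefulSet is deleted but removes them when replicas are scaled away, which only makes sense for data you can rebuild.

```yaml
# Excerpt of a StatefulSet spec
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain   # default: keep PVCs when the StatefulSet is deleted
    whenScaled: Delete    # non-default: delete a Pod's PVC when it is scaled away
```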

DaemonSet — exactly one per node

Production clusters always have workloads where you need to “look at each node’s state from inside that node.” Fluent Bit collecting container logs and shipping them centrally; Node Exporter measuring CPU / memory / disk and exposing it to Prometheus; CNI agents (Calico, Cilium) wiring up Pod networking. What these have in common — they should run as many copies as there are nodes.

Deployment’s replicas: N can’t express that intent. Every time the node count changes, someone has to update N by hand, and replicas alone does nothing to stop two copies of the same Pod landing on one node while another node gets none.

DaemonSet solves it cleanly — run exactly one Pod on each node in the cluster. When a new node joins, one starts on it automatically; when a node leaves, its Pod goes with it.

The DaemonSet manifest

The biggest difference is no replicas field.

node-exporter-daemonset.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.8.2
          args:
            - --path.rootfs=/host
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: rootfs
              mountPath: /host
              readOnly: true
      volumes:
        - name: rootfs
          hostPath:
            path: /

Same selector + template shape as Deployment, but no replicas. The count is decided by the node count. hostNetwork: true and hostPath volumes are common patterns in DaemonSet workloads — many of these workloads have to expose Pods on the node’s network interface directly, or look directly at the node’s filesystem.

Check the DaemonSet

kubectl get ds -n monitoring
kubectl get pods -n monitoring -o wide

Example output

NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
node-exporter   3         3         3       3            3           <none>          2m

NAME                  READY   STATUS    RESTARTS   AGE   IP           NODE
node-exporter-7xk2p   1/1     Running   0          2m    10.0.0.11    node-1
node-exporter-9mn4v   1/1     Running   0          2m    10.0.0.12    node-2
node-exporter-bc8qr   1/1     Running   0          2m    10.0.0.13    node-3

The point is that DESIRED 3 is auto-determined by the node count. Add another node and DESIRED flips to 4 and a new Pod starts on that node automatically.

Targeting only some nodes — nodeSelector / tolerations

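A DaemonSet by default puts a Pod on every node it can be scheduled to. In most clusters that means every worker node, because control plane nodes usually carry a NoSchedule taint. In practice you often want a different set: GPU monitors only on GPU nodes, or an agent that also covers the control plane.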

Use nodeSelector to limit by node labels:

Only on GPU nodes — excerpt

spec:
  template:
    spec:
      nodeSelector:
        hardware: gpu

Conversely, to also land on tainted nodes (e.g., the control plane), use tolerations:

Also on the control plane — excerpt

spec:
  template:
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule

A real example — kube-proxy running in the cluster’s kube-system namespace is a DaemonSet. It has tolerations like the above so it lands on every node, including the control plane. Worth checking with kubectl get ds -n kube-system.

When a node is cordoned / drained

Common commands during node maintenance are kubectl cordon and kubectl drain. cordon blocks new scheduling; drain additionally evicts the node’s Pods so their controllers recreate them on other nodes. DaemonSet Pods are not evicted by drain’s default behavior: being one-per-node is their job, so moving one to another node has no meaning. Instead, drain refuses to proceed while DaemonSet-managed Pods are present, and the standard pattern is to add --ignore-daemonsets.

Node maintenance — ignore DaemonSets

kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

Job — work that runs once and finishes

DB schema migrations, a one-shot data integrity check, a cluster initialization script. When this kind of work finishes, it’s done. What happens if you run a migration container as a Deployment? A Deployment’s Pod template only allows restartPolicy: Always, so the moment the container exits, even cleanly (exit 0), the kubelet restarts it. The migration runs in an endless loop, which is an incident waiting to happen.

Job is the controller for this scenario. The model is the opposite of Deployment — a successful Pod termination is the normal outcome.

The Job manifest

db-migration-job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 4
  activeDeadlineSeconds: 600
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: migrator
          image: myapp/migrator:1.4.0
          command: ["./migrate.sh"]
          env:
            - name: DB_HOST
              value: postgres.default.svc.cluster.local

apiVersion: batch/v1 is new here. The Deployment family lives in apps/v1, but Job and CronJob are in a separate API group. Key fields, one line each:

completions: 1 — the number of times a Pod must terminate successfully. The example above is 1 and done. Set to N to process a large dataset split into N pieces.
parallelism: 1 — the number of Pods up at once. With completions: 10 and parallelism: 3, 10 items are processed 3 at a time in parallel.
backoffLimit: 4 — the max number of Pod retries on failure. Default is 6. If exceeded, the Job itself ends up Failed.
activeDeadlineSeconds: 600 sets a wall-clock cap for the entire Job, retries included. If the Job doesn’t finish within 600 seconds, its running Pods are terminated and the Job is marked Failed with reason DeadlineExceeded. A safety net for migrations stuck in an infinite loop.
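
The completions / parallelism combination described above can be sketched as a full manifest. This one assumes an Indexed Job, where each Pod receives its piece number in the JOB_COMPLETION_INDEX environment variable; the Job name, image, and process.sh script are hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: report-shards        # hypothetical name
spec:
  completions: 10            # 10 pieces in total
  parallelism: 3             # at most 3 Pods running at once
  completionMode: Indexed    # each Pod gets a distinct index from 0 to 9
  backoffLimit: 4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: example/report-worker:1.0   # hypothetical image
          # JOB_COMPLETION_INDEX is set by Kubernetes for Indexed Jobs
          command: ["sh", "-c", "./process.sh --shard $JOB_COMPLETION_INDEX"]
```

Without completionMode: Indexed, the Pods are indistinguishable and the work split has to come from somewhere else, such as an external queue.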

restartPolicy is restricted

Pod’s restartPolicy usually has three values (Always, OnFailure, Never), but a Job’s Pod template doesn’t allow Always. The API server rejects the manifest if you write Always.

The reason is simple. Always means “restart the container no matter how it ended (success or failure),” but Job is a workload that expects to terminate. Allowing Always would mean restarting even on success and would erase the meaning of Job. So only OnFailure (restart the container in place on failure) or Never (no in-place restart; the Job controller creates a fresh Pod instead) are allowed.

The two differ subtly: OnFailure restarts the container inside the same Pod, while Never marks that Pod as failed and creates a new Pod from scratch. Failed Pods stay around under Never, so if you want logs preserved for debugging, Never is usually the pick; if you want fast retries, OnFailure.
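
As a sketch of the debugging-friendly variant, the excerpt below switches the Job template to Never. With backoffLimit: 4 the Job controller may create several Pods in turn, each one a separate Pod whose logs can still be read after it fails.

```yaml
# Excerpt of a Job spec
spec:
  backoffLimit: 4              # retries before the Job is marked Failed
  template:
    spec:
      restartPolicy: Never     # each retry is a new Pod; failed Pods are kept
```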

Watching the Job run

Create the Job and watch progress

kubectl apply -f db-migration-job.yaml
kubectl get jobs
kubectl get pods --selector=job-name=db-migration

Example output — in progress

NAME           COMPLETIONS   DURATION   AGE
db-migration   0/1           20s        20s

NAME                  READY   STATUS    RESTARTS   AGE
db-migration-xkz2p    1/1     Running   0          20s

Example output — completed

NAME           COMPLETIONS   DURATION   AGE
db-migration   1/1           45s        2m

NAME                  READY   STATUS      RESTARTS   AGE
db-migration-xkz2p    0/1     Completed   0          2m

COMPLETIONS 1/1 and the Pod ending in Completed is the shape of normal termination. Logs come straight back via kubectl logs db-migration-xkz2p. Finished Job objects and their Pods remain in the cluster until you explicitly run kubectl delete job db-migration, which is useful for auditing history. To clean them up automatically, set ttlSecondsAfterFinished.
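
A sketch of the automatic cleanup, assuming one hour of retention is enough for your audit needs (the value is illustrative):

```yaml
# Excerpt of a Job spec
spec:
  # Delete the Job and its Pods 3600 seconds after it finishes,
  # whether it ended Complete or Failed.
  ttlSecondsAfterFinished: 3600
```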

CronJob — periodic execution

DB backup at 3 AM every day, hourly cleanup of temp files, weekly stats report on Monday morning. That pattern is CronJob. The model is straightforward — create a Job object on a cron schedule. Think of it as a cron scheduler stacked on top of Job.

The CronJob manifest

db-backup-cronjob.yaml

apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
spec:
  schedule: "0 3 * * *"
  timeZone: "Asia/Seoul"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  startingDeadlineSeconds: 300
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 1800
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: myapp/backup:2.1.0
              command: ["/usr/local/bin/backup.sh"]
              env:
                - name: S3_BUCKET
                  value: my-backups

The CronJob manifest has two layers — the outer spec’s scheduling fields and the inner jobTemplate’s Job definition. The inner jobTemplate is shaped exactly like the spec of the Job manifest above.

The outer fields, one line each:

schedule: "0 3 * * *" is a standard 5-field cron expression. In order: minute, hour, day of month, month, day of week. The example above is daily at 3:00 AM. Standard cron syntax like */15 * * * * (every 15 minutes) or 0 9 * * 1-5 (weekday mornings at 9) works as-is.
timeZone: "Asia/Seoul" — stable since 1.27. Before that, CronJob times followed the timezone of the control plane components and were typically interpreted as UTC, leading to incidents like “why is my 3 AM backup running at noon?” Specifying this field removes that ambiguity.
concurrencyPolicy — what to do when a previous run is still going at the next scheduled time. Default is Allow.
successfulJobsHistoryLimit / failedJobsHistoryLimit — how many successful / failed Job objects to keep around in the cluster. Defaults are 3 and 1. Setting them too high accumulates Jobs in etcd.
startingDeadlineSeconds: 300 means that if a run hasn’t started within this many seconds after its scheduled time, it is skipped and counted as missed. A safety net against the failure mode where the control plane was paused, then recovers, and starts a run long after the time it was meant for.
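
On clusters where timeZone isn’t available, one workaround is to write the schedule in the controller’s own time zone. This sketch assumes the kube-controller-manager runs in UTC; 03:00 in Asia/Seoul (UTC+9, no daylight saving) is 18:00 UTC on the previous day.

```yaml
# Excerpt of a CronJob spec without timeZone
spec:
  # 18:00 UTC is 03:00 the next day in Asia/Seoul
  schedule: "0 18 * * *"
```

The conversion is only this simple for zones without daylight saving time, which is one more reason to prefer the timeZone field where it exists.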

concurrencyPolicy — three choices

Leaving the default Allow makes ops mistakes easy. The three options behave clearly differently.

Policy	Behavior
`Allow` (default)	Even if the previous run isn’t done, create a new Job for this run. Multiple can run at once.
`Forbid`	If the previous run isn’t done, skip this run.
`Replace`	Kill the previous run’s Job and replace it with this run.

For workloads like DB backups where two of the same shouldn’t touch the same data, Forbid is the right answer. If a backup takes 90 minutes and the schedule is hourly, Allow means a new backup starts while the previous one is still running, and overlapping runs pile up. For “only the latest run needs to be alive” workloads (e.g., cache warming), Replace is right.
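
As a sketch of the Replace case, assuming a hypothetical cache-warming job on a 15-minute schedule (the name and image are made up for illustration):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cache-warm                      # hypothetical name
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Replace            # a still-running previous run is replaced
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: warm
              image: example/cache-warmer:1.0   # hypothetical image
```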

The risk without startingDeadlineSeconds

A subtle CronJob trap is startingDeadlineSeconds. Without it, a missed run has no expiry: if the control plane is paused for a while and then recovers, a missed run can start long after its scheduled time. According to the Kubernetes documentation, the controller also counts every schedule missed since the last run, and with more than 100 missed it stops starting the Job and logs an error instead. A CronJob that runs every minute crosses that threshold after less than two hours of downtime.

In production, almost always set startingDeadlineSeconds to a sensible value (e.g., 300 seconds). Skipping a run that didn’t start in time is in almost every case better than running it long after its slot.

Watching CronJob

CronJob and the Jobs / Pods underneath

kubectl get cronjob,jobs,pods

Example output, after two runs

NAME                      SCHEDULE      TIMEZONE      LAST SCHEDULE   AGE
cronjob.batch/db-backup   0 3 * * *     Asia/Seoul    8h              2d

NAME                            COMPLETIONS   DURATION   AGE
job.batch/db-backup-29345400    1/1           14m        32h
job.batch/db-backup-29346840    1/1           13m        8h

NAME                                  READY   STATUS      RESTARTS   AGE
pod/db-backup-29346840-7kxqr          0/1     Completed   0          8h

You can see the three layers: one CronJob, a Job per run underneath, and a Pod per Job that ran once and ended Completed (only the latest run’s Pod is shown here). Completed Jobs are retained up to the successfulJobsHistoryLimit count for postmortem debugging.

When to use which controller

Adding Deployment from Basics to the four here, the five controllers in one table.

Controller	Best for	Pod identifier	Termination model
Deployment	Stateless web/API servers, worker consumers	Random (`web-abc-aa11`)	Restart on death
StatefulSet	DBs, message queue brokers, distributed caches	`web-0`, `web-1` (fixed)	Restart with the same index
DaemonSet	Node agents, log shippers, CNI	Random suffix, one per node	Restart on death
Job	DB migrations, one-shot batches	Random	Done when it succeeds
CronJob	Periodic backups, cleanups, reports	A Job per run	Each run follows the Job model

The mental decision tree is simple:

Can Pods be identical? — if not, StatefulSet; if yes, next question.
Must there be exactly one per node? — if yes, DaemonSet; if no, next question.
Should it run once and finish? — if yes, CronJob for periodic, Job for one-shot. Otherwise, Deployment.

Once you know these four controllers, you can read the intent behind any kind: field in a cluster’s manifest directory at a glance.

Summary

What this post pinned down:

Deployment sits on the stateless assumption — Pods are interchangeable, restart on death, simple model. Identity, per-node, one-shot, periodic workloads need other controllers.
StatefulSet — serviceName (headless Service) and volumeClaimTemplates are the core. Pod names like <name>-0, <name>-1 are stable; each Pod owns its PVC. PVCs survive scale-down.
DaemonSet: no replicas. Auto-matches the node count, one per node. nodeSelector narrows it to specific nodes and tolerations extend it to tainted ones; kube-proxy is the canonical example.
Job — apiVersion: batch/v1. completions, parallelism, backoffLimit, activeDeadlineSeconds shape behavior. restartPolicy is restricted to OnFailure / Never.
CronJob: a cron scheduler on top of Job. The 5-field schedule, timeZone (1.27+), concurrencyPolicy (Allow / Forbid / Replace), and startingDeadlineSeconds to keep missed runs from starting long after their slot.
The decision tree across the five — three questions about Pod identity, per-node need, and termination expectation.

Next — PV / PVC / StorageClass

This post mentioned in one line that StatefulSet’s volumeClaimTemplates auto-creates PVCs, but didn’t cover what disks they actually map to or how. In production, behind that one line is the triangle of PV (PersistentVolume), PVC (PersistentVolumeClaim), StorageClass — how a Pod’s lifecycle separates from a disk’s lifecycle, how disks get dynamically provisioned, what the differences between accessModes (ReadWriteOnce, ReadOnlyMany, ReadWriteMany) actually mean, and how reclaimPolicy decides what happens to the disk when a PVC disappears.

#2 PV / PVC / StorageClass — the persistent data model sorts out the relationships between these three objects and walks through what StatefulSet’s volumeClaimTemplates actually produces on top of them, in one cycle.