Chapter 8

StatefulSet / DaemonSet / Job / CronJob

A walkthrough of the controllers that handle the four kinds of workload Deployment's stateless assumption cannot express: StatefulSet's stable identity and 1:1 PVC, DaemonSet's one-per-node model, Job's run-to-completion model, and CronJob's cron scheduling with its concurrencyPolicy and startingDeadlineSeconds safeguards.

This is the first chapter of Part 2 (Workloads and Operations). The Deployment from Chapter 4, Deployment and ReplicaSet is a controller for stateless workloads. It assumes that several copies of the same Pod are interchangeable, and that if one disappears you simply bring it back up — a simple model. But a DB that needs identity and a disk, an agent that must run exactly one per node, a migration that should run once and finish, and a backup that runs every day cannot be expressed with Deployment. This chapter brings together the four controllers that fill that gap: StatefulSet, DaemonSet, Job, and CronJob.

By the end of this chapter you’ll have a decision tree that lets you sum up, in a single line, the intent behind any kind: you find in a cluster’s manifest directory. We follow each controller from the problem of “why doesn’t Deployment work here,” through a single manifest, to the operational caveats.

Workloads Deployment can’t express #

If we shrink the mental model of Deployment from Chapter 4 to one line, it’s this — keep N copies of the same Pod template alive at all times, and when a new version arrives, replace them gradually. The workloads this model fits well are stateless web servers, API servers, and worker queue consumers — cases where the Pods don’t need to be distinguished from one another. Whether it’s web-abc123-aa11 or web-abc123-bb22, the same code runs, and if any Pod dies another Pod fills its role.

There are four patterns this model doesn’t handle well.

  • Workloads where you must assume the Pods are different from one another: the primary and replica of a database cluster, or Kafka’s broker-0 / broker-1 / broker-2, where each Pod must have its own identity and its own disk. The Pods Deployment creates have arbitrary names, and every replica references the same volume claims from the shared template, so there is no way to give each Pod a disk of its own.
  • Workloads that must run exactly one per node: log collectors, node monitoring agents, CNI (Container Network Interface) agents. What you need is not “a replica count” but “automatically match the number of nodes,” and Deployment’s replicas field can’t express that intent.
  • Workloads that should run once and finish: DB migrations, one-off data reports, cluster setup scripts. Deployment tries to bring a Pod back up when it terminates, but for this kind of work, finishing is the normal outcome.
  • Workloads that should run periodically: a nightly backup, an hourly cleanup, a weekly report. Cron-like scheduling needs to live at the controller level.

These four are exactly what K8s has split into separate controllers: StatefulSet, DaemonSet, Job, and CronJob. Let’s look at them one at a time.

StatefulSet — workloads that need identity and a disk #

When you try to run a database on K8s, it’s clear where Deployment first hits a wall. When a PostgreSQL primary dies and a new Pod comes up, that new Pod must take over the previous Pod’s data directory unchanged. Its name must not change to an arbitrary value, and the address the other replicas use to reach the primary must stay stable too. Deployment guarantees none of these three.

StatefulSet solves these three things.

  • Stable Pod names: Pods get indexed names in the form <name>-0, <name>-1, <name>-2. Even when a Pod restarts, it keeps the same index. If web-0 dies and comes back, it’s web-0 again.
  • A 1:1 persistent volume per Pod: the PVC written in volumeClaimTemplates is created automatically for each Pod. web-0 gets the data-web-0 PVC, web-1 gets the data-web-1 PVC, and that mapping persists across the Pod’s lifecycle. The PV / PVC model itself is covered in depth in Chapter 9, PV / PVC / StorageClass.
  • A sequential lifecycle: by default, Pods are created in order starting from index 0, each one waiting until the previous Pod is Running and Ready, and termination proceeds in reverse order (from N-1 down). Rolling updates also go in reverse order, from the highest index down to 0. It’s a model fitted to topologies where the primary must come up first before a replica can attach.
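
The ordering described in the last bullet comes from two StatefulSet fields that are normally left implicit. As a minimal sketch, here is what a spec looks like with those defaults written out; nothing changes in behavior compared with omitting them.

```yaml
# Excerpt of a StatefulSet spec with the ordering-related defaults spelled out.
spec:
  podManagementPolicy: OrderedReady   # default: create 0, 1, 2 in order; terminate in reverse
  updateStrategy:
    type: RollingUpdate               # default: replace Pods from the highest index down
    rollingUpdate:
      partition: 0                    # default: every index is eligible for the update
```

Writing the defaults down is optional; it mainly helps a reviewer see at a glance that the ordering guarantees have not been relaxed.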

It pairs with a headless Service #

A StatefulSet is usually created paired with a headless Service, because each Pod needs a stable DNS name. The concept of a headless Service itself was already noted in one line in Chapter 5, Service, in the §“Service types in one table.”

web-headless.yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  clusterIP: None
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80

The key is the single line clusterIP: None. This Service does not get its own virtual IP; instead it creates an individual DNS record per Pod. From inside the cluster you can address each Pod directly by the following names.

DNS of StatefulSet Pods
web-0.web.default.svc.cluster.local
web-1.web.default.svc.cluster.local
web-2.web.default.svc.cluster.local

The form is <pod>.<headless-service>.<namespace>.svc.cluster.local. If an ordinary ClusterIP Service is “a virtual IP in front of several Pods,” then a headless Service is “an issuer of stable name tags for each Pod.”
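
To make the naming scheme concrete, here is a small sketch of how a client workload might consume these names. The container excerpt and the variable names PRIMARY_HOST and REPLICA_HOSTS are assumptions for illustration only; they do not appear elsewhere in this chapter.

```yaml
# Hypothetical container excerpt from a client Pod template.
env:
  - name: PRIMARY_HOST     # assumed variable name
    value: web-0.web.default.svc.cluster.local
  - name: REPLICA_HOSTS    # assumed variable name
    value: web-1.web.default.svc.cluster.local,web-2.web.default.svc.cluster.local
```

Because the Pod name and the headless Service name are both stable, these values stay valid even after web-0 is restarted.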

The StatefulSet manifest #

Here is the StatefulSet manifest that is applied alongside the headless Service above.

web-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          ports:
            - containerPort: 80
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi

There are three parts that differ from Deployment.

  • spec.serviceName: web — points to the name of the headless Service created above. It’s the field that tells the StatefulSet where to register its Pods’ DNS records.
  • spec.volumeClaimTemplates — a template that automatically creates a PVC per Pod. The manifest above creates three PVCs, data-web-0, data-web-1, and data-web-2, and mounts each at the Pod’s /usr/share/nginx/html. Which actual disk these PVCs connect to is decided by the StorageClass’s dynamic provisioning, and this whole flow is the main subject of Chapter 9.
  • replicas and Pod names — it’s the same replicas: 3 as Deployment, but the Pod names created are fixed to web-0, web-1, web-2. There is no intermediate ReplicaSet object either.
after applying the StatefulSet
kubectl get pods,pvc -l app=web
example output
NAME        READY   STATUS    RESTARTS   AGE
pod/web-0   1/1     Running   0          1m
pod/web-1   1/1     Running   0          50s
pod/web-2   1/1     Running   0          40s

NAME                               STATUS   VOLUME   CAPACITY   AGE
persistentvolumeclaim/data-web-0   Bound    pvc-...  1Gi        1m
persistentvolumeclaim/data-web-1   Bound    pvc-...  1Gi        50s
persistentvolumeclaim/data-web-2   Bound    pvc-...  1Gi        40s

You can see the Pods came up in order 0, 1, 2 with time gaps between them, and a PVC was created for each Pod individually.

One operational caveat — PVCs remain on scale-down #

If you reduce a StatefulSet from replicas: 3 to replicas: 1, the Pods web-1 and web-2 are terminated, but the PVCs data-web-1 and data-web-2 remain. This is intended behavior — a safeguard to keep you from accidentally losing data. If you scale back up to replicas: 3, the freshly recreated web-1 and web-2 remount those PVCs and see the previous data unchanged.

To clean up the PVCs as well, you have to delete them explicitly.

cleaning up the PVCs too
kubectl delete pvc data-web-1 data-web-2

Thanks to this safeguard, even if an operational mishap reduces a StatefulSet’s replicas by mistake, the data stays alive. From K8s 1.27 you can change this behavior with spec.persistentVolumeClaimRetentionPolicy (its whenScaled and whenDeleted settings each accept Retain or Delete), but from a data-preservation standpoint, leaving the default Retain unchanged is the safer choice.
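
For reference, a sketch of that policy with the default values written out explicitly. Setting either field to Delete instead would remove the PVCs automatically in the corresponding situation.

```yaml
# Excerpt of a StatefulSet spec; both values shown are the defaults.
spec:
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain    # keep the PVCs of Pods removed by a scale-down
    whenDeleted: Retain   # keep the PVCs when the StatefulSet itself is deleted
```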

The pattern of running a stateful workload like a DB directly on K8s in production is covered once more in Chapter 18, CRD and the Operator pattern through the Operator model (e.g., CloudNativePG, the Zalando Postgres Operator). Because a StatefulSet alone makes it hard to operate backup, failover, and recovery, you usually stack one more domain-specific controller on top of it.

DaemonSet — exactly one per node #

In a production cluster there are workloads that “must look into each node’s state from within that node”: Fluent Bit, which gathers a node’s container logs and ships them to a central place; Node Exporter, which measures a node’s CPU, memory, and disk and exposes them to Prometheus; and the CNI agents (Calico, Cilium, etc.) that build the network between Pods. What these workloads have in common is that there should be exactly as many of them as there are nodes.

Deployment’s replicas: N can’t express this intent. Every time the node count goes up or down, a person has to adjust N by hand, and you can’t prevent situations where two of the same Pod come up on one node or a node has none at all.

What DaemonSet solves is simple — it runs exactly one of its Pods on each node in the cluster. When a new node joins the cluster it automatically runs one on that node too, and when a node leaves, that node’s Pod disappears with it.

The DaemonSet manifest #

The biggest difference is that there is no replicas field.

node-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.8.2
          args:
            - --path.rootfs=/host
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: rootfs
              mountPath: /host
              readOnly: true
      volumes:
        - name: rootfs
          hostPath:
            path: /

It’s the same selector + template structure as Deployment, but there’s no replicas. The count is decided by the number of nodes. hostNetwork: true and the hostPath volume are patterns you see often in DaemonSet workloads — many of these workloads need to expose a Pod directly via the node’s network interface, or to look directly into the node’s filesystem.

checking the DaemonSet
kubectl get ds -n monitoring
kubectl get pods -n monitoring -o wide
example output
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
node-exporter   3         3         3       3            3           <none>          2m

NAME                  READY   STATUS    RESTARTS   AGE   IP           NODE
node-exporter-7xk2p   1/1     Running   0          2m    10.0.0.11    node-1
node-exporter-9mn4v   1/1     Running   0          2m    10.0.0.12    node-2
node-exporter-bc8qr   1/1     Running   0          2m    10.0.0.13    node-3

The key is that DESIRED 3 is a value decided automatically by the node count. Add one more node and it changes to DESIRED 4, and a new Pod comes up on that node automatically.

Running on only some nodes — nodeSelector / tolerations #

By default, a DaemonSet runs a Pod on every node it is allowed to schedule onto, which in most clusters means every worker node. But in operations it’s common to want a different set of nodes, such as running a GPU monitor only on GPU-equipped nodes, or extending an agent onto the tainted control-plane nodes as well.

With nodeSelector you can limit it to nodes that match a node label.

run only on GPU nodes — excerpt
spec:
  template:
    spec:
      nodeSelector:
        hardware: gpu

Conversely, to also run on tainted nodes (e.g., the control plane), you write tolerations.

also run on control-plane nodes — excerpt
spec:
  template:
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule

In fact, in many clusters (those built with kubeadm, for example) the kube-proxy running in the kube-system namespace is a DaemonSet. Because it must run on every node including the control-plane nodes, it carries tolerations that cover the control-plane taint. It’s worth checking once with kubectl get ds -n kube-system.

When a node is cordoned / drained #

The commands you commonly use when servicing a node are kubectl cordon and kubectl drain. cordon only marks the node unschedulable so that no new Pods land on it, while drain cordons the node and then evicts the Pods running on it so that their controllers recreate them on other nodes. DaemonSet Pods are the exception: since their whole purpose is to run one per node, there’s no point recreating them elsewhere, and the DaemonSet controller would put one straight back on the node anyway. By default, drain refuses to proceed when it finds DaemonSet-managed Pods on the node, so the standard pattern is to add the --ignore-daemonsets flag.

servicing a node — ignore DaemonSets
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

The safe usage pattern for the node-upgrade flow is covered in Chapter 30, Upgrade strategy, together with PodDisruptionBudget and terminationGracePeriodSeconds.

Job — work that runs once and finishes #

A DB schema migration, a one-off data-consistency check, the initial setup script for a new cluster. For this kind of work, finishing is the end of it. But what happens if you run a migration container with a Deployment manifest? The moment the container terminates normally (exit 0), Deployment goes “why did it die?” and brings it back up. The migration becomes an accident that loops forever.

Job is the controller for this scenario. It’s the exact opposite model from Deployment in that it treats a Pod terminating successfully as the normal outcome.

The Job manifest #

db-migration-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 4
  activeDeadlineSeconds: 600
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: migrator
          image: myapp/migrator:1.4.0
          command: ["./migrate.sh"]
          env:
            - name: DB_HOST
              value: postgres.default.svc.cluster.local

What’s new is that apiVersion is batch/v1. The Deployment family was apps/v1, but Job / CronJob are a separate group. Let’s note the key fields one line each.

  • completions: 1 is the number of Pods that must terminate successfully. The example above is done after one. When you split large data into N pieces to process, you set it to N.
  • parallelism: 1 is the number of Pods running at the same time. With completions: 10 and parallelism: 3, it processes 10 but runs at most 3 at a time in parallel.
  • backoffLimit: 4 is the upper bound on retries when a Pod fails. The default is 6. Once this limit is reached, the Job itself is closed out as Failed.
  • activeDeadlineSeconds: 600 is the time limit for the whole Job. If it doesn’t finish within 600 seconds, its running Pods are terminated and the Job is marked Failed, even if retries remain under backoffLimit. It’s a safeguard that cuts off a migration stuck in an infinite loop.
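
The completions: 10 / parallelism: 3 combination mentioned above can be sketched as the following Job. The Job name, image, and script are hypothetical, and completionMode: Indexed is an optional addition that this chapter does not otherwise cover; it is shown only because it is one way for each Pod to learn which of the 10 pieces it owns.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: chunk-processor                  # hypothetical name
spec:
  completions: 10                        # 10 successful Pods in total
  parallelism: 3                         # at most 3 Pods running at once
  completionMode: Indexed                # optional: each Pod gets JOB_COMPLETION_INDEX (0..9)
  backoffLimit: 4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: myapp/chunk-worker:1.0.0   # hypothetical image
          command: ["./process-chunk.sh"]   # hypothetical script; would read JOB_COMPLETION_INDEX
```

Without completionMode: Indexed the Pods are indistinguishable, so the work would have to be divided some other way, for example by pulling items from a queue.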

The constraint on restartPolicy #

A Pod’s restartPolicy usually has three options, Always, OnFailure, and Never, but Always is not allowed in a Job’s Pod template. If you write Always in the manifest, the apiserver rejects it.

The reason is simple. Always means “bring the Pod back up no matter how it ends (success or failure),” but a Job is a workload that expects to terminate. Allowing Always would bring it back up even on success, which erases the meaning of a Job. So you can use only one of OnFailure (retry only on failure) or Never (never retry, recreate with a new Pod).

The difference between the two is subtle — OnFailure restarts just the container inside the same Pod, while Never marks that Pod as failed and recreates a new Pod. If you want to preserve logs for debugging, Never is the usual choice; if you want fast retries, OnFailure is.

Checking how a Job behaves #

creating a Job and checking progress
kubectl apply -f db-migration-job.yaml
kubectl get jobs
kubectl get pods --selector=job-name=db-migration
example output — in progress
NAME           COMPLETIONS   DURATION   AGE
db-migration   0/1           20s        20s

NAME                  READY   STATUS    RESTARTS   AGE
db-migration-xkz2p    1/1     Running   0          20s
example output — after completion
NAME           COMPLETIONS   DURATION   AGE
db-migration   1/1           45s        2m

NAME                  READY   STATUS      RESTARTS   AGE
db-migration-xkz2p    0/1     Completed   0          2m

COMPLETIONS 1/1 showing up and the Pod closing out as Completed is the shape of normal termination. You can get the migration output directly with kubectl logs db-migration-xkz2p. A Job stays in the cluster unless you clean it up explicitly with kubectl delete job db-migration — leave it if you want to keep it around as history, or add ttlSecondsAfterFinished to have it cleaned up automatically.
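
As a sketch of the automatic cleanup option, ttlSecondsAfterFinished sits at the top level of the Job spec. The one-hour value here is an arbitrary assumption; pick whatever window leaves you enough time to read the logs.

```yaml
# Excerpt of db-migration-job.yaml with automatic cleanup added.
spec:
  ttlSecondsAfterFinished: 3600   # assumed value: delete the Job and its Pods 1 hour after it finishes
  backoffLimit: 4
  activeDeadlineSeconds: 600
```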

CronJob — periodic execution #

A DB backup every day at 3 a.m., temp-file cleanup every hour on the hour, a statistics report every Monday morning. This pattern is CronJob. The model is simple — it creates a Job object at the times set by a cron expression. It’s the shape of one more cron-scheduler layer stacked on top of Job.

The CronJob manifest #

db-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
spec:
  schedule: "0 3 * * *"
  timeZone: "Asia/Seoul"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  startingDeadlineSeconds: 300
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 1800
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: myapp/backup:2.1.0
              command: ["/usr/local/bin/backup.sh"]
              env:
                - name: S3_BUCKET
                  value: my-backups

The key to a CronJob manifest is two layers — the scheduling fields in the outer spec, and the Job definition in the inner jobTemplate. The inner jobTemplate looks exactly like the spec of the Job manifest we saw above.

Let’s note the key fields of the outer layer.

  • schedule: "0 3 * * *" is a standard 5-field cron expression. The fields are, in order, minute, hour, day-of-month, month, day-of-week. This example is every day at 3 a.m. on the dot. You use ordinary cron syntax directly, like */15 * * * * (every 15 minutes) or 0 9 * * 1-5 (9 a.m. on weekdays).
  • timeZone: "Asia/Seoul" has been a stable field since 1.27. Before that, a CronJob’s schedule followed the local time zone of kube-controller-manager, which is commonly UTC, and accidents like “why does the 3 a.m. backup run at noon” were frequent. Specifying this field removes that ambiguity.
  • concurrencyPolicy is the policy for when the previous run’s Job hasn’t finished yet but the time for a new run arrives. The default is Allow.
  • successfulJobsHistoryLimit / failedJobsHistoryLimit set how many successful and failed Job objects to keep in the cluster. The defaults are 3 and 1 respectively. Set too large, Jobs accumulate in etcd.
  • startingDeadlineSeconds: 300 means that if a run can’t start within this many seconds after its scheduled time, that run is counted as missed and skipped. It’s a safeguard that keeps a long-overdue run from firing at an unexpected time when the control plane stops for a while and then recovers.

The three concurrencyPolicy options #

Leaving the default Allow unchanged makes operational accidents easy. The behavior of the three options differs clearly.

| Policy | Behavior |
| --- | --- |
| Allow (default) | Creates a new run’s Job even if the previous run’s Job hasn’t finished. Several can be running at the same time |
| Forbid | If the previous run hasn’t finished, this run is skipped |
| Replace | Kills the previous run’s Job and replaces it with the new run |

For a workload like a DB backup where two things must not touch the same data at once, Forbid is the right answer. If the schedule is every hour on the hour and a backup slows down to 90 minutes, leaving it at Allow starts a new backup while the previous one is still running, and the overlapping runs pile up. For a workload where “only the latest run needs to be alive” (e.g., cache warming), Replace is the fit.
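
For contrast with the Forbid backup manifest, a cache-warming CronJob using Replace might look like the following sketch. The name, image, script, and 15-minute schedule are all assumptions for illustration.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cache-warmer                       # hypothetical name
spec:
  schedule: "*/15 * * * *"                 # assumed: every 15 minutes
  concurrencyPolicy: Replace               # a still-running previous Job is killed and replaced
  startingDeadlineSeconds: 300
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: warmer
              image: myapp/cache-warmer:1.0.0   # hypothetical image
              command: ["./warm-cache.sh"]      # hypothetical script
```

Replace only makes sense when an interrupted run leaves nothing half-written behind, which is why it suits cache warming and not backups.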

The risk when startingDeadlineSeconds is missing #

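One subtle trap of CronJob is startingDeadlineSeconds. If this field is missing or set too large, and the control plane is stopped for a long while and then recovers, the controller treats an overdue run as still startable. It does not replay every missed run (a CronJob that runs every minute and was stopped for an hour does not create 60 Jobs when it wakes up), but it can start the most recent missed run long after its scheduled time, so a 3 a.m. backup may fire in the middle of the business day. The Kubernetes documentation also notes that when the field is unset the controller counts missed runs all the way back to the last scheduled time, and that more than 100 missed runs is treated as an error.

For CronJobs in a production cluster, it’s safe to almost always set startingDeadlineSeconds to a reasonable value (e.g., 300 seconds). If a run couldn’t start within that window, simply skipping that run is, in nearly every case, better than starting it hours late when the control plane wakes up.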

Checking how a CronJob behaves #

the CronJob and the Jobs and Pods beneath it
kubectl get cronjob,jobs,pods
example output — after one run
NAME                      SCHEDULE      TIMEZONE      LAST SCHEDULE   AGE
cronjob.batch/db-backup   0 3 * * *     Asia/Seoul    20m             2d

NAME                            COMPLETIONS   DURATION   AGE
job.batch/db-backup-29345400    1/1           14m        24h
job.batch/db-backup-29346840    1/1           13m        20m

NAME                                  READY   STATUS      RESTARTS   AGE
pod/db-backup-29346840-7kxqr          0/1     Completed   0          20m

You can see the three-tier shape — there’s one CronJob, under it a Job object is created for each run, and under each Job a Pod comes up once and closes out as Completed. Even after a Job finishes, as many objects as successfulJobsHistoryLimit remain so you can use them for post-hoc debugging.

When to use which controller #

Adding Part 1’s Deployment, we organize the five controllers in one table.

| Controller | Suitable workload | Pod identifier | Termination model |
| --- | --- | --- | --- |
| Deployment | stateless web and API servers, worker consumers | arbitrary value (web-abc-aa11) | brought back up when it dies |
| StatefulSet | DB, message queue broker, distributed cache | web-0, web-1 (fixed) | brought back up with the same index when it dies |
| DaemonSet | node agent, log collector, CNI | one per node | brought back up when it dies |
| Job | DB migration, one-off batch | arbitrary value | done once it ends successfully |
| CronJob | periodic backup, cleanup, report | a Job per run | each run is Job’s termination model |

The mental decision tree is simple.

  • Can the Pods be interchangeable with one another? — if not, it’s StatefulSet; if so, go to the next question.
  • Must exactly one run per node? — if so, it’s DaemonSet; if not, go to the next question.
  • Should it run once and finish? — if so, it’s CronJob if periodic and Job if one-off; otherwise it’s Deployment.

Once you know these four controllers, you can read the intent of any kind: in a cluster’s manifest directory in a single line.

Exercises #

  1. Following the body above, apply web-headless.yaml (the headless Service) and web-statefulset.yaml with replicas: 3, then check with kubectl get pods,pvc -l app=web how the Pods and PVCs are paired. Next, delete the middle Pod with kubectl delete pod web-1, and record whether the freshly recreated Pod’s name and the mounted PVC are the same. In one paragraph, note how this differs from the self-healing of Chapter 4’s Deployment.
  2. Let’s deliberately make a DB migration Job fail — have it return exit 1 in a way like command: ["false"], set backoffLimit: 2, and kubectl apply. Record the kubectl get pods --selector=job-name=db-migration output in time order to note how the retries happen and how many failures it takes for the Job to finally close out as Failed, matching it against the backoffLimit explanation in §“The Job manifest.”
  3. Change the CronJob manifest’s schedule to */1 * * * * (every minute) and its concurrencyPolicy to Allow / Forbid / Replace one at a time, make each run last longer than a minute (for example with command: ["sleep", "90"], keeping activeDeadlineSeconds above that), and apply. For each case, organize in a table how the number of Jobs running at the same time differs in the kubectl get jobs output, and in one paragraph, in your own words, summarize which policy is safe for a workload like a DB backup, matching it against §“The three concurrencyPolicy options.”

In one line: For the four kinds of workload Deployment’s stateless assumption can’t express, K8s has split out four controllers — StatefulSet (identity and 1:1 PVC), DaemonSet (one per node), Job (a model that treats successful termination as normal), and CronJob (a cron scheduler on top of Job). The decision tree splits on three questions: “are the Pods interchangeable / is it per-node / does it expect to terminate.”

Next chapter #

In this chapter we noted in one line that a StatefulSet’s volumeClaimTemplates creates PVCs automatically, but we didn’t cover which disk those PVCs actually connect to, or how. In a production cluster, behind that one line lies the triangle of PV (PersistentVolume), PVC (PersistentVolumeClaim), and StorageClass: how the Pod’s lifecycle is separated from the disk’s lifecycle, how a disk is created dynamically, what the difference is between the accessModes (ReadWriteOnce, ReadOnlyMany, ReadWriteMany), and how reclaimPolicy handles the disk when a PVC disappears.

Chapter 9, PV / PVC / StorageClass lays out the relationship of these three objects and follows what a StatefulSet’s volumeClaimTemplates actually creates on top of them.
