K8s Intermediate #2: PV / PVC / StorageClass — The Persistent Data Model

Infrastructure Kubernetes PV PVC StorageClass

Thursday, April 23, 2026

The second post in the K8s Intermediate series. Through #6 of Basics, what we pulled out of the manifest was config and secrets. Image tags, DB hosts, and API keys all became external objects so the workload definition wasn’t tied to an environment. But one dimension still remains: the data itself. A container’s filesystem is ephemeral space that vanishes with the container, while DB data, user uploads, and Prometheus metric time series have to outlive the Pod. This post walks end to end through how K8s expresses that persistent disk model with the triangle of PersistentVolume, PersistentVolumeClaim, and StorageClass.

This series is K8s Intermediate, 7 posts.

#1 StatefulSet / DaemonSet / Job / CronJob — Controllers beyond Deployment
#2 PV / PVC / StorageClass — the persistent data model ← this post
#3 Ingress and Ingress Controller — the external entry point
#4 resources.requests / limits — Pod resource requests and limits
#5 Health checks — liveness / readiness / startup probes
#6 Autoscaling — HPA / VPA / Cluster Autoscaler
#7 RBAC / NetworkPolicy / ResourceQuota — security and resource policy

The ephemerality of container filesystems — the starting point #

In #1 we noted “a 1:1 persistent volume per Pod” as one thing StatefulSet solves, but pushed the details of PVC to the next post. This post is that follow-up. Start from first principles — why we need an object called persistent volume.

The default container filesystem is ephemeral space sealed inside the container. When the container terminates, the files inside disappear with it. Even when a container restarts inside the same Pod, its filesystem starts over fresh. When the Pod itself moves to a different node, that ephemeral space is even more clearly gone. Deployment’s model from Basics #4 was “Pods can die and come back any time,” which is natural for stateless workloads but problematic for workloads that have to hold state.

To keep data alive, you write to a disk outside the Pod’s filesystem. That disk has to outlive the Pod and be remountable when a new Pod comes up. The way K8s expresses that requirement is the separation of PV / PVC / StorageClass into three objects.

Non-persistent volumes like emptyDir exist too #

Not every K8s volume is persistent. There are also volumes like emptyDir that only live as long as the Pod — used for two containers in the same Pod to swap files, or as scratch space for big temp file work. emptyDir disappears when the Pod disappears. This post is about the opposite — disks that survive separately from the Pod’s lifecycle.
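
For contrast with the persistent objects that follow, here is a minimal sketch of a Pod whose two containers share an emptyDir. The Pod name, container names, image, and paths are illustrative assumptions, not taken from the series.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo            # hypothetical name
spec:
  containers:
    - name: writer
      image: busybox:1.36
      # Appends a timestamp to a file in the shared directory every 5 seconds
      command: ["sh", "-c", "while true; do date >> /scratch/log.txt; sleep 5; done"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
    - name: reader
      image: busybox:1.36
      # Follows the same file through its own mount of the volume
      command: ["sh", "-c", "touch /scratch/log.txt; tail -f /scratch/log.txt"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir: {}              # lives and dies with the Pod
```

A container restart inside the Pod keeps log.txt, because emptyDir is tied to the Pod rather than to a container. Deleting the Pod discards it.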

The triangle — PV / PVC / StorageClass #

Each object’s responsibility, one line:

Object	What it is	Scope	Who creates it
PersistentVolume (PV)	A representation of the disk itself. A piece of storage in the cluster.	Cluster-scoped	An admin creates it directly, or a StorageClass provisions it dynamically
PersistentVolumeClaim (PVC)	A request that says “give me this much disk in this mode”	Namespace-scoped	An app developer writes it in the manifest
StorageClass (SC)	A blueprint for how to make a PV when a PVC arrives	Cluster-scoped	An admin creates it ahead of time

This separation is the heart of K8s’s persistent data model. App developers only write PVCs — “I need a 5Gi RWO disk.” Which cloud, which disk type satisfies that request lives in the SC, and the actual disk’s representation maps to a PV. Thanks to this, the same manifest can flow to EBS on AWS, PD on GCP, NFS or Ceph on-prem.

A mental picture:

How a PVC binds to a PV

App manifest
   │
   ▼
PVC (5Gi, RWO, storageClassName=fast-ssd)
   │
   ├── (static) Find a matching PV among admin-created ones, Bound
   │
   └── (dynamic) StorageClass(fast-ssd)'s provisioner
              creates a new disk and registers it as a PV → Bound to that PV

Bound means a PVC and a PV are paired 1:1. One PVC binds to one PV; one PV binds to one PVC. A Pod looks at the PVC for mounting, never the PV directly. That single layer of indirection lets the workload manifest stay the same when the disk backend changes.

Static provisioning — the simplest model #

Before dynamic provisioning, walk through the simplest model — an admin creates a PV by hand and a PVC binds to it. This shape isn’t used much in production, but it’s the best starting point to understand the matching rules between PV and PVC.

On local clusters like minikube or kind, hostPath — which borrows a path on the node’s host filesystem as the PV’s backing storage — is common. Not appropriate for production but plenty for learning.

pv-static.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-local-1g
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual
  hostPath:
    path: /mnt/data/pv-local-1g

spec.capacity.storage is the disk size, spec.accessModes is how many can mount in what mode (covered shortly), and persistentVolumeReclaimPolicy is what to do with the PV when its PVC disappears. storageClassName: manual is a marker meaning “this PV was created by hand, not provisioned dynamically.” No StorageClass object named manual has to exist; the value is just a class name (a field, not a label) that must match the PVC’s storageClassName for binding.

The PVC to pair with it:

pvc-static.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: manual

The rules a PVC uses to pick a PV are straightforward. All three must match for binding:

storageClassName matches — both sides above are manual. PVC’s storageClassName: "" (empty string) means “PVs without an SC only”; omitting the field means “follow the cluster’s default SC.”
PV satisfies the PVC’s accessModes — PVC requests RWO and PV supports both RWO/RWX → OK; the reverse (PV only RWO, PVC requests RWX) doesn’t match.
PV’s capacity is at least the PVC’s request — PVC requests 1Gi and PV has 1Gi or more → match. A large PV bound to a small PVC wastes the difference.

Apply and check status

kubectl apply -f pv-static.yaml
kubectl apply -f pvc-static.yaml
kubectl get pv,pvc

Example output

NAME                          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM           STORAGECLASS   AGE
persistentvolume/pv-local-1g  1Gi        RWO            Retain           Bound    default/data    manual         10s

NAME                         STATUS   VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/data   Bound    pv-local-1g   1Gi        RWO            manual         5s

Both PV and PVC at STATUS: Bound, holding each other’s name in the CLAIM / VOLUME columns — this is the normal shape. Mount this PVC in a Pod and inside the container it looks like a regular directory, but it’s actually mapped to /mnt/data/pv-local-1g on the node.

1:1 explicit binding — claimRef / volumeName #

The matching above is K8s automatically finding by storageClassName + accessModes + capacity. To pin a specific PV to a specific PVC explicitly:

Write the PVC’s namespace + name in the PV’s spec.claimRef
Write the PV’s name in the PVC’s spec.volumeName

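When both are present, K8s skips the search for other candidates and binds the pair directly, though it still checks that the storage class, access modes, and requested size are compatible. Setting claimRef on the PV also reserves it, so no other PVC can grab it first. Rare in production, but it shows up in migration or disk recovery scenarios.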
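
As a sketch of that explicit pairing, the static PV and PVC from earlier can be pinned to each other as follows. Only the claimRef and volumeName fields are new; everything else repeats the manifests shown above.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-local-1g
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual
  claimRef:                     # reserve this PV for one specific PVC
    namespace: default
    name: data
  hostPath:
    path: /mnt/data/pv-local-1g
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: default
spec:
  volumeName: pv-local-1g       # ask for this PV by name
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```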

accessModes — who can mount how #

accessModes expresses the disk’s concurrent mount capability. There are four:

Mode	Short	Meaning
`ReadWriteOnce`	RWO	Mount read-write on one node. Multiple Pods on the same node can mount together
`ReadOnlyMany`	ROX	Mount read-only across multiple nodes simultaneously
`ReadWriteMany`	RWX	Mount read-write across multiple nodes simultaneously
`ReadWriteOncePod`	RWOP	Mount read-write from exactly one Pod across the entire cluster (alpha in 1.22, stable since 1.29; CSI volumes only)

In production, RWO and RWX are the two you see most often. Which mode is possible depends on the type of backend disk:

Backend	Supported modes	Note
AWS EBS	RWO	Attaches to one node in one AZ
GCP Persistent Disk	RWO, ROX	Read-write on one node; read-only on many nodes
Azure Disk	RWO	Single node
AWS EFS / GCP Filestore / Azure Files	RWX	NFS-based file storage
On-prem NFS	ROX, RWX	File storage
Ceph RBD (block)	RWO	Block
CephFS / GlusterFS	RWX	File

In one line — block storage is RWO, file storage can go up to RWX. Workloads that need RWX (multiple Pods sharing one directory, e.g., WordPress’s uploads dir, a shared cache) need an NFS-class backend. PVCs requesting RWX against block disks will never bind.

ReadWriteOncePod — preventing DB split-brain #

ReadWriteOncePod, introduced as alpha in 1.22 and stable since 1.29, is a tighter constraint than RWO. RWO is “multiple Pods on one node can mount together”; RWOP is “exactly one Pod across the entire cluster.” It is used as a safety net for workloads like databases, where two processes touching the same data files concurrently corrupt them. It also blocks the accident where a second Pod in the same namespace lands on the same node and mounts the same PVC; under RWOP that second Pod stays Pending instead of starting.
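
A minimal sketch of a claim that asks for this mode, assuming a cluster recent enough to support it and a CSI-backed StorageClass named fast-ssd. The PVC name and size are hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-data                 # hypothetical name
  namespace: default
spec:
  accessModes:
    - ReadWriteOncePod          # at most one Pod in the cluster may use this claim
  storageClassName: fast-ssd    # assumes a CSI-backed SC with this name exists
  resources:
    requests:
      storage: 20Gi
```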

StorageClass and dynamic provisioning #

Static provisioning has an obvious overhead — the cluster admin has to create PVs by hand in advance. Every time a new disk is needed, the human creates a disk in the cloud console, writes a PV manifest, and applies it. In production clusters, this shape becomes a bottleneck.

StorageClass fills that gap. Create an SC once up front, and when a PVC referencing that SC arrives, K8s’s provisioner automatically creates the disk, registers it as a PV, and binds it to the PVC. The human is removed from the PV step.

storageclass-fast.yaml — AWS EBS example

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

A PVC referencing this SC becomes lighter:

pvc-dynamic.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: fast-ssd

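Once this PVC is picked up for provisioning, the following happens automatically. With volumeBindingMode: Immediate that is as soon as the PVC is applied; with the WaitForFirstConsumer SC above, the PVC stays Pending until the first Pod that mounts it is scheduled: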

K8s reads the PVC’s storageClassName and finds the fast-ssd SC.
The SC’s provisioner (ebs.csi.aws.com) is called and a new 5Gi gp3 EBS volume is created.
That EBS volume is registered as a PV object automatically.
The new PV becomes Bound to the PVC.

A short table of provisioners by environment:

Environment	provisioner	Default SC
AWS EKS	`ebs.csi.aws.com` (block), `efs.csi.aws.com` (file)	gp2 EBS class preinstalled; a gp3 class is admin-defined
GCP GKE	`pd.csi.storage.gke.io` (block), `filestore.csi.storage.gke.io` (file)	balanced PD
Azure AKS	`disk.csi.azure.com`, `file.csi.azure.com`	Standard SSD
minikube	`k8s.io/minikube-hostpath`	hostPath
kind	`rancher.io/local-path`	hostPath
On-prem	NFS subdir, Ceph RBD/CSI, Longhorn, etc.	varies by environment

Production clusters usually have one default SC designated, and PVCs that omit storageClassName fall back to it. Which SC is the default is marked by the annotation storageclass.kubernetes.io/is-default-class: "true".

Check the default SC

kubectl get sc

Example output

NAME                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
fast-ssd (default)   ebs.csi.aws.com         Retain          WaitForFirstConsumer   true                   10d
slow-hdd             ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   10d

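The SC marked (default) is the default. If two SCs in a cluster are both marked default, the outcome depends on the version: older releases reject PVCs that omit storageClassName, while 1.26 and later pick the most recently created default. Either way it is a source of surprises, so the standard ops practice is to keep only one as default.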
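
As a sketch, marking fast-ssd as the default and then relying on it looks roughly like this. The SC repeats the earlier manifest with the annotation added; the PVC name data-default is hypothetical.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # marks this SC as the cluster default
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-default            # hypothetical name
  namespace: default
spec:
  # storageClassName is omitted on purpose: the claim falls back to the default SC
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```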

Four key fields in StorageClass #

The four SC manifest fields you’ll touch most often:

provisioner #

Which CSI driver creates the disk. As shown in the table above, the value is fixed per cluster environment. CSI (Container Storage Interface) is the standard interface K8s uses to talk to external storage drivers. It has been stable since 1.13, and since then the in-tree drivers (legacy ones bundled in K8s itself) have been progressively migrated to external CSI drivers. Provisioners you touch in new clusters are almost all *.csi.*.

reclaimPolicy #

The policy for what to do with a PV (and the actual disk behind it) when the PVC bound to it disappears. Two practical choices:

Value	Behavior
`Delete`	Delete the PV and the cloud disk along with the PVC. Default for cloud dynamic provisioning
`Retain`	PV stays in `Released` state, cloud disk preserved. A human has to clean up intentionally to reuse

You may see Recycle in older docs, but it’s deprecated — don’t use it in new manifests.

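Delete is convenient but dangerous. In a production cluster, if someone accidentally deletes a PVC, the disk goes with it, and the data is unrecoverable unless a snapshot or backup exists. So the production pattern is often to set the SC’s reclaimPolicy to Retain. Even when the PVC is deleted, the PV stays in Released state with the data preserved. When cleanup is needed, a human deletes the PV explicitly; to recover the data, clear the PV’s spec.claimRef so it becomes Available again and bind it to a new PVC. Note that the SC’s reclaimPolicy is copied into each PV when it is provisioned, so changing the SC later does not affect PVs that already exist.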

Retain — PV state after PVC deletion

NAME           CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM             AGE
pv-...         5Gi        RWO            Retain           Released   default/data      30m

Released means the PVC binding has been severed but the disk is still alive. If you don’t plan to reuse this PV, run kubectl delete pv ... to clean up explicitly — note that the cloud disk is not auto-deleted, so you’ll need to delete it from the cloud console to stop being billed for it.

volumeBindingMode #

The policy for when the PV (and disk) is created after a PVC is created.

Value	Behavior
`Immediate`	Create the disk as soon as the PVC is created
`WaitForFirstConsumer`	Create the disk after the Pod that mounts the PVC is scheduled to a node

Immediate is simple but causes incidents in multi-AZ environments. AWS EBS, for example, is a disk pinned to one AZ. With Immediate, the disk is created before the scheduler knows anything about the Pod, so it may land in ap-northeast-2a even though the Pod’s other constraints (node selectors, affinity, available capacity) only fit nodes in ap-northeast-2c. A node in 2c can never attach that disk, so the scheduler finds no feasible node. The manifest looks fine, but the Pod gets stuck Pending.

WaitForFirstConsumer blocks that incident at the source. The PVC is created but the disk waits — when the Pod that mounts the PVC appears, the scheduler decides which node (which AZ) it lands on, and only then is the disk created in that AZ. The safe default for production, and the SC manifest above sets it. Single-AZ clusters can use Immediate without much pain, but multi-AZ environments practically require WaitForFirstConsumer.

allowVolumeExpansion #

Setting this to true lets you grow the disk later by raising the PVC’s spec.resources.requests.storage. The default is false. Because disks are awkward to shrink at runtime but commonly need to grow, it is almost always worth setting it to true when first creating the SC, and there is rarely a reason to flip it back. Detailed behavior is in a later section.

Mounting a PVC in a Pod #

Once a PVC exists, a Pod can mount it. (Under WaitForFirstConsumer the PVC stays Pending until that first Pod is scheduled, then becomes Bound.) The Pod manifest shape is identical across any workload (Pod / Deployment / StatefulSet): reference the PVC in spec.volumes and attach it to a path inside the container with spec.containers[].volumeMounts.

deployment-with-pvc.yaml — excerpt

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          volumeMounts:
            - name: html
              mountPath: /usr/share/nginx/html
      volumes:
        - name: html
          persistentVolumeClaim:
            claimName: data

spec.volumes[].persistentVolumeClaim.claimName points to the PVC data we made above, and the volume mounts at /usr/share/nginx/html inside the container. From the container’s perspective, that path is just a regular directory, and files written there survive Pod death and restart.

When multiple Pods mount one PVC #

What happens if you raise the Deployment above to replicas: 2? If the PVC data is RWO with a block backend like EBS, simultaneous mount only works when both Pods land on the same node. A second Pod that lands on a different node cannot attach the disk because it is already attached elsewhere; it typically sits in ContainerCreating with a Multi-Attach error in its events.

Two ways to avoid that:

Move the workload to an RWX backend (NFS / EFS, etc.): create a PVC with accessModes: ReadWriteMany so multiple Pods can share it
Move the workload to a StatefulSet so each Pod uses its own PVC: each Pod owns a disk, which is the next section’s volumeClaimTemplates

Stateless web servers without disk-backed data don’t need a PVC at all, so this incident doesn’t apply. The moment you need to mount a PVC is when you have to pick one of these two paths.
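
A sketch of the first path, assuming an admin has already defined a file-backed StorageClass. The SC name shared-files and the PVC name shared-html are hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-html               # hypothetical name
  namespace: default
spec:
  accessModes:
    - ReadWriteMany               # read-write from Pods on multiple nodes
  storageClassName: shared-files  # assumed SC backed by NFS / EFS / Filestore
  resources:
    requests:
      storage: 5Gi
```

Pointing the Deployment’s claimName at shared-html instead of data would then let replicas: 2 spread across nodes, since the backend accepts mounts from more than one node.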

Revisiting StatefulSet’s volumeClaimTemplates #

Open up the StatefulSet manifest from #1 again:

web-statefulset.yaml — excerpt

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 1Gi

volumeClaimTemplates is precisely a template that auto-generates one PVC per Pod. Apply this manifest with replicas: 3 and K8s does:

When creating Pod web-0, auto-create PVC data-web-0.
When creating Pod web-1, auto-create PVC data-web-1.
When creating Pod web-2, auto-create PVC data-web-2.

The PVC name rule is <volumeClaimTemplates.metadata.name>-<statefulset.metadata.name>-<ordinal>. In the example, the template name is data and the StatefulSet name is web, giving data-web-0, data-web-1, data-web-2.

Each PVC goes through the dynamic provisioning of the SC named fast-ssd and maps to a PV. On AWS, three new EBS volumes are created and 1:1 attached to one Pod each. When a Pod dies and comes back, the same index (e.g., web-0) re-mounts the same PVC (data-web-0), so the data is still there.

This volumeClaimTemplates + WaitForFirstConsumer SC combo runs cleanly in the multi-AZ environment from #1 too — the EBS volume is created in whatever AZ the Pod was scheduled to, blocking the AZ mismatch incident at the source.

PVC retention on scale-down is separate from reclaimPolicy #

The “PVCs survive scale-down” behavior pinned in #1 is easy to confuse with the PV’s reclaimPolicy. They’re policies at different layers:

PVC retention on StatefulSet scale-down: the StatefulSet controller’s behavior. Lowering replicas doesn’t delete the PVC objects by default. You can change this with spec.persistentVolumeClaimRetentionPolicy (beta and enabled by default since 1.27, stable in 1.32).
PV (and disk) handling when a PVC is deleted: the PV’s reclaim policy, inherited from the SC’s reclaimPolicy at provisioning time. This decides what happens to the PV (and the disk behind it) when the PVC actually disappears.

With the default retention policy, scale-down doesn’t delete the PVC, so the reclaim policy doesn’t fire; it engages only when the PVC is actually deleted, usually by a human. These two layers form a combined safety net that keeps data alive through ops accidents.
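
To make the first layer concrete, here is a sketch of the StatefulSet from above with the retention policy spelled out, assuming a cluster where the field is available. The chosen values are illustrative, not a recommendation.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain          # keep data-web-N when replicas is lowered (the default)
    whenDeleted: Delete         # delete the PVCs when the StatefulSet itself is deleted
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 1Gi
```

With whenDeleted: Delete, removing the StatefulSet deletes its PVCs, and only at that point does the second layer (the reclaim policy) decide whether the disks survive.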

PVC expansion — allowVolumeExpansion #

A PVC’s disk often runs out at runtime. With an SC set to allowVolumeExpansion: true, you can grow the disk by raising the PVC’s spec.resources.requests.storage.

Expand the PVC — kubectl edit

kubectl edit pvc data

The change

spec:
  resources:
    requests:
      storage: 10Gi   # 5Gi -> 10Gi

K8s automatically progresses these steps:

Asks the CSI driver to expand the cloud disk (e.g., EBS volume modification).
The cloud disk grows to the new size.
On the node where the volume is mounted, the kubelet has the CSI node plugin grow the filesystem (the equivalent of xfs_growfs / resize2fs); the container simply sees a larger mount.

Most CSI drivers do all three without restarting the Pod (online expansion). Some environment / filesystem combinations require a Pod restart, in which case the PVC status shows a condition like FileSystemResizePending. Restarting the Pod (e.g., a Deployment rolling update, or deleting one Pod in a StatefulSet) finalizes the step.

Disk shrinking is not supported by K8s. Lowering a PVC’s storage value is rejected. If you want to shrink, the only path is to create a smaller PVC, copy the data over, and clean up the old PVC.
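
One way to sketch that copy step is a one-off Job that mounts both claims. This assumes the workload using data has been scaled down first, that the data fits in the smaller size, and that busybox’s cp is adequate for the files involved. The names data-small and copy-data are hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-small              # hypothetical name for the smaller replacement
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 5Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: copy-data               # hypothetical name
  namespace: default
spec:
  backoffLimit: 0               # do not retry automatically; inspect failures by hand
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: copy
          image: busybox:1.36
          # Copy everything, preserving ownership, permissions, and timestamps
          command: ["sh", "-c", "cp -a /old/. /new/"]
          volumeMounts:
            - name: old
              mountPath: /old
              readOnly: true
            - name: new
              mountPath: /new
      volumes:
        - name: old
          persistentVolumeClaim:
            claimName: data
        - name: new
          persistentVolumeClaim:
            claimName: data-small
```

After the Job completes and the copy has been verified, the workload can be pointed at data-small and the old PVC cleaned up.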

Where backups and snapshots fit #

Sitting one layer above the PVC and PV model are the VolumeSnapshot and VolumeSnapshotClass objects. They express cloud disk snapshot capability as K8s manifests. They are not built-in types: the cluster needs the snapshot CRDs and the snapshot controller installed, plus a CSI driver with snapshot support (the AWS EBS / GCP PD / Azure Disk CSI drivers all support it).

volumesnapshot-data.yaml — short example

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snap-2026-05-09
spec:
  volumeSnapshotClassName: ebs-snap
  source:
    persistentVolumeClaimName: data

Create this object and the CSI driver takes a snapshot of the cloud disk and registers its handle as a VolumeSnapshotContent inside K8s. Restoring a new PVC from that snapshot later is also via manifest (point the new PVC’s spec.dataSource at the snapshot). The deeper details belong in the K8s in Practice track, but knowing how backup/restore stacks on top of the persistent data model is enough at this level.
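
A sketch of the two pieces around that snapshot: the VolumeSnapshotClass it refers to, and a new PVC restored from it. The class settings and the PVC name data-restored are assumptions for illustration.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snap
driver: ebs.csi.aws.com
deletionPolicy: Retain          # keep the cloud snapshot if the VolumeSnapshot object is deleted
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-restored           # hypothetical name
  namespace: default
spec:
  storageClassName: fast-ssd
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: data-snap-2026-05-09
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi              # must be at least the size of the snapshot's source volume
```

The restored PVC has to live in the same namespace as the VolumeSnapshot it references.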

In production, data backup usually goes one of two paths:

Tools on top of K8s VolumeSnapshot — K8s-native backup tools like Velero or Kasten K10 manage VolumeSnapshots in bundles.
App-level dumps — for DBs, app-level dumps like pg_dump / mysqldump / Redis RDB shipped to separate storage (S3, etc.). Stronger consistency than disk snapshots.

Summary #

What this post pinned down:

Container filesystems are ephemeral — they disappear with the Pod. Data like DBs, uploads, and metrics has to be split off to disks outside.
Three-object separation — PV (the disk itself, cluster-scoped), PVC (the request, namespace-scoped), StorageClass (the blueprint for how to make a PV, cluster-scoped). App developers only write the PVC; the SC and provisioner fill in the rest.
Static vs dynamic provisioning — static is humans creating PVs in advance and PVCs matching them; dynamic is the SC creating a PV automatically when a PVC arrives. Production standard is dynamic.
accessModes: block (EBS / PD / Azure Disk) is RWO; file (EFS / Filestore / NFS) goes up to RWX. ReadWriteOncePod (stable since 1.29) is exactly one Pod across the cluster.
Key SC fields — provisioner (which CSI), reclaimPolicy (Delete / Retain; Retain is safer for ops), volumeBindingMode (WaitForFirstConsumer is the safe multi-AZ default), allowVolumeExpansion (set to true from the start, recommended).
Pod mounting — reference the PVC in spec.volumes, attach to a path with spec.containers[].volumeMounts. Multiple Pods using one RWO PVC causes node mismatch incidents — fix with RWX or StatefulSet.
StatefulSet’s volumeClaimTemplates — a template that auto-creates one PVC per Pod. PVC name is <template>-<sts>-<ordinal>. Scale-down PVC retention (StatefulSet policy) and disk handling on PVC deletion (SC reclaimPolicy) are two separate layers.
Expansion and backup — grow disks via PVC storage on an allowVolumeExpansion: true SC; shrink not supported. Backup / snapshot is the place for VolumeSnapshot plus tools like Velero.

Once this model is in hand, whenever you encounter PV / PVC / SC objects in a cluster’s manifest directory, you can read at a glance who creates what and how the pieces bind together.

Next — Ingress and Ingress Controller #

What we covered through this post was the model of how data inside Pods survives within the cluster. The next post pivots the perspective to outside the cluster — how external traffic gets into the cluster’s Services.

Basics #5 noted that LoadBalancer is the standard for external entry among Service’s three types (ClusterIP / NodePort / LoadBalancer), but if a single cluster needs dozens of externally exposed Services, spinning up that many LoadBalancers becomes a burden in both cost and management. Routing requirements like “by domain” or “by path” can’t be solved at the LoadBalancer layer alone either.

#3 Ingress and Ingress Controller — the external entry point follows the model of Ingress, the object that concentrates that burden in one place, and the Ingress Controller (nginx / Traefik / GKE Ingress, etc.) that turns that manifest into actual traffic routing — in one cycle. HTTP / HTTPS routing, TLS termination, virtual hosts, and path-based routing as the shape of one manifest.