K8s Intermediate #2: PV / PVC / StorageClass — The Persistent Data Model
The second post in the K8s Intermediate series. Through #6 of Basics, what we pulled out of the manifest was config and secrets. Image tags, DB hosts, API keys all became external objects so the workload definition wasn’t tied to an environment. But one dimension still remains — the data itself. A container’s filesystem is ephemeral space that vanishes with the container, while DB data, user uploads, and Prometheus metric time series have to outlive the Pod. This post follows how K8s expresses that persistent disk model through the triangle of PersistentVolume, PersistentVolumeClaim, and StorageClass — in one cycle.
This series is K8s Intermediate, 7 posts.
- #1 StatefulSet / DaemonSet / Job / CronJob — Controllers beyond Deployment
- #2 PV / PVC / StorageClass — the persistent data model ← this post
- #3 Ingress and Ingress Controller — the external entry point
- #4 resources.requests / limits — Pod resource requests and limits
- #5 Health checks — liveness / readiness / startup probes
- #6 Autoscaling — HPA / VPA / Cluster Autoscaler
- #7 RBAC / NetworkPolicy / ResourceQuota — security and resource policy
The ephemerality of container filesystems: the starting point
In #1 we noted “a 1:1 persistent volume per Pod” as one thing StatefulSet solves, but pushed the details of PVC to the next post. This post is that follow-up. Start from first principles — why we need an object called persistent volume.
The default container filesystem is ephemeral space sealed inside the container. When the container terminates, the files inside disappear with it. Even when a container restarts inside the same Pod, its filesystem starts over fresh. When the Pod itself moves to a different node, that ephemeral space is even more clearly gone. Deployment’s model from Basics #4 was “Pods can die and come back any time,” which is natural for stateless workloads but problematic for workloads that have to hold state.
To keep data alive, you write to a disk outside the Pod’s filesystem. That disk has to outlive the Pod and be remountable when a new Pod comes up. The way K8s expresses that requirement is the separation of PV / PVC / StorageClass into three objects.
Non-persistent volumes like emptyDir exist too
Not every K8s volume is persistent. There are also volumes like emptyDir that only live as long as the Pod — used for two containers in the same Pod to swap files, or as scratch space for big temp file work. emptyDir disappears when the Pod disappears. This post is about the opposite — disks that survive separately from the Pod’s lifecycle.
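As a rough sketch of that file-swapping case, the Pod below shares one emptyDir between two containers. The Pod name, container names, images, and paths are hypothetical, chosen only for illustration.

```yaml
# Hypothetical example: names, images, and paths are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo
spec:
  containers:
    - name: writer
      image: busybox:1.36
      # Appends a timestamp to a file in the shared volume every 5 seconds.
      command: ["sh", "-c", "while true; do date >> /scratch/out.log; sleep 5; done"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
    - name: reader
      image: busybox:1.36
      # Follows the same file through its own mount of the volume.
      command: ["sh", "-c", "touch /scratch/out.log; tail -f /scratch/out.log"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir: {}   # created with the Pod, removed with the Pod
```

Deleting this Pod removes the volume and everything in it, which is exactly the behavior a PVC is meant to avoid.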
The triangle: PV / PVC / StorageClass
Each object’s responsibility, one line:
| Object | What it is | Scope | Who creates it |
|---|---|---|---|
| PersistentVolume (PV) | A representation of the disk itself. A piece of storage in the cluster. | Cluster-scoped | An admin creates it directly, or a StorageClass provisions it dynamically |
| PersistentVolumeClaim (PVC) | A request that says “give me this much disk in this mode” | Namespace-scoped | An app developer writes it in the manifest |
| StorageClass (SC) | A blueprint for how to make a PV when a PVC arrives | Cluster-scoped | An admin creates it ahead of time |
This separation is the heart of K8s’s persistent data model. App developers only write PVCs — “I need a 5Gi RWO disk.” Which cloud, which disk type satisfies that request lives in the SC, and the actual disk’s representation maps to a PV. Thanks to this, the same manifest can flow to EBS on AWS, PD on GCP, NFS or Ceph on-prem.
A mental picture:
```
App manifest
    │
    ▼
PVC (5Gi, RWO, storageClassName=fast-ssd)
    │
    ├── (static)  Find a matching PV among admin-created ones, Bound
    │
    └── (dynamic) StorageClass(fast-ssd)'s provisioner
                  creates a new disk and registers it as a PV → Bound to that PV
```
Bound means a PVC and a PV are paired 1:1. One PVC binds to one PV; one PV binds to one PVC. A Pod looks at the PVC for mounting, never the PV directly. That single layer of indirection lets the workload manifest stay the same when the disk backend changes.
Static provisioning: the simplest model
Before dynamic provisioning, walk through the simplest model — an admin creates a PV by hand and a PVC binds to it. This shape isn’t used much in production, but it’s the best starting point to understand the matching rules between PV and PVC.
On local clusters like minikube or kind, hostPath — which borrows a path on the node’s host filesystem as the PV’s backing storage — is common. Not appropriate for production but plenty for learning.
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-local-1g
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual
  hostPath:
    path: /mnt/data/pv-local-1g
```
spec.capacity.storage is the disk size, spec.accessModes is how many can mount in what mode (covered shortly), and persistentVolumeReclaimPolicy is what to do with the PV when its PVC disappears. storageClassName: manual is a marker meaning “this PV was created by hand.” No StorageClass object named manual has to exist; the name is only used in matching against PVCs.
The PVC to pair with it:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: manual
```
The rules a PVC uses to pick a PV are straightforward. All three must match for binding:
- storageClassName matches: both sides above are manual. A PVC’s storageClassName: "" (empty string) means “PVs without an SC only”; omitting the field means “follow the cluster’s default SC.”
- PV satisfies the PVC’s accessModes: PVC requests RWO and PV supports both RWO/RWX → OK; the reverse (PV only RWO, PVC requests RWX) doesn’t match.
- PV’s capacity is at least the PVC’s request: PVC requests 1Gi and PV has 1Gi or more → match. A large PV bound to a small PVC wastes the difference.
```
kubectl apply -f pv-static.yaml
kubectl apply -f pvc-static.yaml
kubectl get pv,pvc
```
```
NAME                           CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM          STORAGECLASS   AGE
persistentvolume/pv-local-1g   1Gi        RWO            Retain           Bound    default/data   manual         10s

NAME                         STATUS   VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/data   Bound    pv-local-1g   1Gi        RWO            manual         5s
```
Both PV and PVC show STATUS: Bound and hold each other’s name in the CLAIM / VOLUME columns. This is the normal shape. Mount this PVC in a Pod and inside the container it looks like a regular directory, but it’s actually mapped to /mnt/data/pv-local-1g on the node.
1:1 explicit binding: claimRef / volumeName
The matching above is K8s automatically finding a PV by storageClassName + accessModes + capacity. To pin a specific PV to a specific PVC explicitly:
- Write the PVC’s namespace + name in the PV’s spec.claimRef
- Write the PV’s name in the PVC’s spec.volumeName
When both are present, K8s ignores other matching candidates and binds the 1:1 directly. Rare in production but shows up in migration or disk recovery scenarios.
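A minimal sketch of that pinning, assuming a local learning cluster. The object names pv-restore and restored-data and the hostPath are hypothetical.

```yaml
# Hypothetical names; hostPath is only suitable for local clusters.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-restore
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual
  claimRef:                  # reserve this PV for one specific PVC
    namespace: default
    name: restored-data
  hostPath:
    path: /mnt/data/pv-restore
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: manual
  volumeName: pv-restore     # ask for this PV by name
```

The storageClassName, accessModes, and capacity still have to be compatible; the two fields only remove the ambiguity about which PV and which PVC end up paired.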
accessModes: who can mount how
accessModes expresses the disk’s concurrent mount capability. There are four:
| Mode | Short | Meaning |
|---|---|---|
| ReadWriteOnce | RWO | Mount read-write on one node. Multiple Pods on the same node can mount together |
| ReadOnlyMany | ROX | Mount read-only across multiple nodes simultaneously |
| ReadWriteMany | RWX | Mount read-write across multiple nodes simultaneously |
| ReadWriteOncePod | RWOP | Mount read-write from exactly one Pod across the entire cluster (alpha in 1.22, stable since 1.29) |
In production, RWO and RWX are the two you see most often. Which mode is possible depends on the type of backend disk:
| Backend | Supported modes | Note |
|---|---|---|
| AWS EBS | RWO | Attaches to one node in one AZ |
| GCP Persistent Disk | RWO, ROX | Read-write attach is to one node; read-only attach can span multiple nodes |
| Azure Disk | RWO | Single node |
| AWS EFS / GCP Filestore / Azure Files | RWX | NFS-based file storage |
| On-prem NFS | ROX, RWX | File storage |
| Ceph RBD (block) | RWO | Block |
| CephFS / GlusterFS | RWX | File |
In one line — block storage is RWO, file storage can go up to RWX. Workloads that need RWX (multiple Pods sharing one directory, e.g., WordPress’s uploads dir, a shared cache) need an NFS-class backend. PVCs requesting RWX against block disks will never bind.
ReadWriteOncePod: preventing DB split-brain
ReadWriteOncePod, introduced as alpha in 1.22 and stable since 1.29, is a tighter constraint than RWO. RWO is “multiple Pods on one node can mount together”; RWOP is “exactly one Pod across the entire cluster.” It is used as a safety net for workloads like databases where two processes touching the same data files concurrently corrupt them. It also blocks the accident where a second Pod in the same namespace, scheduled to the same node, mounts the same PVC. RWOP is only supported for CSI volumes.
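Requesting this mode is a one-line change in the PVC. The sketch below uses a hypothetical claim name, db-data, and assumes a CSI-backed SC called fast-ssd exists in the cluster.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data              # hypothetical name
  namespace: default
spec:
  accessModes:
    - ReadWriteOncePod       # at most one Pod in the cluster may use this claim
  resources:
    requests:
      storage: 5Gi
  storageClassName: fast-ssd # assumption: a CSI-backed SC with this name exists
```

A second Pod that references the same claim is not started alongside the first; it waits until the first Pod is gone.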
StorageClass and dynamic provisioning
Static provisioning has an obvious overhead — the cluster admin has to create PVs by hand in advance. Every time a new disk is needed, the human creates a disk in the cloud console, writes a PV manifest, and applies it. In production clusters, this shape becomes a bottleneck.
StorageClass fills that gap. Create an SC once up front, and when a PVC referencing that SC arrives, K8s’s provisioner automatically creates the disk, registers it as a PV, and binds it to the PVC. The human is removed from the PV step.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```
A PVC referencing this SC becomes lighter:
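```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: fast-ssd
```
Apply this PVC and the following happens automatically (because this SC sets WaitForFirstConsumer, the disk is created only once a Pod that mounts the PVC has been scheduled):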
- K8s reads the PVC’s storageClassName and finds the fast-ssd SC.
- The SC’s provisioner (ebs.csi.aws.com) is called and a new 5Gi gp3 EBS volume is created.
- That EBS volume is registered as a PV object automatically.
- The new PV becomes Bound to the PVC.
A short table of provisioners by environment:
| Environment | provisioner | Default SC |
|---|---|---|
| AWS EKS | ebs.csi.aws.com (block), efs.csi.aws.com (file) | gp2 EBS (not marked default on EKS 1.30+) |
| GCP GKE | pd.csi.storage.gke.io (block), filestore.csi.storage.gke.io (file) | balanced PD |
| Azure AKS | disk.csi.azure.com, file.csi.azure.com | Standard SSD |
| minikube | k8s.io/minikube-hostpath | hostPath |
| kind | rancher.io/local-path | hostPath |
| On-prem | NFS subdir, Ceph RBD/CSI, Longhorn, etc. | varies by environment |
Production clusters usually have one default SC designated, and PVCs that omit storageClassName fall back to it. Which SC is the default is marked by the annotation storageclass.kubernetes.io/is-default-class: "true".
```
kubectl get sc
```
```
NAME                 PROVISIONER       RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
fast-ssd (default)   ebs.csi.aws.com   Retain          WaitForFirstConsumer   true                   10d
slow-hdd             ebs.csi.aws.com   Delete          WaitForFirstConsumer   true                   10d
```
The SC marked (default) is the default. If two SCs in a cluster are both marked default, new PVC behavior gets ambiguous, so the standard ops practice is to keep only one as default.
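For reference, a sketch of how the default marker looks in a manifest, together with a PVC that relies on it. The PVC name logs is hypothetical, and the SC mirrors the fast-ssd example with a trimmed parameter list.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
  annotations:
    # Marks this SC as the cluster default.
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: logs                 # hypothetical name
  namespace: default
spec:
  # storageClassName is omitted on purpose, so the default SC is used.
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```

Writing storageClassName: "" instead of omitting the field would opt out of the default and wait for a class-less PV.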
Four key fields in StorageClass
The four SC manifest fields you’ll touch most often:
provisioner
Which CSI driver creates the disk. As shown in the table above, the value is fixed per cluster environment. CSI (Container Storage Interface) is the standard interface K8s uses to talk to external storage drivers. It has been stable since 1.13, and since then the in-tree drivers (legacy ones bundled in K8s itself) have been migrated to external CSI drivers. Provisioners you touch in new clusters are almost all *.csi.*.
reclaimPolicy
The policy for what to do with a PV (and the actual disk behind it) when the PVC bound to it disappears. Two practical choices:
| Value | Behavior |
|---|---|
| Delete | Delete the PV and the cloud disk along with the PVC. Default for cloud dynamic provisioning |
| Retain | PV stays in Released state, cloud disk preserved. A human has to clean up intentionally to reuse |
You may see Recycle in older docs, but it’s deprecated — don’t use it in new manifests.
Delete is convenient but dangerous. In a production cluster, if someone accidentally deletes a PVC, the disk goes with it — unrecoverable. So the production pattern is often to freeze the SC’s reclaimPolicy to Retain. Even when the PVC is deleted, the PV stays in Released state with the data preserved. When cleanup is needed, a human deletes the PV explicitly, or rebinds it to another PVC to recover the data.
```
NAME     CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM          AGE
pv-...   5Gi        RWO            Retain           Released   default/data   30m
```
Released means the PVC binding has been severed but the disk is still alive. If you don’t plan to reuse this PV, run kubectl delete pv ... to clean up explicitly. Note that the cloud disk is not auto-deleted, so you’ll need to delete it from the cloud console to stop being billed for it.
volumeBindingMode
The policy for when the PV (and disk) is created after a PVC is created.
| Value | Behavior |
|---|---|
| Immediate | Create the disk as soon as the PVC is created |
| WaitForFirstConsumer | Create the disk after the Pod that mounts the PVC is scheduled to a node |
Immediate is simple but causes incidents in multi-AZ environments. AWS EBS, for example, is a disk pinned to one AZ. With Immediate, a disk gets created in ap-northeast-2a, then the Pod that mounts it might land on a node in ap-northeast-2c — that Pod will never be able to mount the disk on that node. The manifest looks fine but the Pod gets stuck Pending.
WaitForFirstConsumer blocks that incident at the source. The PVC is created but the disk waits — when the Pod that mounts the PVC appears, the scheduler decides which node (which AZ) it lands on, and only then is the disk created in that AZ. The safe default for production, and the SC manifest above sets it. Single-AZ clusters can use Immediate without much pain, but multi-AZ environments practically require WaitForFirstConsumer.
allowVolumeExpansion
Setting this to true lets you grow the disk later by raising the PVC’s spec.resources.requests.storage. The default is false, and you don’t usually flip an SC back to false once it’s been set to true — almost always set it to true when first creating the SC. Disks are awkward to shrink at runtime but commonly need to grow. Detailed behavior is in a later section.
Mounting a PVC in a Pod
Once a PVC is Bound, a Pod mounts it. The Pod manifest shape is identical across any workload (Pod / Deployment / StatefulSet) — reference the PVC in spec.volumes and attach it to a path inside the container with spec.containers[].volumeMounts.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          volumeMounts:
            - name: html
              mountPath: /usr/share/nginx/html
      volumes:
        - name: html
          persistentVolumeClaim:
            claimName: data
```
spec.volumes[].persistentVolumeClaim.claimName points to the PVC data we made above, and the volume mounts at /usr/share/nginx/html inside the container. From the container’s perspective, that path is just a regular directory, and files written there survive Pod death and restart.
When multiple Pods mount one PVC
What happens if you raise the Deployment above to replicas: 2? If the PVC data is RWO with a block backend like EBS, simultaneous mount only works when both Pods land on the same node. The second Pod landing on a different node fails to mount because the disk is already attached elsewhere, and gets stuck in Pending or ContainerCreating.
Two ways to avoid that:
- Move the workload to an RWX backend (NFS / EFS, etc.): make an accessModes: ReadWriteMany PVC so multiple Pods can share it
- Move the workload to a StatefulSet so each Pod uses its own PVC: each Pod owns a disk, which is the next section’s volumeClaimTemplates
Stateless web servers without disk-backed data don’t need a PVC at all, so this incident doesn’t apply. The moment you need to mount a PVC is when you have to pick one of these two paths.
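The first path might look like the sketch below. It assumes an SC named shared-files backed by a file-storage CSI driver (EFS, NFS, or similar); that SC name and the PVC name shared-html are hypothetical.

```yaml
# Assumption: an SC named shared-files exists and its backend supports RWX.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-html
  namespace: default
spec:
  accessModes:
    - ReadWriteMany          # Pods on different nodes may mount read-write
  resources:
    requests:
      storage: 5Gi
  storageClassName: shared-files
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2                # both replicas share the one PVC
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          volumeMounts:
            - name: html
              mountPath: /usr/share/nginx/html
      volumes:
        - name: html
          persistentVolumeClaim:
            claimName: shared-html
```

The Deployment is the same shape as before; only the PVC it references changed, which is the indirection the PVC layer exists to provide.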
Revisiting StatefulSet’s volumeClaimTemplates
Open up the StatefulSet manifest from #1 again:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 1Gi
```
volumeClaimTemplates is precisely a template that auto-generates one PVC per Pod. Apply this manifest with replicas: 3 and K8s does:
- When creating Pod web-0, auto-create PVC data-web-0.
- When creating Pod web-1, auto-create PVC data-web-1.
- When creating Pod web-2, auto-create PVC data-web-2.
The PVC name rule is <volumeClaimTemplates.metadata.name>-<statefulset.metadata.name>-<ordinal>. In the example, the template name is data and the StatefulSet name is web, giving data-web-0, data-web-1, data-web-2.
Each PVC goes through the dynamic provisioning of the SC named fast-ssd and maps to a PV. On AWS, three new EBS volumes are created and 1:1 attached to one Pod each. When a Pod dies and comes back, the same index (e.g., web-0) re-mounts the same PVC (data-web-0), so the data is still there.
This volumeClaimTemplates + WaitForFirstConsumer SC combo runs cleanly in the multi-AZ environment from #1 too — the EBS volume is created in whatever AZ the Pod was scheduled to, blocking the AZ mismatch incident at the source.
PVC retention on scale-down is separate from reclaimPolicy
The “PVCs survive scale-down” behavior pinned in #1 is easy to confuse with the PV’s reclaimPolicy. They’re policies at different layers:
- PVC retention on StatefulSet scale-down: the StatefulSet controller’s behavior. Lowering replicas doesn’t delete the PVC objects. Since 1.27, where the feature is beta and enabled by default, you can change this with spec.persistentVolumeClaimRetentionPolicy.
- PV (and disk) handling when a PVC is deleted: the SC’s reclaimPolicy. This decides what happens to the PV (and the disk behind it) when the PVC actually disappears.
Scale-down doesn’t auto-delete the PVC, so reclaimPolicy doesn’t fire; only when a human explicitly deletes the PVC does the SC’s reclaimPolicy engage. These two layers form a combined safety net that keeps data alive through ops accidents.
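As a sketch of the StatefulSet-side knob, the manifest below sets spec.persistentVolumeClaimRetentionPolicy on a StatefulSet shaped like the one from the previous section. It assumes a cluster version where the field is available; the chosen values are one possible combination, not a recommendation.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain     # keep PVCs when replicas is lowered (the default)
    whenDeleted: Delete    # delete PVCs when the StatefulSet itself is deleted
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 1Gi
```

Even with whenDeleted: Delete, what happens to the disk is still decided by the SC’s reclaimPolicy once the PVC is gone; with Retain on the SC the PV stays in Released state.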
PVC expansion: allowVolumeExpansion
A PVC’s disk often runs out at runtime. With an SC set to allowVolumeExpansion: true, you can grow the disk by raising the PVC’s spec.resources.requests.storage.
```
kubectl edit pvc data
```
```yaml
spec:
  resources:
    requests:
      storage: 10Gi   # 5Gi -> 10Gi
```
K8s automatically progresses these steps:
- Asks the CSI driver to expand the cloud disk (e.g., EBS volume modification).
- The cloud disk grows to the new size.
- On the node where the volume is mounted, the filesystem is grown (xfs_growfs / resize2fs).
Most CSI drivers do all three without restarting the Pod (online expansion). Some environment / filesystem combinations require a Pod restart, in which case the PVC status shows a condition like FileSystemResizePending. Restarting the Pod (e.g., a Deployment rolling update, or deleting one Pod in a StatefulSet) finalizes the step.
Disk shrinking is not supported by K8s. Lowering a PVC’s storage value is rejected. If you want to shrink, the only path is to create a smaller PVC, copy the data over, and clean up the old PVC.
Where backups and snapshots fit
Sitting one layer above the PVC and PV model are the VolumeSnapshot and VolumeSnapshotClass objects. They express cloud disk snapshot capability as K8s manifests, and require CSI driver snapshot support (the AWS EBS / GCP PD / Azure Disk CSI drivers all support it).
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snap-2026-05-09
spec:
  volumeSnapshotClassName: ebs-snap
  source:
    persistentVolumeClaimName: data
```
Create this object and the CSI driver takes a snapshot of the cloud disk and registers its handle as a VolumeSnapshotContent inside K8s. Restoring a new PVC from that snapshot later is also via manifest (point the new PVC’s spec.dataSource at the snapshot). The deeper details belong in the K8s in Practice track, but knowing how backup/restore stacks on top of the persistent data model is enough at this level.
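A sketch of the two pieces around that snapshot: the VolumeSnapshotClass it refers to, and a PVC restored from it. The PVC name data-restored, the deletionPolicy choice, and the 5Gi size are assumptions for illustration.

```yaml
# The class the snapshot above refers to. deletionPolicy is an assumed choice.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snap
driver: ebs.csi.aws.com
deletionPolicy: Delete
---
# A new PVC whose initial contents come from the snapshot.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-restored        # hypothetical name
  namespace: default
spec:
  storageClassName: fast-ssd
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: data-snap-2026-05-09
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi           # assumed; must be at least the snapshot's size
```

The VolumeSnapshot and the restored PVC have to live in the same namespace, since dataSource references the snapshot by name only.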
In production, data backup usually goes one of two paths:
- Tools on top of K8s VolumeSnapshot: K8s-native backup tools like Velero or Kasten K10 manage VolumeSnapshots in bundles.
- App-level dumps: for DBs, app-level dumps like pg_dump / mysqldump / Redis RDB shipped to separate storage (S3, etc.). Stronger consistency than disk snapshots.
Summary
What this post pinned down:
- Container filesystems are ephemeral: they disappear with the Pod. Data like DBs, uploads, and metrics has to be split off to disks outside.
- Three-object separation: PV (the disk itself, cluster-scoped), PVC (the request, namespace-scoped), StorageClass (the blueprint for how to make a PV, cluster-scoped). App developers only write the PVC; the SC and provisioner fill in the rest.
- Static vs dynamic provisioning: static is humans creating PVs in advance and PVCs matching them; dynamic is the SC creating a PV automatically when a PVC arrives. Production standard is dynamic.
- accessModes: block (EBS / PD / Azure Disk) is RWO; file (EFS / Filestore / NFS) goes up to RWX. ReadWriteOncePod (alpha in 1.22, stable since 1.29) is exactly one Pod across the cluster.
- Key SC fields: provisioner (which CSI), reclaimPolicy (Delete / Retain; Retain is safer for ops), volumeBindingMode (WaitForFirstConsumer is the safe multi-AZ default), allowVolumeExpansion (set to true from the start, recommended).
- Pod mounting: reference the PVC in spec.volumes, attach to a path with spec.containers[].volumeMounts. Multiple Pods using one RWO PVC causes node mismatch incidents; fix with RWX or StatefulSet.
- StatefulSet’s volumeClaimTemplates: a template that auto-creates one PVC per Pod. PVC name is <template>-<sts>-<ordinal>. Scale-down PVC retention (StatefulSet policy) and disk handling on PVC deletion (SC reclaimPolicy) are two separate layers.
- Expansion and backup: grow disks via PVC storage on an allowVolumeExpansion: true SC; shrink not supported. Backup / snapshot is the place for VolumeSnapshot plus tools like Velero.
Once this model is in hand, whenever you encounter PV / PVC / SC objects in a cluster’s manifest directory, you can read at a glance who creates what and how the pieces bind together.
Next: Ingress and Ingress Controller
What we covered through this post was the model of how data inside Pods survives within the cluster. The next post pivots the perspective to outside the cluster — how external traffic gets into the cluster’s Services.
Basics #5 noted that LoadBalancer is the standard for external entry among Service’s three types (ClusterIP / NodePort / LoadBalancer), but if a single cluster needs dozens of externally exposed Services, spinning up that many LoadBalancers becomes a burden in both cost and management. Routing requirements like “by domain” or “by path” can’t be solved at the LoadBalancer layer alone either.
#3 Ingress and Ingress Controller: the external entry point follows the model of Ingress, the object that concentrates that burden in one place, and the Ingress Controller (nginx / Traefik / GKE Ingress, etc.) that turns that manifest into actual traffic routing. It covers HTTP / HTTPS routing, TLS termination, virtual hosts, and path-based routing as the shape of one manifest.