Certified Kubernetes Application Developer (CKAD) #15 SecurityContext and Capabilities: runAsUser, fsGroup, readOnly rootfs
In #14 ServiceAccount and RBAC we restricted what a container can do to the Kubernetes API. This time we go one layer deeper and restrict what privileges the container process itself runs with on Linux. Left at their defaults, many images run as root with free rein to write anywhere on the filesystem. If an attacker takes over such a container, they inherit exactly those privileges.
securityContext is the field that declaratively controls which UID/GID a container runs as, whether it can write to the root filesystem, and which Linux kernel capabilities it holds. CKAD tests it directly with tasks like “run this container as non-root”, “add only the NET_ADMIN capability”, and “make the root filesystem read-only”. Because the grading script inspects the result through id or manifest fields, knowing the exact field location and spelling is what earns the points.
securityContext attaches at two levels #
securityContext can be declared in two places: at the Pod level and at the container level. The two differ in where they sit in the manifest and in which containers they cover.
| Location | Path | Scope |
|---|---|---|
| Pod level | spec.securityContext | The default for every container in the Pod |
| Container level | spec.containers[].securityContext | Applies only to that container |
When the same field exists in both places, the container level overrides the Pod level. In other words, you lay down a common policy at the Pod level and grant exceptions for specific containers at the container level.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ctx-demo
spec:
  securityContext:            # Pod level: default for all containers
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "sleep 3600"]
    securityContext:          # Container level: overrides for this container only
      runAsUser: 2000
```
In the example above the process in the app container runs as UID 2000, since the container level takes precedence. Had there been another container in the same Pod, it would follow the Pod-level value of UID 1000. Meanwhile, a field that has no container-level counterpart (fsGroup is Pod-level only) applies straight from the Pod level.
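The "another container" case can be sketched with a two-container Pod. This is a minimal illustration rather than part of the original example; the Pod name ctx-two and the sidecar container are made up for the purpose.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ctx-two               # hypothetical name, for illustration only
spec:
  securityContext:            # Pod-level defaults
    runAsUser: 1000
    runAsGroup: 3000
  containers:
  - name: app                 # overrides the Pod-level UID
    image: busybox:1.36
    command: ["sh", "-c", "id && sleep 3600"]
    securityContext:
      runAsUser: 2000
  - name: sidecar             # no container-level securityContext, so it inherits UID 1000
    image: busybox:1.36
    command: ["sh", "-c", "id && sleep 3600"]
```

With this applied, `k logs ctx-two -c app` should report uid=2000 and `k logs ctx-two -c sidecar` should report uid=1000. Both should show gid=3000, because runAsGroup is set only at the Pod level and neither container overrides it.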
There is one more distinction worth noting. Some fields, like capabilities and readOnlyRootFilesystem, exist only at the container level, while others, like fsGroup and supplementalGroups, exist only at the Pod level. runAsUser, runAsGroup, and runAsNonRoot are accepted at both. When you can’t remember which field belongs to which level, check it instantly with kubectl explain.
```bash
k explain pod.spec.securityContext
k explain pod.spec.containers.securityContext
```
Core fields: which user to run as #
The most frequently tested bundle is controlling the execution user.
| Field | Level | Meaning |
|---|---|---|
| runAsUser | Both | Sets the process UID |
| runAsGroup | Both | Sets the process’s primary GID |
| runAsNonRoot | Both | If true, refuses to run as root (UID 0) |
| fsGroup | Pod | The owning group GID for mounted volumes |
| supplementalGroups | Pod | A list of extra supplementary GIDs granted to the process |
runAsUser / runAsGroup / runAsNonRoot #
runAsUser sets the UID the container’s main process starts as. runAsNonRoot: true goes a step further: if the image is built to start as root, it fails the container outright instead of starting it. It’s the surest line of defense when you want to enforce non-root execution.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-demo
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "id && sleep 3600"]
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 3000
```
If you set only runAsNonRoot: true and leave runAsUser empty, the container fails to come up with CreateContainerConfigError when the image’s default USER is root. Specifying a non-root UID alongside it is the safer choice.
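To reproduce that failure on purpose, a minimal sketch sets only runAsNonRoot on busybox:1.36, which defines no USER and would therefore start as root. The Pod name nonroot-fail is made up for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-fail          # hypothetical name, for illustration only
spec:
  containers:
  - name: app
    image: busybox:1.36       # no USER in the image, so it would start as UID 0
    command: ["sh", "-c", "id && sleep 3600"]
    securityContext:
      runAsNonRoot: true      # no runAsUser given, so the kubelet refuses to start the container
```

`k get pod nonroot-fail` should show CreateContainerConfigError in the STATUS column, and `k describe pod nonroot-fail` should carry an event explaining that the image would run as root. Adding runAsUser: 1000 next to runAsNonRoot should let the same Pod start.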
fsGroup: the volume’s owning group #
Running as non-root often hits the problem of not being able to write to a mounted volume. When you set fsGroup, Kubernetes changes the volume’s group ownership to that GID at mount time and grants the process that GID as a supplementary group. As a result, even a non-root process can write to the volume.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-demo
spec:
  securityContext:
    runAsUser: 1000
    fsGroup: 2000             # group ownership of /data changes to 2000
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "touch /data/test && ls -l /data && sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /data
  volumes:
  - name: scratch
    emptyDir: {}
```
When this Pod comes up, the group of /data is set to 2000, so the process running as UID 1000 succeeds at touch. supplementalGroups is similar, but it doesn’t change volume ownership; it only adds supplementary GIDs to the process.
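For comparison, here is a minimal sketch of supplementalGroups on its own. The Pod name and the GIDs 4000 and 5000 are arbitrary values chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: supgroups-demo        # hypothetical name, for illustration only
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000, 5000]   # extra GIDs for the process; volume ownership is untouched
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "id && sleep 3600"]
```

In `k logs supgroups-demo`, the id output should list 4000 and 5000 under groups= alongside the primary GID 3000.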
Controlling filesystem permissions #
readOnlyRootFilesystem #
Making the container root filesystem read-only makes it hard for an attacker who gets in to plant binaries or tamper with configuration. readOnlyRootFilesystem: true is a container-level field.
```yaml
securityContext:
  readOnlyRootFilesystem: true
```
The catch is that many apps need to write to /tmp or a cache directory to function correctly. In that case, keep the root filesystem read-only but work around it by mounting an emptyDir only on the paths that need to be writable.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ro-rootfs
spec:
  containers:
  - name: app
    image: nginx:1.27
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    - name: cache
      mountPath: /var/cache/nginx
    - name: run
      mountPath: /var/run
  volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}
  - name: run
    emptyDir: {}
```
The root is read-only, but /tmp, /var/cache/nginx, and /var/run are emptyDir and therefore writable. This pattern shows up in the form of “make the root filesystem read-only while keeping the app working”.
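The same behaviour can be observed without nginx. The following is a small sketch, with a made-up Pod name, in which the container itself tries one write to the root filesystem and one to the emptyDir and logs what happens.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ro-check              # hypothetical name, for illustration only
spec:
  containers:
  - name: app
    image: busybox:1.36
    command:
    - sh
    - -c
    - |
      # The first write targets the read-only root and is expected to fail.
      touch /test   || echo "root filesystem is read-only"
      # The second write targets the emptyDir mount and is expected to succeed.
      touch /tmp/ok && echo "/tmp is writable"
      sleep 3600
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: tmp
    emptyDir: {}
```

`k logs ro-check` should contain both echo messages, preceded by the error that touch prints for /test.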
allowPrivilegeEscalation #
allowPrivilegeEscalation: false blocks a process from gaining privileges higher than its own (for example, running a setuid binary). Paired with non-root execution, it closes off one more privilege-escalation path. It’s a container-level field.
```yaml
securityContext:
  allowPrivilegeEscalation: false
```
Linux capabilities #
Linux manages root’s privileges in finely sliced units called capabilities. The container runtime grants only a subset of capabilities by default, and you can add and drop individual ones with securityContext.capabilities. This field is container-level only.
```yaml
securityContext:
  capabilities:
    add: ["NET_ADMIN", "SYS_TIME"]
    drop: ["ALL"]
```
Write the values without the CAP_ prefix (for example, NET_ADMIN rather than CAP_NET_ADMIN). The security best practice is the least-privilege approach: drop everything with drop: ["ALL"], then add back only what you truly need.
| Capability | Example use |
|---|---|
| NET_ADMIN | Manipulate network interfaces, routing, iptables |
| SYS_TIME | Change the system clock |
| CHOWN | Change file ownership |
| NET_BIND_SERVICE | Bind to ports below 1024 |
Using drop and add together creates no conflict. Think of it as dropping everything first, then re-attaching only what you explicitly named.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cap-demo
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "sleep 3600"]
    securityContext:
      capabilities:
        drop: ["ALL"]
        add: ["NET_ADMIN"]
```
The danger of privileged: true #
privileged: true grants the container nearly all of the host’s privileges. It enables every capability and even opens up device access, effectively dismantling container isolation. Never enable it unless an exam task or your work explicitly requires it. On CKAD it shows up as “turn off privileged” or “change to least privilege”, rarely as a request to enable it.
```yaml
securityContext:
  privileged: false   # the default, and the recommended value
```
seccompProfile #
seccompProfile restricts which system calls the container can make. For CKAD it’s enough to know the single line that applies the RuntimeDefault profile; custom profiles and detailed operations are CKS territory.
```yaml
securityContext:
  seccompProfile:
    type: RuntimeDefault
```
Verifying the result #
After applying the manifest, check the actual runtime identity inside the container to see whether it matches the grading criteria.
```bash
# Check UID/GID and supplementary groups
k exec ctx-demo -- id
# Check the user name (a UID with no /etc/passwd entry has no name, so whoami may report an unknown uid; id -u still prints the number)
k exec nonroot-demo -- whoami
# Check volume group ownership
k exec fsgroup-demo -- ls -ld /data
# Check the read-only root (the write attempt is expected to fail)
k exec ro-rootfs -c app -- touch /test 2>&1 || echo "read-only confirmed"
```
When the uid, gid, and groups in the id output match the values you wrote in the manifest, the change has taken effect. If runAsNonRoot: true is set but the container comes up with CreateContainerConfigError, that’s a sign the image is built to run as root, so specify a non-root UID.
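The id output does not show capabilities or the privilege-escalation setting. One way to inspect those is to read the process status from /proc. This is a sketch under the assumption that the node kernel is recent enough to expose the NoNewPrivs field; the Pod name is made up.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: priv-check            # hypothetical name, for illustration only
spec:
  containers:
  - name: app
    image: busybox:1.36
    # Print the no_new_privs flag and the effective capability bitmask, then idle.
    command: ["sh", "-c", "grep -E 'NoNewPrivs|CapEff' /proc/self/status && sleep 3600"]
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
        add: ["NET_ADMIN"]
```

In `k logs priv-check`, NoNewPrivs should read 1 when allowPrivilegeEscalation is false. CapEff is a hexadecimal bitmask, and it should shrink to a single set bit here compared with a container that keeps the runtime's default capabilities.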
Putting it all together #
Here is a manifest that applies non-root execution, a read-only root filesystem, and minimal capabilities all at once. It’s the classic shape of a “hardened Pod” that exams ask for.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  containers:
  - name: app
    image: nginx:1.27
    ports:
    - containerPort: 8080
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    - name: run
      mountPath: /var/run
  volumes:
  - name: tmp
    emptyDir: {}
  - name: run
    emptyDir: {}
```
This Pod is configured to run as non-root UID 1000, has privilege escalation blocked, and drops all capabilities before re-attaching only NET_BIND_SERVICE, the capability for binding to ports below 1024. That capability only matters if the app actually listens on such a port; the manifest itself declares 8080. The root is read-only, and the emptyDir mounts on /tmp and /var/run give the app the writable paths it needs.
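The stock nginx image normally expects to start as root, so a hardened Pod that should actually serve traffic is often paired with an image built for unprivileged use. The following is only a sketch: the image nginxinc/nginx-unprivileged does not appear in the original post, and its tag, UID 101, listen port 8080, and use of /tmp for the pid file and temp directories are assumptions to verify against the image's own documentation.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-unpriv       # hypothetical name, for illustration only
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 101            # assumed UID of the image's built-in nginx user
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: nginxinc/nginx-unprivileged:1.27   # assumption: image and tag are not from the original post
    ports:
    - containerPort: 8080     # assumed default listen port of this image
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]         # nothing added back, since 8080 is not a privileged port
    volumeMounts:
    - name: tmp
      mountPath: /tmp         # assumed location of the image's pid file and temp directories
  volumes:
  - name: tmp
    emptyDir: {}
```

Because the app listens on an unprivileged port in this variant, NET_BIND_SERVICE is left out and the capability set stays empty.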
Exam points #
- `securityContext` lives in two places: the Pod level (`spec.securityContext`) and the container level (`spec.containers[].securityContext`). The container level overrides the Pod level.
- `runAsUser`, `runAsGroup`, and `runAsNonRoot` can be set at both levels; `capabilities`, `readOnlyRootFilesystem`, `allowPrivilegeEscalation`, and `privileged` are container-level only; `fsGroup` and `supplementalGroups` are Pod-level only.
- Write capability values without the `CAP_` prefix. The model answer is `drop: ["ALL"]` then `add` back only what’s needed.
- When `readOnlyRootFilesystem: true` blocks writing, work around it by mounting an `emptyDir` on the writable path.
- If you set only `runAsNonRoot: true` without a non-root UID, a root image may fail to start, so specify `runAsUser` alongside it.
- Verify with `k exec -- id` and `k exec -- whoami`. Confirm the actual UID/GID with your own eyes before grading.
- When you’re unsure of a field’s location, check it instantly with `k explain pod.spec.securityContext` and `k explain pod.spec.containers.securityContext`.
Wrap-up #
What this post locked in:
- The two levels of securityContext. The Pod level is the common default, the container level is the exception, and the container level takes precedence
- Controlling the execution user. Enforce non-root with `runAsUser`, `runAsGroup`, and `runAsNonRoot`; secure volume write access with `fsGroup`
- Filesystem control. `readOnlyRootFilesystem` + the `emptyDir` workaround, `allowPrivilegeEscalation: false`
- capabilities. `drop: ["ALL"]` then minimal `add`; avoid `privileged: true`, which dismantles isolation
- Verification. Confirm the actual identity with `k exec -- id` and `whoami`
Next: resource management #
Now that we’ve narrowed the container’s privileges, it’s time to control how much resource a container can use.
In #16 Resource Management: requests/limits, QoS class, LimitRange we’ll build it ourselves — how CPU and memory requests and limits affect scheduling and OOM, the QoS class their combination decides (Guaranteed, Burstable, BestEffort), and LimitRange and ResourceQuota for setting per-namespace defaults.