Certified Kubernetes Application Developer (CKAD) #15 SecurityContext and Capabilities: runAsUser, fsGroup, readOnly rootfs

In #14 ServiceAccount and RBAC we restricted what a container can do to the Kubernetes API. This time we go one layer deeper and restrict what privileges the container process itself runs with on Linux. Left at their defaults, many images run as root with free rein to write anywhere on the filesystem. If an attacker takes over such a container, they inherit exactly those privileges.

securityContext is the field that declaratively controls which UID/GID a container runs as, whether it can write to the root filesystem, and which Linux kernel capabilities it holds. CKAD tests it directly with tasks like “run this container as non-root”, “add only the NET_ADMIN capability”, and “make the root filesystem read-only”. Because the grading script inspects the result through id or manifest fields, knowing the exact field location and spelling is what earns the points.

securityContext attaches at two levels

securityContext can be declared in two places: at the Pod level and at the container level. The two differ in where they sit in the manifest and in the scope they apply to.

| Location | Path | Scope |
| --- | --- | --- |
| Pod level | spec.securityContext | The default for every container in the Pod |
| Container level | spec.containers[].securityContext | Applies only to that container |

When the same field exists in both places, the container level overrides the Pod level. In other words, you lay down a common policy at the Pod level and grant exceptions for specific containers at the container level.

apiVersion: v1
kind: Pod
metadata:
  name: ctx-demo
spec:
  securityContext:          # Pod level: default for all containers
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "sleep 3600"]
      securityContext:      # Container level: overrides for this container only
        runAsUser: 2000

In the example above the process in the app container runs as UID 2000, since the container level takes precedence. Had there been another container in the same Pod, it would follow the Pod-level value of UID 1000. Meanwhile, a field that has no container-level counterpart (fsGroup is Pod-level only) applies straight from the Pod level.
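
To make the fallback concrete, here is a sketch of the same idea with a second container that sets no securityContext of its own. The Pod name ctx-demo-2 and the container name sidecar are made up for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ctx-demo-2            # hypothetical name
spec:
  securityContext:            # Pod level: default for all containers
    runAsUser: 1000
    runAsGroup: 3000
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "id && sleep 3600"]
      securityContext:
        runAsUser: 2000       # overrides the Pod-level UID; GID 3000 is still inherited
    - name: sidecar           # hypothetical second container
      image: busybox:1.36
      command: ["sh", "-c", "id && sleep 3600"]
      # no container-level securityContext: falls back to UID 1000, GID 3000
```

Each container prints its identity at startup, so k logs ctx-demo-2 -c app should report uid=2000 while k logs ctx-demo-2 -c sidecar should report uid=1000, both with gid=3000.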

There is one more distinction worth noting. Some fields, like capabilities and readOnlyRootFilesystem, exist only at the container level, while others, like fsGroup and supplementalGroups, exist only at the Pod level. Fields such as runAsUser, runAsGroup, and runAsNonRoot can be set at either level. When you can’t remember which field belongs to which level, check it instantly with kubectl explain.

k explain pod.spec.securityContext
k explain pod.spec.containers.securityContext

Core fields: which user to run as

The most frequently tested bundle is controlling the execution user.

| Field | Level | Meaning |
| --- | --- | --- |
| runAsUser | Both | Sets the process UID |
| runAsGroup | Both | Sets the process’s primary GID |
| runAsNonRoot | Both | If true, refuses to run as root (UID 0) |
| fsGroup | Pod | The owning group GID for mounted volumes |
| supplementalGroups | Pod | A list of extra supplementary GIDs granted to the process |

runAsUser / runAsGroup / runAsNonRoot

runAsUser sets the UID the container’s main process starts as. runAsNonRoot: true goes a step further: if the image is built to start as root, it fails the container outright instead of starting it. It’s the surest line of defense when you want to enforce non-root execution.

apiVersion: v1
kind: Pod
metadata:
  name: nonroot-demo
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "id && sleep 3600"]
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 3000

If you set only runAsNonRoot: true and leave runAsUser empty, the container fails to come up with CreateContainerConfigError when the image’s default USER is root. Specifying a non-root UID alongside it is the safer choice.
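
The failure case can be reproduced with a sketch like the following. The Pod name nonroot-fail is hypothetical, and nginx:1.27 is used only because that image starts as root by default.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-fail          # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.27       # this image starts as root unless told otherwise
      securityContext:
        runAsNonRoot: true    # no runAsUser given, so the image's root user is rejected
```

k get pod nonroot-fail should show CreateContainerConfigError in the STATUS column, and k describe pod nonroot-fail should carry an event saying, in effect, that the container has runAsNonRoot set while the image will run as root.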

fsGroup: the volume’s owning group

Running as non-root often hits the problem of not being able to write to a mounted volume. When you set fsGroup, Kubernetes changes the volume’s group ownership to that GID at mount time and grants the process that GID as a supplementary group. As a result, even a non-root process can write to the volume.

apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-demo
spec:
  securityContext:
    runAsUser: 1000
    fsGroup: 2000          # group ownership of /data changes to 2000
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "touch /data/test && ls -l /data && sleep 3600"]
      volumeMounts:
        - name: scratch
          mountPath: /data
  volumes:
    - name: scratch
      emptyDir: {}

When this Pod comes up, the group of /data is set to 2000, so the process running as UID 1000 succeeds at touch. supplementalGroups is similar, but it doesn’t change volume ownership — it only adds supplementary GIDs to the process.
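
The difference between the two fields is easier to see side by side. The sketch below assumes a hypothetical Pod name (supgroups-demo) and arbitrary GIDs 4000 and 5000 chosen only for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: supgroups-demo        # hypothetical name
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000                     # changes volume group ownership and joins GID 2000
    supplementalGroups: [4000, 5000]  # only joins these GIDs; no ownership change
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "id && ls -ld /data && sleep 3600"]
      volumeMounts:
        - name: scratch
          mountPath: /data
  volumes:
    - name: scratch
      emptyDir: {}
```

In k logs supgroups-demo, the groups list from id should include 2000, 4000, and 5000, while ls -ld /data should show group 2000 only, since supplementalGroups never touches volume ownership.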

Controlling filesystem permissions

readOnlyRootFilesystem

Making the container root filesystem read-only makes it hard for an attacker who gets in to plant binaries or tamper with configuration. readOnlyRootFilesystem: true is a container-level field.

securityContext:
  readOnlyRootFilesystem: true

The catch is that many apps need to write to /tmp or a cache directory to function correctly. In that case, keep the root filesystem read-only but work around it by mounting an emptyDir only on the paths that need to be writable.

apiVersion: v1
kind: Pod
metadata:
  name: ro-rootfs
spec:
  containers:
    - name: app
      image: nginx:1.27
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /var/cache/nginx
        - name: run
          mountPath: /var/run
  volumes:
    - name: tmp
      emptyDir: {}
    - name: cache
      emptyDir: {}
    - name: run
      emptyDir: {}

The root is read-only, but /tmp, /var/cache/nginx, and /var/run are emptyDir volumes and therefore writable. This pattern shows up in tasks phrased as “make the root filesystem read-only while keeping the app working”.

allowPrivilegeEscalation

allowPrivilegeEscalation: false blocks a process from gaining privileges higher than its own (for example, running a setuid binary). Paired with non-root execution, it closes off one more privilege-escalation path. It’s a container-level field.

securityContext:
  allowPrivilegeEscalation: false
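
For placement in a full manifest, here is a minimal sketch that pairs the field with non-root execution. The Pod name no-escalation is hypothetical. The grep command is only a convenience for checking the no_new_privs flag that this setting controls.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: no-escalation         # hypothetical name
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "grep NoNewPrivs /proc/self/status && sleep 3600"]
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        allowPrivilegeEscalation: false   # sets no_new_privs on the container process
```

On kernels that expose the field, k logs no-escalation should show a NoNewPrivs value of 1. Keep in mind that Kubernetes treats allowPrivilegeEscalation as always true when the container is privileged or holds CAP_SYS_ADMIN.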

Linux capabilities

Linux manages root’s privileges in finely sliced units called capabilities. The container runtime grants only a subset of capabilities by default, and you can add and drop individual ones with securityContext.capabilities. This field is container-level only.

securityContext:
  capabilities:
    add: ["NET_ADMIN", "SYS_TIME"]
    drop: ["ALL"]

Write the values without the CAP_ prefix (for example, NET_ADMIN rather than CAP_NET_ADMIN). The security best practice is the least-privilege approach: drop everything with drop: ["ALL"], then add back only what you truly need.

| Capability | Example use |
| --- | --- |
| NET_ADMIN | Manipulate network interfaces, routing, iptables |
| SYS_TIME | Change the system clock |
| CHOWN | Change file ownership |
| NET_BIND_SERVICE | Bind to ports below 1024 |

Using drop and add together creates no conflict. Think of it as dropping everything first, then re-attaching only what you explicitly named.

apiVersion: v1
kind: Pod
metadata:
  name: cap-demo
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "sleep 3600"]
      securityContext:
        capabilities:
          drop: ["ALL"]
          add: ["NET_ADMIN"]

The danger of privileged: true

privileged: true grants the container nearly all of the host’s privileges. It enables every capability and even opens up device access, effectively dismantling container isolation. Never enable it unless an exam task or your work explicitly requires it. On CKAD it shows up as “turn off privileged” or “change to least privilege”, rarely as a request to enable it.

securityContext:
  privileged: false   # the default, and the recommended value
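
A “change to least privilege” task might end up looking like the sketch below. The Pod name net-tool is hypothetical, and the choice of NET_ADMIN assumes the workload only needs network administration. A real task will name the capability to keep.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: net-tool              # hypothetical name
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "sleep 3600"]
      securityContext:
        # before: privileged: true
        privileged: false
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
          add: ["NET_ADMIN"]  # assumption: the only privilege the workload needs
```

securityContext on a running Pod cannot be edited in place, so the Pod has to be recreated from the edited file, for example with k replace --force -f followed by the file name.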

seccompProfile

seccompProfile restricts which system calls the container can make. For CKAD it’s enough to know the single line that applies the RuntimeDefault profile; custom profiles and detailed operations are CKS territory.

securityContext:
  seccompProfile:
    type: RuntimeDefault
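
seccompProfile can sit at either level. As a sketch, setting it once at the Pod level covers every container. The Pod name seccomp-demo is hypothetical, and the grep command is just one way to peek at the result.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-demo          # hypothetical name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault    # applies to every container in the Pod
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "grep Seccomp /proc/self/status && sleep 3600"]
```

In k logs seccomp-demo, a Seccomp value of 2 indicates that a filter is active for the process.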

Verifying the result

After applying the manifest, check the actual runtime identity inside the container to see whether it matches the grading criteria.

# Check UID/GID and supplementary groups
k exec ctx-demo -- id

# Check the numeric UID (busybox whoami fails with "unknown uid" when the UID has no /etc/passwd entry)
k exec nonroot-demo -- id -u

# Check volume group ownership
k exec fsgroup-demo -- ls -ld /data

# Check the read-only root (the write attempt is expected to fail with "Read-only file system")
k exec ro-rootfs -c app -- touch /test 2>&1 || echo "read-only confirmed"

When the uid, gid, and groups in the id output match the values you wrote in the manifest, the change has taken effect. If runAsNonRoot: true is set and the container fails to start with CreateContainerConfigError, that’s a sign the image is built to run as root, so specify a non-root UID with runAsUser.

Putting it all together

Here is a manifest that applies non-root execution, a read-only root filesystem, and minimal capabilities all at once. It’s the classic shape of a “hardened Pod” that exams ask for.

apiVersion: v1
kind: Pod
metadata:
  name: hardened
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  containers:
    - name: app
      image: nginx:1.27
      ports:
        - containerPort: 8080   # informational; assumes nginx is configured to listen on 8080
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
          add: ["NET_BIND_SERVICE"]
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /var/cache/nginx
        - name: run
          mountPath: /var/run
  volumes:
    - name: tmp
      emptyDir: {}
    - name: cache
      emptyDir: {}
    - name: run
      emptyDir: {}

This Pod runs as non-root UID 1000, has privilege escalation blocked, and drops all capabilities before re-attaching only NET_BIND_SERVICE, the capability for binding to ports below 1024. The root is read-only, but mounting emptyDir on /tmp, /var/cache/nginx, and /var/run gives nginx the same writable paths as in the ro-rootfs example. The stock nginx:1.27 image listens on port 80, so the declared containerPort 8080 assumes a configuration that listens there.

Exam points

  • securityContext lives in two places — the Pod level (spec.securityContext) and the container level (spec.containers[].securityContext) — and the container level overrides the Pod level.
  • runAsUser, runAsGroup, and runAsNonRoot can be set at either level; capabilities, readOnlyRootFilesystem, allowPrivilegeEscalation, and privileged are container-level only; fsGroup and supplementalGroups are Pod-level only.
  • Write capability values without the CAP_ prefix. The model answer is drop: ["ALL"] then add back only what’s needed.
  • When readOnlyRootFilesystem: true blocks writing, work around it by mounting an emptyDir on the writable path.
  • If you set only runAsNonRoot: true without a non-root UID, a root image may fail to start, so specify runAsUser alongside it.
  • Verify with k exec -- id (whoami works only when the UID has an /etc/passwd entry in the image). Confirm the actual UID/GID with your own eyes before grading.
  • When you’re unsure of a field’s location, check it instantly with k explain pod.spec.securityContext and k explain pod.spec.containers.securityContext.

Wrap-up

What this post locked in:

  • The two levels of securityContext. The Pod level is the common default, the container level is the exception, and the container level takes precedence
  • Controlling the execution user. Enforce non-root with runAsUser, runAsGroup, and runAsNonRoot; secure volume write access with fsGroup
  • Filesystem control. readOnlyRootFilesystem + the emptyDir workaround, allowPrivilegeEscalation: false
  • capabilities. drop: ["ALL"] then minimal add; avoid privileged: true, which dismantles isolation
  • Verification. Confirm the actual identity with k exec -- id and whoami

Next: resource management

Now that we’ve narrowed the container’s privileges, it’s time to control how much resource a container can use.

In #16 Resource Management: requests/limits, QoS class, LimitRange we’ll build it ourselves — how CPU and memory requests and limits affect scheduling and OOM, the QoS class their combination decides (Guaranteed, Burstable, BestEffort), and LimitRange and ResourceQuota for setting per-namespace defaults.
