Certified Kubernetes Security Specialist (CKS) #8: kernel hardening, capabilities, /proc protection

In #6 AppArmor profiles and #7 seccomp profiles we narrowed the system calls a container can make and the files it can reach, at the kernel level. This post takes the same System Hardening domain one step higher: how to avoid handing Pods and containers excessive privileges in the first place. If AppArmor and seccomp are tools that block “what a container can do,” the securityContext settings in this post are tools that strip away “the privileges it holds to begin with.”

One of the most common task types on the CKS exam is finding a Pod running with excessive privileges and fixing it down to least privilege. Removing privileged: true, emptying capabilities with drop ALL and adding back only what’s needed, and lowering a container running as root to an unprivileged user are all regulars. In this post we’ll drill each of those moves one at a time.

Why strip privileges #

Containers share the host kernel. Unlike a virtual machine, the kernel isn’t isolated, so once enough privilege is gathered inside a container, a path opens to escape to the host. This is exactly the point an attacker aims for: get a foot inside a container through a vulnerable application, then use the container’s excessive privileges as a foothold to take over the host and other workloads.

So the core of System Hardening is simple: leave a container only the privileges it strictly needs, and remove everything else. Without a privilege, the attack that abuses it can’t even form. Every setting in this post is a concrete practice of that single sentence.

In CKAD #15 SecurityContext and Capabilities we covered the basics of using securityContext. CKS revisits the same fields from the security angle of shrinking the attack surface. We’ll pin down which attack each setting blocks as we go.

Linux capabilities #

Traditional Unix split privilege into just two levels: root and non-root. Root could do everything, non-root almost nothing. This binary was too coarse: even when a process needed one small privilege, like binding to a port below 1024, you had to run the whole process as root. Linux capabilities carve root’s all-powerful authority into about 40 small units of privilege. Give a process exactly the capabilities it needs, and it can do the job without root.

Dangerous capabilities #

A few capabilities are effectively as powerful as root. These are the ones to be especially wary of on the exam and in practice.

Capability | What it allows | Risk
SYS_ADMIN | Broad admin operations like mounting and namespace manipulation | Effectively root. A frequent route to container escape
NET_ADMIN | Changing network interfaces, routing, firewall rules | Traffic interception, network bypass
NET_RAW | Creating raw sockets | Packet spoofing, ARP spoofing
SYS_PTRACE | Tracing and manipulating other processes’ memory | Compromise of other processes the container can see
SYS_MODULE | Loading and unloading kernel modules | Kernel-level takeover
DAC_OVERRIDE | Bypassing file permission checks | Reading and writing arbitrary files

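Most applications need none of these. Yet container runtimes grant a set of capabilities by default, and in common runtimes such as Docker and containerd that default set typically includes NET_RAW and DAC_OVERRIDE from the list above. So the security baseline is to drop them all and then add back only what’s truly needed.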

drop ALL, then add only what’s needed #

You adjust capabilities with drop and add under securityContext.capabilities. The safest pattern is to empty all capabilities with drop: ["ALL"], then list in add only what the application actually requires.

apiVersion: v1
kind: Pod
metadata:
  name: cap-minimal
spec:
  containers:
    - name: app
      image: nginx:1.27
      securityContext:
        capabilities:
          drop:
            - ALL
          add:
            - NET_BIND_SERVICE

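The example above drops all capabilities, then adds back just NET_BIND_SERVICE, needed for binding to ports below 1024. Whether that alone is enough depends on the image: a server that starts as root and then switches to another user may also need capabilities such as CHOWN, SETUID, and SETGID. Remember that capability names are written without the CAP_ prefix. In the manifest you write NET_BIND_SERVICE, but the kernel docs and tools like capsh and getcap show it with the prefix, as CAP_NET_BIND_SERVICE or cap_net_bind_service.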

To check which capabilities a container is running with, do the following on the node.

# after finding the container PID
grep CapEff /proc/<pid>/status
# decode into human-readable form with capsh
capsh --decode=00000000a80425fb

The exam also has tasks like “remove NET_ADMIN from this container” — stripping a specific capability. In that case, delete it from the add list, or name that capability explicitly in drop.
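
As a minimal sketch of both fixes, assume a hypothetical Pod in which one container originally had NET_ADMIN in its add list and another simply inherits the runtime defaults. The Pod name, container names, and busybox images are placeholders, not taken from a real exam task.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cap-remove-net-admin     # hypothetical name
spec:
  containers:
    # Variant 1: the capability was granted through add, so delete it there.
    - name: app
      image: busybox:1.36        # placeholder image
      command: ["sleep", "3600"]
      securityContext:
        capabilities:
          drop:
            - ALL
          add:
            - NET_BIND_SERVICE   # NET_ADMIN used to be listed here and was removed
    # Variant 2: leave the rest untouched and drop the one capability by name.
    - name: sidecar              # hypothetical second container
      image: busybox:1.36        # placeholder image
      command: ["sleep", "3600"]
      securityContext:
        capabilities:
          drop:
            - NET_ADMIN
```

Variant 1 is the stricter result, because everything not listed in add is gone. Variant 2 only guarantees that the one named capability is absent.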

The danger of privileged containers #

securityContext.privileged: true hands a container nearly all of the host’s privileges at once. Every capability is granted, every host device becomes reachable, and protections like AppArmor and seccomp are no longer applied to the container. It’s effectively as if the container boundary doesn’t exist.

# A setting to avoid at all costs
securityContext:
  privileged: true

Inside a privileged container, you can mount the host’s disk, load kernel modules, or peer into other containers’ processes. In other words, it’s the widest path to container escape. Some system-level workloads (storage drivers, network plugins) do require privileged, but ordinary applications never need it. If you find a privileged Pod on the CKS exam, that itself is the flaw to fix.

The safe default is to set privileged: false explicitly or to leave the field out entirely. Even for a workload that seems to need privileged, adding a few specific capabilities is usually enough.
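
A sketch of that replacement, assuming a hypothetical workload that was marked privileged only because it edits routing rules and so really needs just NET_ADMIN. Which capabilities a real workload needs has to be worked out per application.

```yaml
# Before (over-privileged):
#   securityContext:
#     privileged: true
#
# After: no privileged flag, one explicitly granted capability.
securityContext:
  privileged: false
  capabilities:
    drop:
      - ALL
    add:
      - NET_ADMIN   # assumption: the only capability this workload needs
```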

allowPrivilegeEscalation: false #

allowPrivilegeEscalation decides whether a process inside the container is allowed to gain more privileges than the parent that started it. When the field is left at its default of true, a process can raise its own privileges through a setuid binary or file capabilities.

securityContext:
  allowPrivilegeEscalation: false

Setting this one line to false closes one privilege-escalation path inside the container. Under the hood it sets the Linux kernel’s no_new_privs flag on the container process. It matters most for a container running as non-root: if the image contains a setuid-root binary, a non-root process could otherwise use it to become root. Keep in mind that escalation is always allowed when the container is privileged or holds SYS_ADMIN, so this field only helps once those are gone.

runAsNonRoot and runAsUser #

If a container runs as root (UID 0), a process that escapes the container is much more likely to hold powerful privileges on the host as well. So running a container as a non-root user should be the default.

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 3000

  • runAsNonRoot: true refuses to start the container if the image tries to start as root. Even when the image itself doesn’t guarantee non-root, it’s a safety net where the kubelet checks one more time before the container starts.
  • runAsUser: 1000 directly sets the UID the process runs as. It overrides the image’s default user.
  • runAsGroup sets the primary group ID of the process.

Leaving just runAsNonRoot: true and omitting runAsUser still starts normally, as long as the image declares a numeric non-root UID as its user. If the image names its user by name instead of by number, the kubelet cannot verify that the user is non-root and refuses to start the container. When the image only runs as root, you have to specify a non-root UID with runAsUser. That said, when running as an arbitrary UID, you need to verify that the application can reach that UID’s home directory and files.
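
The two cases can be sketched as follows. Both image names are hypothetical: the first is assumed to declare a numeric non-root user in its image metadata, the second is assumed to default to root.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-variants         # hypothetical name
spec:
  containers:
    # Image already runs as a numeric non-root UID: runAsNonRoot alone is enough.
    - name: built-nonroot
      image: registry.example.com/app-nonroot:1.0   # hypothetical image
      securityContext:
        runAsNonRoot: true
    # Image defaults to root: override the UID, otherwise the start is refused.
    - name: built-root
      image: registry.example.com/app-root:1.0      # hypothetical image
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 3000
```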

readOnlyRootFilesystem #

readOnlyRootFilesystem: true makes the container’s root filesystem read-only. Even if an attacker gets inside the container, they can’t drop a malicious binary onto the root filesystem or tamper with the files already there.

securityContext:
  readOnlyRootFilesystem: true

Most applications need some writable path for logs or temporary files. In that case, keep the root read-only and attach an emptyDir volume only to the paths that need writing.

apiVersion: v1
kind: Pod
metadata:
  name: readonly-root
spec:
  containers:
    - name: app
      image: nginx:1.27
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /var/cache/nginx
        - name: run
          mountPath: /var/run
  volumes:
    - name: tmp
      emptyDir: {}
    - name: cache
      emptyDir: {}
    - name: run
      emptyDir: {}

This way the application runs normally while the root filesystem stays tamper-proof. This pattern is also the foundation of the immutable containers we’ll cover in #18 Container immutability.

Protecting /proc with procMount #

Container runtimes mask sensitive paths under /proc by default. Paths like /proc/kcore (kernel memory), parts of /proc/sys, and /proc/keys can leak host information or change kernel settings if read or written, so they’re hidden or bound read-only by default. The thing that controls this behavior is procMount.

securityContext:
  procMount: Default

procMount takes two values.

  • Default: the runtime masks the sensitive /proc paths and treats them as read-only. The default value and the safe one.
  • Unmasked: lifts all masking, so the container sees the whole of /proc as-is.

Unmasked is a dangerous setting that opens a path to host kernel memory and settings from inside the container. It has legitimate use only in some debugging tools or nested-container environments, and ordinary workloads never need it. If you find procMount: Unmasked on the CKS exam, the right answer is to revert it to Default or remove the field. Note that using Unmasked requires the loosest Pod Security level to be allowed, so the policies in #9 Pod Security Admission can block this as well.

Blocking host namespaces #

At the Pod level there are fields that open the container to share the host’s namespaces. These fields directly tear down the isolation between container and host, so from a security angle they’re the first thing to check.

Field | What happens when it’s on | Risk
hostPID: true | The container sees all of the host’s processes | Tracing and killing other workloads’ processes
hostNetwork: true | The container uses the host’s network stack as-is | Exposure of all node ports, traffic interception
hostIPC: true | The container shares the host’s IPC namespace | Access to shared memory of the host and other containers

# All false is the default and the safe choice
spec:
  hostPID: false
  hostNetwork: false
  hostIPC: false

All three of these fields default to false, so in a safe manifest they don’t appear at all. If you see one set to true, that itself is a signal of a flaw. The exam gives instructions like “keep this Pod from seeing host processes,” which means removing hostPID.

The danger of host path mounts #

A hostPath volume mounts a host filesystem path directly into the container. Convenient, but the risk is large. Mount a path like /, /etc, or /var/run/docker.sock, and from inside the container you get your hands on the host’s configuration, credentials, and even the container socket.

# A very dangerous mount
volumes:
  - name: host-root
    hostPath:
      path: /

In particular, mount docker.sock or the containerd socket and the container can directly drive the host’s container runtime to spin up a new privileged container. That leads straight to host takeover. If you find a dangerous hostPath on the CKS exam, the right answer is to remove it or, when it’s truly needed, narrow the mount scope to the minimal subpath and keep it read-only (readOnly: true) where possible.
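
A narrowed mount might look like the sketch below. It assumes a hypothetical log-reading container that only needs one host directory; the Pod name, container name, and the /var/log/app path are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-narrowed        # hypothetical name
spec:
  containers:
    - name: log-reader           # hypothetical container
      image: busybox:1.36        # placeholder image
      command: ["sleep", "3600"]
      volumeMounts:
        - name: app-logs
          mountPath: /host-logs
          readOnly: true         # the container cannot write back to the host
  volumes:
    - name: app-logs
      hostPath:
        path: /var/log/app       # hypothetical path: one subdirectory instead of /
        type: Directory          # the path must already exist as a directory
```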

Putting it together: a hardened Pod #

Gather all the settings we’ve seen into one manifest and it looks like this. It’s a fitting starting point for running an ordinary web application safely.

apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
spec:
  hostPID: false
  hostNetwork: false
  hostIPC: false
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: nginx:1.27
      securityContext:
        allowPrivilegeEscalation: false
        privileged: false
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: 1000
        procMount: Default
        capabilities:
          drop:
            - ALL
          add:
            - NET_BIND_SERVICE
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir: {}

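This manifest runs as non-root, blocks privilege escalation, empties capabilities, keeps the root filesystem read-only, protects /proc, and shares no host namespaces. Treat it as a template rather than a guaranteed-working deployment: the image has to be able to run as UID 1000 and write only to the mounted paths, and the read-only example earlier shows nginx needing /var/cache/nginx and /var/run in addition to /tmp. Remember that when a Pod-level securityContext and a container-level securityContext overlap, the container level takes precedence.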
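
The precedence rule can be checked with a small sketch like this one. The Pod and container names are hypothetical, and each container simply prints its UID before sleeping.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: context-precedence       # hypothetical name
spec:
  securityContext:
    runAsUser: 1000              # Pod-level default for every container
  containers:
    - name: inherits
      image: busybox:1.36
      command: ["sh", "-c", "id -u && sleep 3600"]   # runs with the Pod-level UID
    - name: overrides
      image: busybox:1.36
      command: ["sh", "-c", "id -u && sleep 3600"]   # runs with its own UID
      securityContext:
        runAsUser: 2000          # container-level value wins for this container
```

Reading each container’s log should show the Pod-level UID for the first container and the overriding UID for the second.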

Exam points #

  • Capabilities: drop ALL, then add. Emptying all capabilities and adding back only what’s needed is the standard. Names are written without the CAP_ prefix.
  • privileged: true is a flaw signal. When you find it, removal is the default. It’s usually replaced by a few specific capabilities.
  • allowPrivilegeEscalation: false. A one-liner that closes a privilege-escalation path; even non-root containers are safer with it specified.
  • runAsNonRoot, runAsUser. Lower the container to non-root. If the image only runs as root, specify the UID with runAsUser.
  • readOnlyRootFilesystem: true + emptyDir. Keep the root read-only and attach emptyDir to the writable paths.
  • procMount: Unmasked is dangerous. When you find it, revert to Default.
  • hostPID, hostNetwork, hostIPC, hostPath. All settings that tear down isolation from the host. If true or a dangerous path shows up, remove it.
  • The exam regular is finding an over-privileged Pod and fixing it to least privilege. You need to train the eye to read a manifest and immediately point at the risky fields.
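
As a practice target for that kind of reading, here is a deliberately over-privileged manifest. The Pod is hypothetical and not meant to be deployed; each comment marks a field flagged in this post and the direction of the fix.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: overprivileged           # hypothetical practice Pod
spec:
  hostPID: true                  # remove: exposes host processes
  hostNetwork: true              # remove: uses the host network stack
  hostIPC: true                  # remove: shares the host IPC namespace
  containers:
    - name: app
      image: busybox:1.36        # placeholder image
      command: ["sleep", "3600"]
      securityContext:
        privileged: true                 # remove or set to false
        allowPrivilegeEscalation: true   # set to false
        runAsUser: 0                     # use a non-root UID and add runAsNonRoot: true
        readOnlyRootFilesystem: false    # set to true, add emptyDir for writable paths
        capabilities:
          add:
            - SYS_ADMIN                  # replace with drop ALL plus a minimal add list
      volumeMounts:
        - name: host-root
          mountPath: /host
  volumes:
    - name: host-root
      hostPath:
        path: /                          # remove, or narrow to a read-only subpath
```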

Wrap-up #

What this post locked in:

  • The core of System Hardening is leaving a container only the privileges it strictly needs. No privilege, no abuse.
  • Linux capabilities carve root’s authority into small units, and the baseline is drop: ["ALL"] then add only what’s needed.
  • privileged, allowPrivilegeEscalation, runAsNonRoot, readOnlyRootFilesystem, and procMount are the core fields that govern a container’s attack surface.
  • hostPID, hostNetwork, hostIPC, and dangerous hostPath mounts directly tear down isolation from the host, so check them first.
  • On the exam, fixing an over-privileged Pod to least privilege is a regular, so practicing spotting the risky fields at a glance translates straight into score.

Next — Pod Security Admission #

So far we’ve fixed each manifest into a hardened state by hand. But you can’t have a person inspect every single Pod entering the cluster one by one. So you need a mechanism that automatically rejects dangerous Pods at the admission stage.

In #9 Pod Security Admission (PSA, Pod Security Standards) we’ll work hands-on through how to enforce the privileged/baseline/restricted policy levels with a single namespace label, at which level the risky fields seen in this post get blocked, and how to combine the enforce/audit/warn modes.
