Certified Kubernetes Security Specialist (CKS) #7: seccomp Profiles

In #6 AppArmor Profiles, we wrote profiles that restrict which files a container can access and which capabilities it can use. The counterpart tool in the same System Hardening domain is seccomp. Where AppArmor looks at files and capabilities, seccomp filters the system calls a container makes to the kernel. In this post we cover the concept of seccomp and its three profile types, how to apply it to a Pod, how to load a custom profile onto the node and reference it, and how to verify that blocking actually works.

What is seccomp #

seccomp (secure computing mode) is a Linux kernel feature that restricts the system calls (syscalls) a process can make. A system call is the only path through which a user-space process asks the kernel to do work. Opening a file, creating a network socket, spawning a new process, loading a kernel module — every one of these happens through a system call. Linux has over 300 system calls, and most containers use only a tiny fraction of them.

The problem is that an attacker who takes over a container can call any of the remaining system calls as well. Calls such as mount, keyctl, unshare, and bpf can become footholds for privilege escalation and container escape. seccomp blocks, ahead of time, the system calls a container has no need for, narrowing the attack surface at the system-call level.

A seccomp profile is a JSON document that sets a default action (defaultAction) and lists the system calls that are exceptions to it. The most common pattern is “block by default, but allow only known-safe system calls.”

The difference from AppArmor #

seccomp and AppArmor are both System Hardening tools, but they block at different layers.

| Item | seccomp | AppArmor |
| --- | --- | --- |
| Target | System calls (syscalls) | File paths, capabilities, network |
| Question | “Should I allow this system call?” | “Should I read or write this file?” |
| Definition location | JSON profile | Text profile (/etc/apparmor.d/) |
| Apply key | securityContext.seccompProfile | annotation or securityContext.appArmorProfile |
| Apply scope | Pod or container | container |

The two are complementary, not competing. Block dangerous system calls with seccomp and restrict file and capability access with AppArmor, and the defenses stack in layers. The exam covers the two tools separately, but in practice it is standard to apply them together.

The three profile types #

Kubernetes’s seccompProfile.type takes three values.

| type | Meaning | Notes |
| --- | --- | --- |
| RuntimeDefault | Apply the default profile the container runtime provides | A sensible default vetted by containerd / CRI-O. Recommended |
| Localhost | Reference a custom profile file loaded onto the node | Specify the file path with localhostProfile |
| Unconfined | No seccomp applied. All system calls allowed | Effectively defenseless. Avoid it |

Make RuntimeDefault your default #

The first principle to memorize: use RuntimeDefault as your default. Runtimes such as containerd and CRI-O ship with a vetted default profile that blocks dangerous system calls a container workload almost never needs, such as mount, reboot, and keyctl, without breaking ordinary applications.

One thing to watch for: unless the node is configured otherwise, a Pod that does not specify seccomp runs as Unconfined, with no system-call restriction at all. To fix this, start the kubelet with the --seccomp-default flag (or set seccompDefault: true in the kubelet configuration file), which automatically applies RuntimeDefault to every Pod that does not specify a profile. On older clusters where the feature was not yet generally available, the SeccompDefault feature gate had to be enabled as well; it has been GA since Kubernetes 1.27. When the exam gives you a task like “enforce a default seccomp on every Pod on the node,” this is the flag to recall.

The securityContext.seccompProfile setting #

A seccomp profile is specified in two places: at the Pod level and at the container level. Placed at the Pod level it applies to all containers; placed at the container level it applies only to that container and overrides the Pod-level setting.

Applying RuntimeDefault to the whole Pod #

apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: nginx:1.27

A seccompProfile placed under spec.securityContext applies RuntimeDefault to every container in this Pod. Most exam tasks end in this form.

Applying at the container level #

apiVersion: v1
kind: Pod
metadata:
  name: mixed-app
spec:
  containers:
  - name: app
    image: nginx:1.27
    securityContext:
      seccompProfile:
        type: RuntimeDefault
  - name: sidecar
    image: busybox:1.36
    command: ["sleep", "3600"]
    securityContext:
      seccompProfile:
        type: Localhost
        localhostProfile: profiles/audit.json

This is an example of using a different profile per container within the same Pod. app references the runtime default, while sidecar references a custom profile loaded onto the node. The container-level setting takes precedence over the Pod-level setting.
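
The precedence rule can be sketched in a few lines of Python. This is only an illustrative model of the rule described above, not kubelet code: the function name effective_seccomp and the sample spec (a Pod-level RuntimeDefault plus a sidecar override) are assumptions made for this example.

```python
from typing import Optional

# Hypothetical Pod spec as a plain dict: a Pod-level RuntimeDefault,
# with the sidecar overriding it at the container level.
POD_SPEC = {
    "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [
        {"name": "app"},
        {
            "name": "sidecar",
            "securityContext": {
                "seccompProfile": {
                    "type": "Localhost",
                    "localhostProfile": "profiles/audit.json",
                }
            },
        },
    ],
}


def effective_seccomp(pod_spec: dict, container_name: str) -> Optional[dict]:
    """Return the seccompProfile that applies to one container.

    The container-level setting wins; otherwise the Pod-level setting is
    used. None means the spec sets nothing at either level.
    """
    pod_level = pod_spec.get("securityContext", {}).get("seccompProfile")
    for container in pod_spec.get("containers", []):
        if container["name"] == container_name:
            own = container.get("securityContext", {}).get("seccompProfile")
            return own if own is not None else pod_level
    raise KeyError(f"no container named {container_name!r}")


for name in ("app", "sidecar"):
    print(name, effective_seccomp(POD_SPEC, name))
```

A None result means the manifest is silent at both levels; whether the container then gets RuntimeDefault or runs Unconfined depends on the kubelet's seccomp-default setting.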

Writing a custom profile #

When the runtime default is not enough, you write a JSON profile yourself. A custom profile must be loaded into a fixed directory on the node to be referenced with the Localhost type.

The profile directory #

The kubelet looks for custom seccomp profiles at the following path on the node.

/var/lib/kubelet/seccomp/

The path you write in localhostProfile is a relative path based on this directory. By convention, profiles are collected under profiles/. For example, if you place a file at the following location,

/var/lib/kubelet/seccomp/profiles/audit.json

the manifest references it as localhostProfile: profiles/audit.json. An absolute path or a path outside the directory is not allowed.
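
As a rough illustration of the path rule, the following sketch maps a localhostProfile value to the file expected on the node. It mirrors the rule as stated here rather than the kubelet's actual validation code, and the function name resolve_localhost_profile is hypothetical.

```python
from pathlib import PurePosixPath

# Directory the kubelet searches, as stated above.
SECCOMP_ROOT = PurePosixPath("/var/lib/kubelet/seccomp")


def resolve_localhost_profile(localhost_profile: str) -> PurePosixPath:
    """Map a localhostProfile value to the file expected on the node."""
    relative = PurePosixPath(localhost_profile)
    if relative.is_absolute():
        raise ValueError("localhostProfile must be a relative path")
    if ".." in relative.parts:
        raise ValueError("localhostProfile must stay inside the seccomp directory")
    return SECCOMP_ROOT / relative


print(resolve_localhost_profile("profiles/audit.json"))
# /var/lib/kubelet/seccomp/profiles/audit.json
```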

The profile JSON structure #

The core of a custom profile is two fields: defaultAction and syscalls.

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "accept4",
        "bind",
        "listen",
        "read",
        "write",
        "close",
        "exit_group"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

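Since defaultAction is SCMP_ACT_ERRNO, every system call not listed is blocked, and calling it returns an error (EPERM by default). Only the system calls listed under names in the syscalls block are allowed with SCMP_ACT_ALLOW. This “block by default + allow explicitly” approach is the safest whitelist pattern. The list here is kept short for illustration; a real workload also needs calls such as execve and mmap just to start, so an allow list is normally built from observed behavior rather than written by hand.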
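
To make the lookup concrete, here is a minimal sketch of how such a profile decides on an action for a given system call name. It is a simplified model for reading profiles, not how the kernel or runtime implements filtering (real filters can also match on arguments and architecture), and the function name action_for is an assumption of this example.

```python
import json

# The profile from this section, trimmed to the fields the sketch needs.
PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["accept4", "bind", "listen", "read", "write", "close", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
""")


def action_for(profile: dict, syscall: str) -> str:
    """Return the action the profile assigns to a syscall name.

    Simplified: only the name is matched; anything unlisted falls
    through to defaultAction.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


for name in ("read", "mkdir", "mount"):
    print(f"{name}: {action_for(PROFILE, name)}")
```

In this model read is allowed, while mkdir and mount fall through to the default and are rejected.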

The main action values are as follows.

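| Action | Behavior |
| --- | --- |
| SCMP_ACT_ERRNO | Block the call. Return an error code |
| SCMP_ACT_ALLOW | Allow the call |
| SCMP_ACT_LOG | Allow but log it (for auditing) |
| SCMP_ACT_KILL | Kill the calling thread on the call (SCMP_ACT_KILL_PROCESS kills the whole process) |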
To build an audit profile, set defaultAction to SCMP_ACT_LOG, observe which system calls the workload actually makes, and then build the allow list from those results.
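
The second half of that workflow, turning observations into an allow list, might look like the sketch below. It assumes the system call names have already been extracted from the node's logs (that step is not shown), and both the function name build_allowlist_profile and the sample observations are hypothetical.

```python
import json
from typing import Iterable


def build_allowlist_profile(observed: Iterable[str]) -> dict:
    """Build a block-by-default profile that allows only observed syscalls."""
    names = sorted(set(observed))  # de-duplicate and keep the output stable
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }


# Hypothetical observations; duplicates are expected in real logs.
observed_syscalls = ["read", "write", "read", "close", "accept4"]
profile = build_allowlist_profile(observed_syscalls)
print(json.dumps(profile, indent=2))
```

A list built this way only covers the code paths exercised during observation, so it is worth running the workload through its normal operations before trusting the result.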

A Pod that references the custom profile #

apiVersion: v1
kind: Pod
metadata:
  name: custom-seccomp
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/audit.json
  containers:
  - name: app
    image: nginx:1.27

Set type to Localhost and write a directory-relative path in localhostProfile. If that file is not on the node where the Pod is scheduled, the container fails to start and the Pod never reaches Running (kubectl get pod typically shows a status such as CreateContainerError). When a custom-profile task comes up in the exam, first check whether the file is loaded at the correct path.

Verification #

Verification means confirming two things: that the profile actually made it into the Pod spec, and that it really blocks system calls at runtime.

Confirming the profile was applied #

kubectl get pod secure-app -o jsonpath='{.spec.securityContext.seccompProfile}'

Confirm that the profile type went into the Pod spec. For the container level, look at .spec.containers[0].securityContext.seccompProfile.

Testing a blocked system call #

In a profile whose defaultAction is SCMP_ACT_ERRNO, deliberately trigger a system call that is not on the allow list and see whether it is blocked. For example, if the profile does not allow mkdir, directory creation should fail.

kubectl exec custom-seccomp -- mkdir /tmp/test
mkdir: can't create directory '/tmp/test': Operation not permitted

Operation not permitted is the signal that the system call was blocked by SCMP_ACT_ERRNO. Conversely, in a profile that allows ordinary operations like RuntimeDefault, a plain command should work normally. Checking both sides — “does what should be blocked get blocked, and does what should run still run” — finishes the task.

Checking whether it fell back to Unconfined #

A frequent mistake is thinking you specified a profile when the container actually runs as Unconfined. If seccompProfile is empty at both the Pod level and the container level and the kubelet’s --seccomp-default is also off, the container comes up with no system-call restriction. If the jsonpath query above returns empty, check the container-level field as well; if both are empty, no profile was set in the manifest, so review it again.
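
Because the spec alone cannot show what the kernel actually enforces, a complementary check is the Seccomp field in /proc/<pid>/status, where 0 means disabled, 1 strict mode, and 2 filter mode. The sketch below parses that field; the function name seccomp_mode and the sample text are assumptions for illustration, and in practice the text would come from reading /proc/1/status inside the container.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> str:
    """Return the seccomp mode reported in a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        # Compare the exact key so that "Seccomp_filters" does not match.
        key, _, value = line.partition(":")
        if key == "Seccomp":
            return SECCOMP_MODES.get(int(value.strip()), "unknown")
    return "unknown"


# Hypothetical excerpt of a status file from a confined container.
SAMPLE_STATUS = "Name:\tnginx\nSeccomp:\t2\nSeccomp_filters:\t1\n"
print(seccomp_mode(SAMPLE_STATUS))  # filter
```

A result of disabled for the container's main process is the runtime-side sign that it is effectively Unconfined.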

Exam points #

  • RuntimeDefault is the recommended default. Most “apply seccomp to the Pod” tasks end with the single line securityContext.seccompProfile.type: RuntimeDefault.
  • A profile is specified in two places, the Pod level (spec.securityContext) and the container level (spec.containers[].securityContext), and the container level takes precedence.
  • Load a custom profile into the node’s /var/lib/kubelet/seccomp/ directory, and write a relative path based on this directory in localhostProfile.
  • The whitelist pattern for a JSON profile is the combination of defaultAction: SCMP_ACT_ERRNO (block by default) + SCMP_ACT_ALLOW for the allowed system calls.
  • Unconfined means no seccomp applied, so it is a value to avoid. Remember that a Pod with no profile specified runs as Unconfined unless the kubelet has --seccomp-default enabled.
  • Node-wide enforcement is handled with the kubelet’s --seccomp-default flag (older clusters also needed the SeccompDefault feature gate enabled).
  • For verification, confirm the profile is in the spec with kubectl get pod -o jsonpath, then run an operation that needs a blocked system call and check that it fails with Operation not permitted.

Wrap-up #

What this post locked in:

  • seccomp is a Linux kernel feature that filters the system calls a container throws at the kernel, narrowing the attack surface.
  • The profile types are three: RuntimeDefault (runtime default, recommended), Localhost (references a custom file on the node), and Unconfined (not applied).
  • You apply it with securityContext.seccompProfile, at both the Pod level and the container level.
  • A custom profile is loaded into /var/lib/kubelet/seccomp/ and referenced with the Localhost type, defining the allow list with defaultAction and syscalls.
  • If seccomp looks at system calls, AppArmor looks at files and capabilities. Stack the two together and the defense grows thicker.

Next — kernel hardening #

We restricted system calls with seccomp and files and capabilities with AppArmor. The last piece of the System Hardening domain is reducing the kernel privileges handed to the container itself.

In #8 kernel hardening, capabilities, /proc protection, we’ll work hands-on through the kernel-level hardening settings: securityContext.capabilities to drop Linux capabilities to the minimum, allowPrivilegeEscalation and privileged to block privilege escalation, and /proc masking and readOnlyRootFilesystem.
