Certified Kubernetes Security Specialist (CKS) #11: Isolation — gVisor, Kata Containers, RuntimeClass
The CKS series is working through the Minimize Microservice Vulnerabilities domain. Where the earlier #9 Pod Security Admission and #10 Secrets management dealt with a Pod’s privileges and secret data, this post goes one step deeper to lay out why the isolation of a container itself is weak and the sandbox runtimes that make up for it.
Containers are lightweight. But the price of that lightness is exactly where the security weakness lies. Because a container shares the host’s kernel directly, an attack that exploits a kernel vulnerability from inside a container can, if it succeeds, spread its damage across the whole host. What fills this gap is sandbox runtimes like gVisor and Kata Containers, and the mechanism Kubernetes uses to choose them is RuntimeClass.
Why container isolation is weak #
The difference between a virtual machine and a container boils down to one line: whether or not they share the kernel. A virtual machine runs its own kernel per guest, and underneath it a hypervisor virtualizes the hardware. A container, by contrast, borrows the host’s kernel as-is and only partitions the visible scope with namespaces and cgroups.
This structure makes containers lightweight and fast, but it makes the security boundary thin. Every system call made inside a container is ultimately handled by the same kernel on the host. That’s why situations like the following are dangerous.
- If a process inside a container exploits a kernel vulnerability, that vulnerability is a vulnerability of the host kernel. It can lead to a full host takeover through a container escape.
- If you run untrusted code (an image pulled from outside, a user workload in a multi-tenant environment) on the same node, a compromise of one container can spread to neighboring containers and to the node.
Restricting system calls with AppArmor and seccomp is a good way to reduce this attack surface, but the premise of sharing the same kernel doesn’t change. The idea behind sandbox runtimes is to wrap the shared kernel in a thicker layer or to separate it entirely.
Several container escapes reported in the past did exploit flaws in the kernel or the container runtime: misconfigured privileged containers, memory-handling bugs in the kernel, attacks that overwrite the runtime binary, and the like. What these attacks have in common is that they use the kernel shared between the container and the host as leverage. So when you run untrusted workloads, a defense that adds one more layer at the kernel boundary itself is worth having.
Two sandbox runtimes #
The sandbox runtimes commonly used in Kubernetes come in two strands. Since their approaches differ, let’s keep their principles distinct.
gVisor (runsc) #
gVisor is a sandbox runtime built by Google. Its core idea is to insert one more kernel, running in user space, between the host kernel and the container. gVisor calls this user-space kernel the Sentry; runsc is the OCI runtime binary that starts it, which is why the handler on the node is named runsc. Instead of passing the system calls the container makes straight to the host kernel, the Sentry intercepts them and handles them itself.
When a container makes a system call, that call goes to gVisor’s user-space kernel first. Because it implements a large portion of the Linux system call interface itself, in many cases it responds without forwarding the call to the host kernel. The calls that actually have to reach the host kernel are passed only through a very limited, narrow channel. As a result, the surface where the container directly touches the host kernel shrinks dramatically.
The price is performance. Since each system call goes through one more layer, performance drops for I/O-heavy workloads or workloads that make many system calls. Also, applications that use system calls or features that gVisor doesn’t implement may run into compatibility issues.
Kata Containers #
Kata Containers takes a different direction. It runs the container inside a lightweight virtual machine. Each Pod (or container) gets its own lightweight VM with a guest kernel inside it, so the isolation boundary becomes the VM level rather than the container level.
This way, even if the kernel is compromised from inside the container, that’s the guest kernel and not the host kernel, so reaching the host requires crossing the hypervisor boundary one more time. You get strong isolation on par with a virtual machine. In exchange, booting a VM increases startup time and memory usage, and the node needs hardware virtualization support (on cloud instances that are themselves VMs, this means nested virtualization).
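Whether a node can host Kata at all can be checked up front. The following is a rough sketch, assuming an x86 Linux node with KVM as the hypervisor backend (Kata can use others); `check_virt` is a hypothetical helper name, not part of Kata.

```shell
# Report whether this x86 Linux node exposes hardware virtualization to KVM.
check_virt() {
  # vmx = Intel VT-x, svm = AMD-V; grep -c prints the number of matching lines.
  # "|| true" keeps the "0" that grep prints when nothing matches.
  flags="$(grep -Ec '(vmx|svm)' /proc/cpuinfo 2>/dev/null || true)"
  if [ -e /dev/kvm ] && [ "${flags:-0}" -gt 0 ]; then
    echo "kvm: available"
  else
    echo "kvm: not available (on a cloud VM, nested virtualization may be disabled)"
  fi
}

check_virt
```

If this reports that KVM is not available, a Kata handler on that node is unlikely to start Pods, regardless of how the RuntimeClass is written.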
Comparing the two approaches #
| Item | gVisor (runsc) | Kata Containers |
|---|---|---|
| Isolation method | Intercept system calls with a user-space kernel | Lightweight VM + guest kernel |
| Isolation strength | Strong (reduced attack surface) | Stronger (VM boundary) |
| Performance cost | Degradation on system calls/I/O | VM startup/memory cost |
| Node requirements | Relatively light | Virtualization support required |
| Where it fits | Low-trust general workloads | Strong multi-tenancy isolation |
RuntimeClass #
Installing a sandbox runtime on a node doesn’t mean every Pod automatically uses that runtime. Kubernetes needs a mechanism to choose which Pod runs with which runtime, and that’s RuntimeClass.
RuntimeClass is a cluster-scoped resource that points to a container runtime configuration. The key field is handler, and this value must match the name of a handler defined in the node’s container runtime (containerd, etc.) configuration. For example, if you’ve registered a runsc handler in containerd, you set the RuntimeClass’s handler to runsc.
There’s one premise here. The runtime must already be installed on the node, and the handler must be registered in the container runtime configuration. RuntimeClass is just a label pointing to that handler — it doesn’t install the runtime itself. On the exam, the handler is usually given already prepared on the node, and the candidate takes on the part of creating the RuntimeClass and connecting it to the Pod.
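For reference, the node-side half of this premise might look like the following. It is a minimal sketch, assuming containerd 1.x with the CRI plugin and gVisor’s containerd shim already installed; `runsc-handler.toml` is a hypothetical scratch file, and containerd 2.x uses a different plugin path, so check the node’s actual `config.toml` before merging anything.

```shell
# Write the handler stanza to a scratch file so it can be reviewed before
# being merged into /etc/containerd/config.toml on the node.
# The last path segment ("runsc") is the handler name that the
# RuntimeClass's handler field must match.
cat > runsc-handler.toml <<'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
EOF

# Print the handler name defined by the stanza (prints: runsc).
sed -n 's/.*containerd\.runtimes\.\([A-Za-z0-9_-]*\)]$/\1/p' runsc-handler.toml
```

After such a stanza is merged into the real configuration, containerd has to be restarted (for example with `sudo systemctl restart containerd`) before the handler becomes usable.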
Building it in YAML #
First, create a RuntimeClass that points to gVisor.
```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
```

`metadata.name` is the name the Pod references, and `handler` is the name of the handler registered in the node runtime. These two are easy to mix up. The Pod references the name (here, `gvisor`), and the node runtime looks for the handler (here, `runsc`).
When you apply the RuntimeClass you created to a Pod, you write the RuntimeClass’s name in the Pod spec’s runtimeClassName.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-nginx
spec:
  runtimeClassName: gvisor
  containers:
  - name: nginx
    image: nginx:1.27
```

The moment you specify `runtimeClassName: gvisor`, this Pod’s containers start on the node through the `runsc` handler, that is, inside a gVisor sandbox. If another Pod leaves `runtimeClassName` empty, it runs with the node’s default runtime, so you can apply the sandbox selectively, only to the workloads that need it.
It’s the same shape when using Kata Containers. If a Kata handler (for example, kata) is registered on the node, you just change the RuntimeClass’s handler to that name.
```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
```

After creating this RuntimeClass, specify `runtimeClassName: kata` on the Pod and that Pod runs inside a lightweight VM. The shape of the RuntimeClass is identical; only the handler it points to changes.
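To see the name/handler split on a live cluster, the chain from Pod to node handler can be followed with two lookups. This is a sketch, assuming the Pod lives in the current namespace; `resolve_handler` is a hypothetical helper name.

```shell
# Follow the chain Pod -> RuntimeClass name -> node handler.
resolve_handler() {
  pod="$1"
  # Step 1: the Pod spec stores the RuntimeClass *name*.
  rc="$(kubectl get pod "$pod" -o jsonpath='{.spec.runtimeClassName}')" || return 1
  if [ -z "$rc" ]; then
    echo "$pod: no runtimeClassName set (node default runtime)"
    return 0
  fi
  # Step 2: the RuntimeClass stores the node-side *handler*.
  handler="$(kubectl get runtimeclass "$rc" -o jsonpath='{.handler}')" || return 1
  echo "$pod: runtimeClassName=$rc handler=$handler"
}

# Usage on a live cluster:
#   resolve_handler sandboxed-nginx
```

With the manifests from this section applied, the Pod lookup should yield `gvisor` and the RuntimeClass lookup `runsc`, which makes it easy to spot a Pod that mistakenly references the handler instead of the name.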
Verifying it applied #
Whether a Pod really runs inside a sandbox runtime becomes apparent when you look at kernel information from inside the Pod. A regular container shares the host kernel, so it sees the same kernel information as the host, but inside gVisor the information is reported by gVisor’s user-space kernel and differs.
```shell
# the node's host kernel info
uname -r
# check the kernel info from inside the Pod
kubectl exec sandboxed-nginx -- uname -r
```

If the Pod runs inside gVisor, `uname -r` shows a kernel version that differs from the host, namely the one that gVisor emulates. Also, when you view the kernel log with `dmesg`, the output differs from a regular container. gVisor doesn’t show the host’s actual kernel ring buffer as-is but emits its own messages, so the `dmesg` output of the two environments differs.

```shell
# check the kernel log from inside the gVisor Pod
kubectl exec sandboxed-nginx -- dmesg
```

If the output of `uname -r` and `dmesg` comes out differently from the host, that’s a sign the Pod is running inside a sandbox runtime. It’s worth making this a reflex as a way to quickly verify whether the RuntimeClass actually applied.
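The comparison can be wrapped in a small helper. This is a sketch, assuming the host value was collected on the node itself (for example over SSH) and the Pod value via `kubectl exec`; `compare_kernels` is a hypothetical name, and the version strings in the last line are illustrative, not captured output.

```shell
# Compare the node's kernel release with the one a Pod reports.
# $1 = output of `uname -r` on the node
# $2 = output of `uname -r` inside the Pod
compare_kernels() {
  if [ "$1" = "$2" ]; then
    echo "same kernel: Pod shares the host kernel (default runtime)"
  else
    echo "different kernel: Pod is likely sandboxed"
  fi
}

# Typical use on a live cluster (NODE is a placeholder for the node's hostname):
#   compare_kernels "$(ssh NODE uname -r)" "$(kubectl exec sandboxed-nginx -- uname -r)"
compare_kernels "6.8.0-45-generic" "4.4.0"   # prints: different kernel: Pod is likely sandboxed
```

A differing kernel release only shows that some sandbox is in play; a Kata Pod also reports a different kernel, because it sees its guest kernel.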
Trade-offs #
Sandbox runtimes aren’t free security. The stronger you make the isolation, the more you trade away performance and compatibility.
- Security vs. performance. gVisor goes through one more layer per system call, so it slows down on I/O-heavy workloads, and Kata incurs VM startup and memory costs. The stronger the isolation, the bigger the cost in general.
- Security vs. compatibility. gVisor doesn’t implement some system calls, so certain applications may not work. Workloads that access node devices directly or use special kernel features may not be a fit for a sandbox.
- Selective application. That’s why in practice you don’t run every Pod in a sandbox. The common approach is to apply RuntimeClass selectively only to low-trust workloads or to external code in a multi-tenant environment, and leave the rest on the default runtime.
Exam points #
- Creating a RuntimeClass and assigning it to a Pod is a staple task. The type “create a RuntimeClass that uses the `runsc` handler already registered on the node, and make a given Pod use it” shows up. You need to be able to finish the two steps (creating the RuntimeClass and adding `runtimeClassName` to the Pod spec) quickly.
- Distinguish `name` and `handler`. The RuntimeClass’s `metadata.name` is the name the Pod references, and `handler` is the name of the handler registered in the node runtime. In the Pod’s `runtimeClassName` you must write the RuntimeClass’s `name`, not the `handler`.
- Remember the apiVersion. RuntimeClass is `node.k8s.io/v1`. Copying it quickly from the docs is the safe move.
- Assume the runtime is already installed. It’s rare for the candidate to install gVisor or Kata on a node during the exam. The handler is prepared, and the RuntimeClass and Pod connection is what’s graded.
- Know the verification commands. If you doubt whether it applied, enter the Pod with `kubectl exec` and check whether `uname -r` or `dmesg` differs from the host.
- Browsing the gVisor docs is allowed. The official gVisor docs are a designated document you can browse during the exam, so learning where the RuntimeClass example lives ahead of time saves you time.
Wrap-up #
What this post locked in:
- The reason container isolation is weak is the shared host kernel. Since a system call inside the container ends up at the host kernel, a kernel vulnerability becomes the path to a container escape.
- gVisor (runsc) intercepts system calls with a user-space kernel to reduce the surface that touches the host kernel. It’s lightweight but comes with performance and compatibility costs.
- Kata Containers runs containers inside a lightweight VM to gain VM-level strong isolation. The isolation is stronger but comes with startup/memory cost and a virtualization requirement.
- RuntimeClass is a resource that points to the node’s runtime via `handler`, and you apply it with the Pod’s `runtimeClassName`. The runtime must be installed on the node in advance.
- The trade-off is security vs. performance/compatibility. Applying it selectively only to low-trust workloads is the practical pattern.
- For verification, check whether `uname -r` and `dmesg` differ from the host.
Next: Pod-to-Pod mTLS #
Once isolation has corralled workloads within a single node, the next step is to protect the communication going between nodes. As the final topic of the same Minimize Microservice Vulnerabilities domain, we cover mTLS, which encrypts and mutually authenticates the traffic between Pods.
In #12 Pod-to-Pod mTLS: Cilium, we’ll build it by hand and lay out how Cilium adds mTLS to Pod-to-Pod communication, how it meshes with NetworkPolicy, and how to solve the exam tasks that require encrypting communication.