K8s Advanced #1: CNI in Depth — Calico / Cilium / eBPF
This is the first post in the K8s Advanced series. Intermediate #7, while covering NetworkPolicy, left one line hanging: “NetworkPolicy is standard at the K8s manifest level, but actually blocking traffic is the CNI plugin’s job.” That line is the topic of this post. The same kind: NetworkPolicy manifest is translated into iptables rules on Calico (in its default mode) and into eBPF programs on Cilium. The manifest has the same shape, but behavior, performance, and observability at the execution layer differ. This post walks through the requirements of the K8s network model, what the CNI interface actually is, the three data plane models (iptables / IPVS / eBPF), a comparison of Calico and Cilium, and practical criteria for CNI selection, all in one pass.
The K8s Advanced series has six posts:
- #1 CNI in depth — Calico / Cilium / eBPF ← this post
- #2 RBAC / ServiceAccount in depth — Aggregated ClusterRole / Impersonation / IRSA / Workload Identity
- #3 Admission Controller — OPA Gatekeeper / Kyverno
- #4 CRD and the Operator pattern — controller-runtime
- #5 Observability — Prometheus / Grafana / Loki / OpenTelemetry
- #6 GitOps — ArgoCD / Flux
Four requirements of the K8s network model #
Before talking about CNI, consider the conditions K8s requires of any network implementation. K8s itself doesn’t provide one. Instead, it specifies four conditions the network must satisfy, and satisfying them is the CNI plugin’s job.
| Condition | Description |
|---|---|
| Pod-to-Pod | Every Pod can communicate with every Pod without NAT |
| Node-to-Pod | Every node’s agents can communicate with every Pod without NAT |
| Pod self IP | The IP a Pod sees itself by is the same IP others see it by |
| Service abstraction | A virtual IP (ClusterIP) load-balances across multiple Pods |
The third condition isn’t self-evident in container environments. Docker’s default bridge network goes through NAT to talk to the outside, so a container’s own IP from inside differs from the IP others see from outside. K8s rejects this NAT model. Every Pod has a unique IP within the cluster, and uses that IP to refer to itself and to other Pods. On top of this simple model, higher-level objects like Service, DNS, and NetworkPolicy run consistently.
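The difference between the two models can be sketched as a toy packet rewrite. This is a minimal illustration, not real networking code: the addresses are made-up documentation ranges, and the function names `docker_bridge_egress` and `k8s_pod_egress` are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def docker_bridge_egress(pkt: Packet, host_ip: str) -> Packet:
    """Toy model of masquerading (SNAT): the source becomes the host IP."""
    return replace(pkt, src=host_ip)


def k8s_pod_egress(pkt: Packet) -> Packet:
    """Toy model of the K8s requirement: Pod-to-Pod traffic keeps its source IP."""
    return pkt


# Illustrative addresses only.
container_ip = "172.17.0.2"
host_ip = "192.0.2.10"
pod_ip = "10.244.1.5"

seen_from_outside = docker_bridge_egress(Packet(container_ip, "198.51.100.7"), host_ip)
seen_by_peer_pod = k8s_pod_egress(Packet(pod_ip, "10.244.2.9"))

print(seen_from_outside.src == container_ip)  # False: the peer sees the host IP
print(seen_by_peer_pod.src == pod_ip)         # True: the peer sees the Pod's own IP
```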
There is no single way to satisfy these four conditions. You could build an overlay between nodes (VXLAN, Geneve), advertise routes directly into routing tables via BGP, or rewrite the kernel’s packet processing path with eBPF. Which path you pick determines the data plane’s performance characteristics and observability. Choosing a CNI is choosing that path.
CNI — Container Network Interface #
CNI is not a K8s-only spec but the standard interface a container runtime calls when attaching a network to a container. It is a CNCF project, and container runtimes such as podman / cri-o / containerd all call the same interface.
The spec itself is very simple. When the container runtime creates a new container, it executes the CNI plugin binary, passing the container ID and network namespace path as environment variables and the network configuration as JSON on stdin. The plugin creates an interface inside that namespace, allocates an IP, fills the routing table, and returns the result as JSON on stdout.
1. kubelet decides to create a Pod (make a new sandbox container for the Pod)
2. The container runtime (containerd, etc.) creates a network namespace
3. The CNI plugin is executed with that namespace path as argument (ADD command)
4. The plugin creates a veth interface inside the namespace and allocates an IP
5. The plugin updates host-side routing / iptables / eBPF maps, etc.
6. The plugin returns the allocated IP as JSON
7. kubelet records that IP in the Pod status
The key here is that K8s itself doesn’t know about the actual network implementation between steps 4 and 6. Whether it makes a veth, uses MACVLAN, or intercepts packets via eBPF hooks, that responsibility lies entirely with the CNI plugin. K8s only receives the result that “the Pod has an IP and the four conditions above are satisfied.”
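The call contract of the ADD step can be sketched in a few lines. This is a minimal sketch, not a working plugin: the environment variable names (CNI_COMMAND, CNI_CONTAINERID, CNI_NETNS, CNI_IFNAME) and the result fields follow the CNI specification, while `handle_cni`, `allocate_ip`, the fixed address, and the example values are hypothetical. A real plugin would create the veth pair and program routes where the comment indicates.

```python
import json


def allocate_ip(container_id: str) -> str:
    """Hypothetical IPAM step: returns a fixed address for illustration."""
    return "10.244.1.5/24"


def handle_cni(env: dict, stdin_text: str) -> dict:
    """Handle one CNI invocation and return the result document."""
    command = env["CNI_COMMAND"]       # ADD, DEL, CHECK or VERSION
    conf = json.loads(stdin_text)      # network configuration arrives on stdin

    if command != "ADD":
        return {}                      # other commands are out of scope for this sketch

    netns = env["CNI_NETNS"]           # path of the container's network namespace
    ifname = env["CNI_IFNAME"]         # interface name to create inside it
    # A real plugin would create the veth pair and set up routes here.
    address = allocate_ip(env["CNI_CONTAINERID"])

    return {
        "cniVersion": conf["cniVersion"],
        "interfaces": [{"name": ifname, "sandbox": netns}],
        "ips": [{"address": address, "interface": 0}],
    }


# Example values only; a runtime would pass these via os.environ and sys.stdin.
example_env = {
    "CNI_COMMAND": "ADD",
    "CNI_CONTAINERID": "abc123",
    "CNI_NETNS": "/var/run/netns/demo",
    "CNI_IFNAME": "eth0",
}
example_conf = json.dumps({"cniVersion": "1.0.0", "name": "demo-net", "type": "demo-plugin"})

result = handle_cni(example_env, example_conf)
print(json.dumps(result, indent=2))
```

The returned JSON is what the runtime hands back up the stack, which is how the IP ends up in the Pod status without K8s knowing how the interface was built.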
Thanks to this separation, the CNI can be swapped on the same K8s cluster (at cluster setup time). Whether you install Calico, Cilium, or Flannel, K8s API users write the same manifests. Only how that manifest actually takes effect differs. That “how” is the topic of this post.
Two parts of a CNI plugin #
A CNI plugin in an operational cluster usually splits into two parts.
- Node agent (DaemonSet): one runs per node, managing routing / policy / IP allocation. Calico’s calico-node and Cilium’s cilium-agent play this role.
- CNI binary: installed in /opt/cni/bin/ and called directly by the container runtime. The node agent installs this binary into that directory on the node when it starts.
These two parts work together to implement the four K8s requirements per node. Remembering that the manifest is simple but the operational shape splits into two layers makes debugging easier. A “Pod doesn’t get an IP” issue is a matter of stepping through the CNI binary, the node agent, and kubelet to find where things are stuck.
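As a first debugging step on a node, it can help to confirm that the CNI binary and its configuration were actually installed. The sketch below lists both. It assumes the conventional default directories: /opt/cni/bin comes from the text above, while /etc/cni/net.d is the usual configuration directory and is an assumption here, since a cluster may be configured differently. The function names are hypothetical.

```python
from pathlib import Path


def list_names(directory: str, suffixes=None) -> list:
    """Return sorted file names in a directory, optionally filtered by suffix."""
    path = Path(directory)
    try:
        if not path.is_dir():
            return []
        return sorted(
            p.name for p in path.iterdir()
            if p.is_file() and (suffixes is None or p.suffix in suffixes)
        )
    except OSError:
        # Unreadable directory (for example, missing permissions).
        return []


def cni_install_report(bin_dir: str = "/opt/cni/bin", conf_dir: str = "/etc/cni/net.d") -> dict:
    """Summarize which plugin binaries and network configs exist on this node."""
    return {
        "binaries": list_names(bin_dir),
        "configs": list_names(conf_dir, {".conf", ".conflist", ".json"}),
    }


print(cni_install_report())
```

An empty binaries list points at the node agent's install step, and an empty configs list usually means the runtime has no network to attach Pods to yet.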
Three data plane models #
The path K8s network traffic actually flows through is called the data plane. Three models are commonly encountered in clusters: iptables-based, IPVS-based, and eBPF-based. We look at each in turn.
iptables-based — the oldest path #
kube-proxy, a core K8s component, was built on iptables from the start. Distributing ClusterIP traffic to backend Pod IPs is unfolded as iptables NAT rules: a dispatch rule per Service port, plus a rule per backend Pod behind it. The rules are updated whenever Services or Pod IPs change.
The strengths of this model are simplicity and compatibility. Almost every Linux kernel supports iptables, and debugging tools (iptables -L, iptables-save) are abundant. The weakness is that performance degrades as scale grows, because iptables evaluates the rules in a chain linearly. With 1,000 Services and 10 Pods behind each Service, the NAT table holds well over 10,000 rules, and the first packet of a connection walks about 500 Service dispatch rules on average before it matches, then the rules of that Service’s own chain. Rule updates get more expensive too, since kube-proxy has to resync this growing rule set. As cluster scale moves into the medium-to-large range, per-connection CPU cost and sync time rise noticeably.
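How a linear chain of rules can still spread connections evenly across backends is worth a short worked example. kube-proxy's iptables mode is commonly described as using a cascade of random-probability rules (the iptables statistic match): with n backends, the first rule matches with probability 1/n, the next with 1/(n-1), and the last one always matches. The sketch below only checks that arithmetic; the function names are hypothetical.

```python
from fractions import Fraction


def cascade_probabilities(n_endpoints: int) -> list:
    """Per-rule match probability for a chain of n endpoint rules.

    Rule i (0-based) is reached only if all earlier rules missed, so it
    matches with probability 1 / (n - i); the last rule always matches.
    """
    return [Fraction(1, n_endpoints - i) for i in range(n_endpoints)]


def overall_share(n_endpoints: int) -> list:
    """Probability that each endpoint ends up handling a new connection."""
    shares = []
    reach = Fraction(1)  # probability of reaching the current rule
    for p in cascade_probabilities(n_endpoints):
        shares.append(reach * p)
        reach *= 1 - p
    return shares


print([str(p) for p in cascade_probabilities(4)])  # ['1/4', '1/3', '1/2', '1']
print([str(s) for s in overall_share(4)])          # ['1/4', '1/4', '1/4', '1/4']
```

The per-rule probabilities differ, but each backend's overall share comes out equal, which is why the chain order does not bias the load.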
sudo iptables -t nat -L KUBE-SERVICES -n
KUBE-SVC-XYZAB1234567 tcp -- * * 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns */
KUBE-SVC-ABCDE9876543 tcp -- * * 0.0.0.0/0 10.96.45.12 /* default/web:http */
...
When NetworkPolicy is applied, iptables rules grow further. Calico’s default data plane mode is this path, and per-Pod ingress/egress policy rules are added to host iptables chains. The number of policies times the number of Pods drives rule count up quickly.
IPVS-based — kernel-level load balancing #
IPVS is the L4 load-balancing module the Linux kernel ships with. If iptables is a generalized NAT-rule tool, IPVS is a dedicated tool for load balancing itself. It’s hash-table based, so the lookup cost stays nearly constant as rule count grows.
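The gap between a linear rule list and a hash table can be made concrete with a toy model. This is a counting exercise under simplifying assumptions (one rule per Service, every Service equally likely to be the target), not a benchmark of iptables or IPVS; the names are hypothetical.

```python
def linear_lookup(rules: list, key: str) -> int:
    """Return how many rules were compared before the match (list-scan style)."""
    for checked, rule_key in enumerate(rules, start=1):
        if rule_key == key:
            return checked
    return len(rules)


def average_linear_cost(n: int) -> float:
    """Average comparisons when every one of n rules is an equally likely target."""
    rules = [f"svc-{i}" for i in range(n)]
    return sum(linear_lookup(rules, k) for k in rules) / n


def hash_lookup(table: dict, key: str) -> str:
    """A single hashed lookup, independent of table size."""
    return table[key]


for n in (10, 100, 1000):
    table = {f"svc-{i}": f"backend-{i}" for i in range(n)}
    backend = hash_lookup(table, f"svc-{n - 1}")
    print(n, average_linear_cost(n), backend)
```

The average linear cost grows as (n + 1) / 2, while the hashed lookup stays a single step at every size.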
K8s’s kube-proxy has officially supported IPVS mode since 1.11. Running with --proxy-mode=ipvs routes ClusterIP load balancing through IPVS. On large clusters (Services in the thousands or more), average latency and CPU usage are consistently lower than in iptables mode. However, policy enforcement like NetworkPolicy is still handled by iptables (or nftables), so IPVS is only a partial improvement.
eBPF-based — rewriting the kernel itself #
eBPF (extended Berkeley Packet Filter) is a mechanism that lets you safely run user-defined programs inside the Linux kernel. It started as a packet-filtering mechanism; today small programs can attach to a wide range of kernel hook points (system calls, network processing stages, tracepoints).
The reason eBPF is meaningful for networking is simple. eBPF programs can replace iptables / IPVS roles, producing the same results with less CPU and richer observability information. Because code can be inserted directly into the path packets take through the kernel, NAT, load balancing, and policy checks are processed as a bundle. No linear sweep of iptables rules, no separate chains for NetworkPolicy.
[iptables model]
packet → conntrack → KUBE-SERVICES chain → KUBE-SVC-XXX → KUBE-SEP-YYY → DNAT → routing
(linear sweep over N rules)
[eBPF model]
packet → eBPF program at tc/XDP hook
→ eBPF map lookup (Service → Pod list, O(1))
→ policy map lookup (allow or not, O(1))
→ DNAT then forward
Cilium is the representative implementation of this model. Calico has also offered an eBPF data plane mode as an option since 3.13. The differences between the two are covered in the next section.
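The eBPF path in the diagram above can be modeled with two dictionary lookups. This is a conceptual sketch in Python, not eBPF code: real programs are compiled to eBPF bytecode and use kernel maps, and the map layouts, identity labels, and addresses below are all hypothetical.

```python
from typing import Optional

# Hypothetical map contents; a real agent fills eBPF maps from the K8s API.
SERVICE_MAP = {
    # (ClusterIP, port, protocol) -> backend Pod IPs
    ("10.96.45.12", 80, "TCP"): ["10.244.1.5", "10.244.2.9"],
}
POLICY_MAP = {
    # (source identity, destination identity, destination port) -> allowed
    ("frontend", "web", 80): True,
}
POD_IDENTITY = {
    "10.244.1.5": "web",
    "10.244.2.9": "web",
    "10.244.3.3": "frontend",
}


def process_packet(src_ip: str, dst_ip: str, dst_port: int,
                   proto: str = "TCP", flow_hash: int = 0) -> Optional[str]:
    """Return the backend Pod IP after DNAT, or None if the packet is dropped."""
    backends = SERVICE_MAP.get((dst_ip, dst_port, proto))  # Service lookup
    if backends is None:
        return None  # not a known Service VIP; out of scope for this sketch
    backend = backends[flow_hash % len(backends)]          # pick a backend
    key = (POD_IDENTITY.get(src_ip), POD_IDENTITY.get(backend), dst_port)
    allowed = POLICY_MAP.get(key, False)                   # policy lookup
    return backend if allowed else None


print(process_packet("10.244.3.3", "10.96.45.12", 80, flow_hash=1))  # allowed source
print(process_packet("10.244.9.9", "10.96.45.12", 80))               # unknown source
```

Both decisions are keyed lookups, so adding more Services or policies grows the maps rather than the number of steps a packet takes.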
Calico and Cilium — two paths #
The CNI plugins most commonly encountered in K8s clusters are Calico and Cilium. Both fully support NetworkPolicy and both have accumulated enough operational scale track record. The difference lies in the default model of the data plane and how deeply they rely on eBPF.
Calico — BGP + iptables by default, eBPF optional #
Calico’s default data plane is the combination of two parts.
- Inter-node routing — advertises each node’s Pod CIDR to other nodes via BGP (Border Gateway Protocol). Because the node itself acts like a router, no overlay (encapsulation like VXLAN) is needed. In environments where the cloud’s routing table doesn’t know about Pod CIDRs, there are also options to encapsulate via IP-in-IP or VXLAN.
- Intra-node policy / NAT — unfolds Service load balancing and NetworkPolicy via iptables. Same place as kube-proxy’s iptables mode.
The strength of this combination is operational familiarity. Routing runs on BGP, so it pairs naturally with data center BGP infrastructure (especially on-prem ToR-switch environments). iptables rules can be debugged with standard tools. The weakness is that iptables’ limits carry over unchanged as scale grows. As the count of Services / Pods / policies grows, rule count inflates quickly and kube-proxy’s sync time also lengthens.
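The effect of BGP advertisement on a node's routing decision can be sketched as follows. This is a simplified model that assumes each node announces one Pod CIDR with itself as next hop; the CIDRs, node addresses, and function names are hypothetical, and real BGP sessions and kernel routing tables are not involved.

```python
import ipaddress

# Hypothetical advertisements: Pod CIDR -> next hop (the announcing node's IP).
ADVERTISED = {
    "10.244.1.0/24": "192.0.2.11",  # node-a
    "10.244.2.0/24": "192.0.2.12",  # node-b
}


def build_routes(advertised: dict) -> list:
    """Turn advertisements into routes, most specific prefix first."""
    routes = [(ipaddress.ip_network(cidr), hop) for cidr, hop in advertised.items()]
    return sorted(routes, key=lambda route: route[0].prefixlen, reverse=True)


def next_hop(routes: list, pod_ip: str):
    """Longest-prefix match: return the node to forward to, or None."""
    addr = ipaddress.ip_address(pod_ip)
    for network, hop in routes:
        if addr in network:
            return hop
    return None


routes = build_routes(ADVERTISED)
print(next_hop(routes, "10.244.2.9"))  # 192.0.2.12
print(next_hop(routes, "10.250.0.1"))  # None
```

Because the destination node is found by plain IP routing, the packet can travel unencapsulated whenever the underlying network is willing to carry Pod addresses.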
From Calico 3.13, a mode that swaps the data plane to eBPF was added. In this mode, kube-proxy is no longer needed, and both Service load balancing and NetworkPolicy are unfolded via eBPF. However, Calico’s BGP routing model is kept, so it becomes a hybrid shape of “routing on BGP, data plane on eBPF.”
Cilium — eBPF from the start #
Cilium is a CNI designed with eBPF as a premise from the beginning. Service load balancing, NetworkPolicy, L7 policy (allow/deny per HTTP/gRPC method), inter-node encryption (WireGuard / IPsec), and observability (Hubble) are all built on top of its eBPF data plane.
[Each node]
cilium-agent (DaemonSet)
├─ Compiles and loads eBPF programs
├─ Fills Service / Endpoint / NetworkPolicy into eBPF maps
└─ Attaches veth + eBPF hooks on Pod creation
Hubble (optional)
└─ Exposes flows / metrics collected from eBPF
Cilium has three distinguishing features:
- kube-proxy replacement: Service load balancing is handled by Cilium alone, so kube-proxy can be turned off. The cluster’s component count goes down, and the absence of kube-proxy’s iptables rules shortens the node’s packet processing path.
- L7 policy: the standard NetworkPolicy spec is limited to L3/L4 (IP / port), but Cilium can express HTTP method/path, gRPC service/method, and Kafka topic-level policies via its own CRD (CiliumNetworkPolicy). L7 rules are enforced by a proxy (Envoy) that the eBPF data plane redirects matching traffic to, rather than inside the eBPF program itself.
- Hubble, eBPF-based observability: when packets pass eBPF hooks, metadata is collected to provide flow-level visibility. You can see in real time “which Pod calls which port of which Pod.” Observability is covered separately in #5, but Hubble falling out naturally as a byproduct of eBPF is one of Cilium’s appeals.
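To make the L7 policy item concrete, here is a sketch of what such a policy looks like and how an HTTP rule is matched. The manifest shape follows the CiliumNetworkPolicy CRD, but the policy name, labels, and path are hypothetical. The evaluator is a toy that checks only the HTTP rules with Python regexes; it ignores the source and port checks that the real data plane also applies.

```python
import json
import re

# Labels, names and paths are illustrative only.
policy = {
    "apiVersion": "cilium.io/v2",
    "kind": "CiliumNetworkPolicy",
    "metadata": {"name": "web-allow-get-public"},
    "spec": {
        "endpointSelector": {"matchLabels": {"app": "web"}},
        "ingress": [{
            "fromEndpoints": [{"matchLabels": {"app": "frontend"}}],
            "toPorts": [{
                "ports": [{"port": "80", "protocol": "TCP"}],
                "rules": {"http": [{"method": "GET", "path": "/public/.*"}]},
            }],
        }],
    },
}


def http_allowed(policy: dict, method: str, path: str) -> bool:
    """Toy evaluator: does any HTTP rule in the policy match this request?"""
    for ingress in policy["spec"]["ingress"]:
        for to_port in ingress["toPorts"]:
            for rule in to_port.get("rules", {}).get("http", []):
                if re.fullmatch(rule["method"], method) and re.fullmatch(rule["path"], path):
                    return True
    return False


print(json.dumps(policy, indent=2))  # kubectl apply -f also accepts JSON manifests
print(http_allowed(policy, "GET", "/public/index.html"))   # True
print(http_allowed(policy, "POST", "/public/index.html"))  # False
```

A standard NetworkPolicy could only say "frontend may reach web on port 80"; the rules section is what narrows that to specific methods and paths.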
Comparison at a glance #
| Dimension | Calico (default) | Calico (eBPF mode) | Cilium |
|---|---|---|---|
| Pod routing | BGP / IP-in-IP / VXLAN | BGP / IP-in-IP / VXLAN | VXLAN / Geneve / native routing |
| Service load balancing | kube-proxy (iptables/IPVS) | eBPF | eBPF (can replace kube-proxy) |
| NetworkPolicy execution | iptables | eBPF | eBPF |
| L7 policy | Not supported (standard spec) | Not supported | Supported via CiliumNetworkPolicy |
| Observability | External tool needed | External tool needed | Hubble built in |
| Operational tool familiarity | Standard iptables tools | eBPF debugging needed | eBPF debugging needed |
| First adoption barrier | Low | Medium | Medium |
The same K8s manifest runs in any of the three columns. Only the shape that manifest unfolds into inside the node differs by column. That difference carries straight through to performance, observability, and the choice of operational tools.
What eBPF changes #
Since eBPF comes up often when discussing CNI selection, it is worth looking at what eBPF has changed in K8s networking. eBPF itself isn’t a K8s component but a Linux kernel feature; as the K8s network data plane actively uses it, the operational model has shifted in meaningful ways.
kube-proxy’s role disappears #
For a long time, kube-proxy was a mandatory K8s cluster component. Its responsibility was distributing ClusterIP virtual IPs to actual Pod IPs, and it did that work via iptables or IPVS.
Cilium can fully replace kube-proxy with the kubeProxyReplacement: true option, and Calico’s eBPF mode does the same. Removing a component shrinks the operational surface accordingly: one fewer thing to monitor, one fewer candidate to suspect for sync delays, one fewer source of rule explosion.
NetworkPolicy’s cost model changes #
iptables-based NetworkPolicy grows rules in proportion to the number of policies and the number of Pods. In eBPF-based, policies are expressed as maps, so lookup cost stays nearly constant. In multi-tenant clusters where policy count grows into the hundreds or thousands, this difference shows up as per-packet latency.
Observability becomes a byproduct of the data plane #
In the traditional model, traffic visibility was separate work — attach a sidecar to a Pod, run tcpdump on a NodePort, install a separate monitoring agent. In the eBPF data plane, since packets pass through eBPF hooks anyway, collecting metadata (source Pod / destination Pod / policy check result / latency) along that path naturally produces flow-level visibility. Cilium’s Hubble is a direct product of this model.
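Once the data plane emits a record for every flow it handles, answering “which Pod calls which port of which Pod” becomes a matter of aggregation. The sketch below assumes a simplified flow record; the field names, Pod names, and sample data are hypothetical and do not reproduce Hubble's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    src_pod: str
    dst_pod: str
    dst_port: int
    verdict: str  # e.g. "FORWARDED" or "DROPPED"


def summarize(flows) -> Counter:
    """Count flows per (source Pod, destination Pod, port, verdict)."""
    return Counter((f.src_pod, f.dst_pod, f.dst_port, f.verdict) for f in flows)


# Made-up sample records standing in for what the hooks would emit.
flows = [
    Flow("frontend-1", "web-1", 80, "FORWARDED"),
    Flow("frontend-1", "web-1", 80, "FORWARDED"),
    Flow("batch-1", "web-1", 80, "DROPPED"),
]

for (src, dst, port, verdict), count in summarize(flows).most_common():
    print(f"{src} -> {dst}:{port} {verdict} x{count}")
```

The policy verdict travels with each record, so dropped traffic is visible in the same stream as allowed traffic instead of requiring a separate capture.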
CNI selection — practical criteria #
In theory, any CNI satisfies the four conditions K8s requires. In practice, the CNI decision for an operational cluster is where five dimensions converge.
| Dimension | Question |
|---|---|
| Cluster scale | What order of magnitude for Service / Pod / NetworkPolicy count |
| Network environment | Cloud managed, or on-prem BGP infrastructure |
| L7 policy need | Are HTTP / gRPC policies entering operational requirements |
| Operations team familiarity | Comfortable with iptables debugging, or willing to move to eBPF tools |
| Managed K8s default | Use the EKS / GKE / AKS default CNI as is, or swap |
Managed K8s comes with a default CNI chosen by the cloud vendor. EKS has aws-vpc-cni (the model that allocates VPC IPs directly to Pods), GKE has its own CNI (VPC-native mode), AKS has Azure CNI or kubenet. This default CNI integrates most smoothly with that cloud’s networking but may lack the NetworkPolicy or eBPF features you need, so swapping to Calico / Cilium per operational needs is common. On EKS, a common pattern is keeping aws-vpc-cni for Pod networking and adding NetworkPolicy enforcement on top with Calico in policy-only mode or Cilium in CNI chaining mode.
For small / simple clusters, sticking with the managed vendor’s default CNI is the most reasonable decision. Operational burden is smallest, and it fits best with the cloud vendor’s support. Once NetworkPolicy needs arrive in earnest, or when multi-tenant isolation / L7 policy / fine-grained flow visibility become operational requirements, a review of Calico / Cilium adoption naturally starts.
The selection decision, in one line each:
- Calico (default mode) — the path a team with existing BGP infrastructure and iptables debugging familiarity can adopt fastest. The least burdensome choice on small to medium clusters.
- Calico (eBPF mode) — the mode for when you want to keep routing as is and only lift data plane performance. A compromise that keeps BGP assets while gaining eBPF benefits.
- Cilium — the right choice when you want L7 policy / Hubble observability / kube-proxy removal as a bundle. The choice that bets seriously on eBPF.
This decision isn’t easy to change once made. CNI replacement is close to reconfiguring the entire cluster’s network, so it is usually locked in at cluster setup time. When making the initial choice, it is worth factoring in the operational picture for the next one to two years.
Closing #
That wraps up the first post in the K8s Advanced series. Starting from the four conditions the K8s network model requires, this post followed how the responsibility for satisfying them lies with the CNI plugin, and how the same manifest runs in different shapes depending on which path (iptables / IPVS / eBPF) that plugin builds the data plane on. We compared Calico’s BGP + iptables model with Cilium’s eBPF-from-the-start model, and traced how eBPF replaced kube-proxy’s role, changed NetworkPolicy’s cost model, and made observability a byproduct of the data plane. The next post digs, at the same depth, into the RBAC topic that Intermediate #7 left as a single line, covering the paths where the K8s permission model connects with external IAM: Aggregated ClusterRole, impersonation, EKS’s IRSA, and GKE’s Workload Identity.