CNI in Depth
How the same NetworkPolicy manifest resolves into iptables rules on Calico and into eBPF programs on Cilium — the depth of the data plane. We cover the four conditions of the Kubernetes network model, what the CNI interface actually is, the three data plane models (iptables · IPVS · eBPF), a comparison of Calico and Cilium, and the practical criteria for choosing a CNI.
This is the first chapter of Part 3 (Depth). When Chapter 14 RBAC / NetworkPolicy / ResourceQuota covered NetworkPolicy, we left one line behind: “NetworkPolicy is a standard at the K8s manifest level, but actually blocking traffic is the CNI plugin’s job.” That line is the subject of this chapter. The same manifest with kind: NetworkPolicy resolves into iptables rules on Calico and into eBPF programs on Cilium. Even when the manifest is identical, behavior, performance, and observability differ at the execution layer. In this chapter we cover what the K8s network model requires, what the CNI interface actually is, the three models of the data plane (iptables / IPVS / eBPF), a comparison of Calico and Cilium, and the practical criteria for choosing a CNI.
By the end of this chapter you’ll be able to tell, in a single line, how a given K8s manifest behaves on a given CNI. The trade-offs of a CNI decision — the familiarity of iptables debugging vs. the performance and visibility of eBPF — also become tangible.
The four things the K8s network model requires #
Before talking about CNI, let’s pin down the conditions K8s requires of a network implementation. K8s itself doesn’t have a network implementation. Instead it lays down, as a specification, the four conditions a network must satisfy, and satisfying those conditions is the CNI plugin’s job.
| Condition | Description |
|---|---|
| Pod-to-Pod | Every Pod can communicate with every Pod without NAT |
| Node-to-Pod | Every node’s agent can communicate with every Pod without NAT |
| Pod self IP | The IP a Pod sees itself as and the IP another Pod calls that Pod by are the same |
| Service abstraction | A virtual IP (ClusterIP) load-balances across several Pods |
The third condition isn’t self-evident in a container environment. Docker’s default bridge network goes through NAT to communicate with the outside, so the self IP inside the container differs from the self IP seen from outside. K8s rejects this NAT model. Every Pod has a unique IP within the cluster, and it calls both itself and other Pods by that IP. On top of this simple model, higher-level objects like the ClusterIP · DNS of Chapter 5 Service and the NetworkPolicy of Chapter 14 run consistently.
There’s more than one way to satisfy these four conditions. You can build an overlay that links the nodes (VXLAN, Geneve), you can advertise routes directly into the routing table over BGP, or you can rewrite the kernel’s packet-processing path itself with eBPF. Which path you choose determines the data plane’s performance characteristics and observability. Choosing a CNI means choosing which path to take.
CNI — Container Network Interface #
CNI isn’t a K8s-only spec; it’s a standard interface that a container runtime calls when attaching a network to a container. It’s a specification managed by the CNCF, and it isn’t used only under kubelet: podman, CRI-O, and containerd all call the same interface.
The spec itself is very simple. When the container runtime creates a new container, it invokes the CNI plugin’s executable, passing the container ID and the path to the network namespace. The plugin creates an interface inside that namespace, assigns an IP, fills in the routing table, and returns the result as JSON.
1. kubelet decides to create a Pod (create a new sandbox container for the Pod)
2. The container runtime (containerd, etc.) creates a network namespace
3. The CNI plugin is run with that namespace path as an argument (ADD command)
4. The plugin creates a veth interface inside the namespace and assigns an IP
5. The plugin updates the host-side routing · iptables · eBPF maps, etc.
6. The plugin returns the assigned IP as JSON
7. kubelet records that IP in the Pod status
The key here is that K8s itself doesn’t know the actual network implementation in steps 4 through 6. Whether the plugin creates a veth pair, creates a MACVLAN interface, or intercepts packets with an eBPF hook is entirely the CNI plugin’s responsibility. K8s only receives the result that “an IP was attached to the Pod and the four conditions above are satisfied.”
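The seven steps above can be sketched from the plugin's side. This is a minimal Python sketch, not a real plugin. It assumes the runtime passes CNI_COMMAND, CNI_CONTAINERID, CNI_NETNS, and CNI_IFNAME as environment variables and the network configuration as JSON on stdin, as the CNI specification describes. The flat "subnet" key, the in-memory `allocated` set, the gateway choice, and the name `handle_add` are illustrative assumptions, and the sketch skips the actual interface creation.

```python
import ipaddress
import json


def handle_add(env, stdin_text, allocated):
    """Toy ADD handler: pick a free IP and build a CNI-style result.

    A real plugin would also create the veth pair inside env["CNI_NETNS"]
    and program host routes; this sketch only models the bookkeeping.
    """
    if env["CNI_COMMAND"] != "ADD":
        raise ValueError("this sketch only handles ADD")

    conf = json.loads(stdin_text)
    # Assumption: the config carries a flat "subnet" key (hypothetical layout).
    subnet = ipaddress.ip_network(conf["subnet"])
    hosts = iter(subnet.hosts())
    gateway = next(hosts)  # assumption: first usable address is the gateway

    for candidate in hosts:
        if candidate not in allocated:
            allocated.add(candidate)
            break
    else:
        raise RuntimeError("subnet exhausted")

    return {
        "cniVersion": conf["cniVersion"],
        "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
        "ips": [{
            "address": f"{candidate}/{subnet.prefixlen}",
            "gateway": str(gateway),
            "interface": 0,  # index into the "interfaces" list above
        }],
    }


env = {
    "CNI_COMMAND": "ADD",
    "CNI_CONTAINERID": "abc123",         # hypothetical sandbox ID
    "CNI_NETNS": "/var/run/netns/demo",  # hypothetical namespace path
    "CNI_IFNAME": "eth0",
}
conf = json.dumps({"cniVersion": "1.0.0", "name": "demo-net", "subnet": "10.244.1.0/24"})
result = handle_add(env, conf, allocated=set())
print(json.dumps(result, indent=2))
```

The JSON printed at the end corresponds to step 6: the runtime reads it from the plugin's stdout and hands the address on so that kubelet can record it in the Pod status.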
Thanks to this separation, you can choose a different CNI for the same K8s cluster at setup time. Whether you install Calico, Cilium, or Flannel, the K8s API user writes the same manifest. What changes is how that manifest actually resolves, and that is the focus of this chapter.
The two parts of a CNI plugin #
A production cluster’s CNI plugin is usually split into two parts.
- Node agent (DaemonSet): one runs on each node and manages routing / policy / IP allocation. Calico’s calico-node and Cilium’s cilium-agent play this role.
- CNI binary: installed in /opt/cni/bin/ and called directly by the container runtime. The node agent unpacks this binary into the node’s directory at boot.
These two parts work together to implement the four conditions K8s requires, node by node. Remembering that the manifest is simple while the operational shape is split into two layers makes debugging easier. A “no IP attached to the Pod” problem is a matter of tracing, layer by layer, whether the CNI binary, the node agent, or kubelet is where things broke down. This diagnostic tree is organized once more in Chapter 27 kubectl debugging patterns.
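As a first step in that layer-by-layer trace, it can help to confirm that the CNI binary layer is present on the node at all. The following Python sketch lists what is installed. /opt/cni/bin comes from the text above; /etc/cni/net.d as the configuration directory is an assumption based on the conventional default, and the file names in the demo are hypothetical.

```python
import os
import tempfile


def inspect_cni_install(bin_dir="/opt/cni/bin", conf_dir="/etc/cni/net.d"):
    """Report which CNI binaries and network configs exist on this node."""
    def listing(path, suffixes=None):
        if not os.path.isdir(path):
            return None  # directory missing: nothing was ever installed here
        names = sorted(os.listdir(path))
        if suffixes:
            names = [n for n in names if n.endswith(suffixes)]
        return names

    return {
        "binaries": listing(bin_dir),
        "configs": listing(conf_dir, (".conf", ".conflist", ".json")),
    }


# Demo against a throwaway directory tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    bin_dir = os.path.join(root, "bin")
    conf_dir = os.path.join(root, "net.d")
    os.makedirs(bin_dir)
    os.makedirs(conf_dir)
    open(os.path.join(bin_dir, "demo-cni"), "w").close()           # hypothetical binary
    open(os.path.join(conf_dir, "10-demo.conflist"), "w").close()  # hypothetical config
    report = inspect_cni_install(bin_dir, conf_dir)
    missing = inspect_cni_install(os.path.join(root, "nope"), conf_dir)

print(report)
print(missing)
```

On a real node you would call `inspect_cni_install()` with its defaults. An empty or missing result points at the node agent's install step, while a populated one shifts suspicion to the agent itself or to kubelet.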
The three models of the data plane #
The path K8s network traffic actually flows along is called the data plane. There are three models you commonly meet in a cluster — iptables-based, IPVS-based, and eBPF-based. Let’s take them one at a time.
iptables-based — the oldest path #
K8s’s built-in component kube-proxy was built on iptables from the start. It handles distributing traffic coming into a ClusterIP across the Pod IPs behind it with iptables NAT rules. A rule is added per Service, and the rules are updated whenever a Pod’s IP changes. We touched on this model once in §“kube-proxy” of Chapter 5 Service, and this chapter adds the depth.
The strength of this model is simplicity and compatibility. Nearly every Linux kernel supports iptables, and the debugging tools (iptables -L, iptables-save) are rich. The weakness is that performance degrades as scale grows, because iptables evaluates the rules in a chain linearly. With 1,000 Services and 10 Pods behind each, kube-proxy maintains well over 10,000 NAT rules, and a packet that opens a new connection to a ClusterIP walks about 500 rules of the KUBE-SERVICES chain on average before it reaches the per-Service chain that picks a Pod. As cluster scale moves into the mid-to-large range, the per-packet CPU cost rises noticeably.
sudo iptables -t nat -L KUBE-SERVICES -n
KUBE-SVC-XYZAB1234567 tcp -- * * 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns */
KUBE-SVC-ABCDE9876543 tcp -- * * 0.0.0.0/0 10.96.45.12 /* default/web:http */
...
When NetworkPolicy is installed, the iptables rules grow further. Calico’s default data plane mode is this path, and per Pod, ingress · egress policy rules are added to the host’s iptables chains. The number of policies and the number of Pods multiply, so the rule count grows quickly.
IPVS-based — load balancing at the kernel layer #
IPVS is a layer-4 load-balancing module the Linux kernel has. If iptables is a tool that generalized NAT rules, IPVS is a dedicated tool for load balancing itself. Because it works on a hash table, the lookup cost stays roughly constant even as the rule count grows.
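The difference between walking a rule list and consulting a hash table can be made concrete with a small simulation. This Python sketch models neither iptables nor IPVS internals. It assumes a chain with one entry per Service and a toy bucketed table (`BucketTable` is a hypothetical name), and it only counts how many entries are compared per lookup.

```python
def linear_lookup(chain, dst_ip):
    """Walk the chain in order, like a rule list; return (target, rules checked)."""
    for checked, (vip, target) in enumerate(chain, start=1):
        if vip == dst_ip:
            return target, checked
    return None, len(chain)


class BucketTable:
    """Tiny hash table with explicit buckets so comparisons can be counted."""

    def __init__(self, n_buckets):
        self.buckets = [[] for _ in range(n_buckets)]

    def insert(self, key, value):
        self.buckets[hash(key) % len(self.buckets)].append((key, value))

    def lookup(self, key):
        checked = 0
        for k, value in self.buckets[hash(key) % len(self.buckets)]:
            checked += 1
            if k == key:
                return value, checked
        return None, checked


BASE = (10 << 24) | (96 << 16)  # 10.96.0.0 as an integer (example Service range)
N_SERVICES = 1000

vips = [BASE + i for i in range(1, N_SERVICES + 1)]
chain = [(vip, f"svc-{i}") for i, vip in enumerate(vips)]
table = BucketTable(n_buckets=2048)
for vip, target in chain:
    table.insert(vip, target)

# Average number of entries compared when every Service is looked up once.
avg_linear = sum(linear_lookup(chain, vip)[1] for vip in vips) / N_SERVICES
avg_hashed = sum(table.lookup(vip)[1] for vip in vips) / N_SERVICES
print(avg_linear, avg_hashed)
```

With 1,000 entries the linear walk averages 500.5 comparisons per lookup and the bucketed table averages 1.0. Doubling the number of Services doubles the first figure and, as long as the table is sized for it, leaves the second roughly unchanged.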
kube-proxy has supported IPVS mode as a stable feature since K8s 1.11. Start it with --proxy-mode=ipvs and ClusterIP load balancing is done with IPVS. On large clusters (thousands of Services or more), average latency and CPU usage are typically lower than in iptables mode. That said, the policy layer like NetworkPolicy is still handled by iptables (or nftables), so IPVS is a partial improvement.
eBPF-based — the path that rewrites the kernel itself #
eBPF (extended Berkeley Packet Filter) is a mechanism for safely inserting user-defined programs inside the Linux kernel. It started out for packet filtering, but now you can attach small programs to almost any kernel hook point (system calls, network-processing stages, tracing points).
The reason eBPF matters on the network side is simple. eBPF programs can take over the role of iptables / IPVS, and produce the same outcome with less CPU and richer observability information. Because code can step directly into the path a packet takes through the kernel, NAT · load balancing · policy checks are handled together. It neither scans iptables rules linearly nor builds a separate chain for NetworkPolicy.
[iptables model]
packet → conntrack → KUBE-SERVICES chain → KUBE-SVC-XXX → KUBE-SEP-YYY → DNAT → routing
(linear scan of N rules)
[eBPF model]
packet → eBPF program at the tc/XDP hook
→ eBPF map lookup (Service → Pod list, O(1))
→ policy map lookup (allow or not, O(1))
→ forward after DNAT
Cilium is the representative implementation of this model. Calico also offers an eBPF data plane mode as an option from 3.13. The difference between the two products is covered in the next section.
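The eBPF path in the diagram above (Service map lookup, policy map lookup, then forwarding) can be mimicked with plain dictionaries. This Python sketch is only a model of the control flow: real eBPF programs are written in restricted C and run in the kernel. The map layouts, the identity labels, the hash-based backend choice, and the name `process_packet` are all assumptions made for illustration.

```python
import zlib

# Stand-ins for eBPF maps: plain dicts keyed the way the diagram suggests.
SERVICE_MAP = {
    # (ClusterIP, port, protocol) -> list of backend (Pod IP, port)
    ("10.96.45.12", 80, "tcp"): [("10.244.1.7", 8080), ("10.244.2.9", 8080)],
}
POLICY_MAP = {
    # (source identity, destination port, protocol) -> allowed
    ("frontend", 8080, "tcp"): True,
}


def process_packet(src_identity, src_ip, src_port, dst_ip, dst_port, proto):
    """Return the forwarding decision for one packet."""
    backends = SERVICE_MAP.get((dst_ip, dst_port, proto))  # one map lookup
    dnat = backends is not None
    if dnat:
        # Hash the flow so every packet of one connection picks the same backend.
        flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
        dst_ip, dst_port = backends[zlib.crc32(flow) % len(backends)]

    # Policy is checked against the real destination, after translation.
    allowed = POLICY_MAP.get((src_identity, dst_port, proto), False)
    verdict = "forward" if allowed else "drop"
    return {"verdict": verdict, "dst": (dst_ip, dst_port), "dnat": dnat}


allowed = process_packet("frontend", "10.244.3.4", 51000, "10.96.45.12", 80, "tcp")
denied = process_packet("batch", "10.244.3.5", 51001, "10.96.45.12", 80, "tcp")
print(allowed)
print(denied)
```

Each decision costs two dictionary lookups regardless of how many Services or policies exist, which is the property the O(1) annotations in the diagram refer to.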
Calico and Cilium — two paths #
The CNI plugins you most often meet in a K8s cluster are Calico and Cilium. Both fully support NetworkPolicy, and both have accumulated plenty of operational-scale track record. The difference lies in the default model of the data plane and how deeply they depend on eBPF.
Calico — BGP · iptables by default, eBPF as an option #
Calico’s default data plane is a combination of two parts.
- Inter-node routing — it advertises each node’s Pod CIDR to the other nodes over BGP (Border Gateway Protocol). Because the node itself acts like a router, no overlay (encapsulation like VXLAN) is needed. In environments where the cloud’s routing table doesn’t know the Pod CIDR, there’s also an option to encapsulate with IP-in-IP or VXLAN.
- Intra-node policy / NAT — it handles Service load balancing and NetworkPolicy with iptables. This is the same position as kube-proxy’s iptables mode.
The strength of this combination is operational familiarity. Because routing runs on BGP, it ties in naturally with a data center’s BGP infrastructure (especially on-prem, ToR switch environments). iptables rules can be debugged with standard tools. The weakness is that as scale grows, iptables’s limits come along too. As the number of Services / Pods / policies grows, the rule count balloons quickly and kube-proxy’s sync time gets longer too.
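The inter-node routing half of this combination boils down to a table of "Pod CIDR reachable via node" entries. The Python sketch below models what the advertisements produce, not BGP itself. The node IPs and /24 Pod CIDRs are hypothetical, and `build_routes` and `next_hop` are illustrative names.

```python
import ipaddress


def build_routes(advertisements):
    """Turn (node IP, Pod CIDR) advertisements into a routing table."""
    return [(ipaddress.ip_network(cidr), node_ip) for node_ip, cidr in advertisements]


def next_hop(routes, pod_ip):
    """Longest-prefix match: the most specific CIDR containing pod_ip wins."""
    addr = ipaddress.ip_address(pod_ip)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None  # no node announced a range covering this address
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Hypothetical cluster: each node announces the Pod CIDR it owns.
advertisements = [
    ("192.168.0.11", "10.244.1.0/24"),
    ("192.168.0.12", "10.244.2.0/24"),
    ("192.168.0.13", "10.244.3.0/24"),
]
routes = build_routes(advertisements)
print(next_hop(routes, "10.244.2.9"))
print(next_hop(routes, "10.250.0.1"))
```

A packet for 10.244.2.9 is sent to node 192.168.0.12 unencapsulated, which is why no overlay is needed when the underlying network can carry these routes.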
From Calico 3.13, a mode that swaps the data plane out for eBPF was added. In this mode kube-proxy is no longer needed, and both Service load balancing and NetworkPolicy resolve through eBPF. However, Calico’s BGP routing model stays in place, so it becomes the mixed shape of “routing on BGP, data plane on eBPF.”
Cilium — eBPF from the start #
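Cilium is a CNI designed around eBPF from the start. Service load balancing, NetworkPolicy, inter-node encryption (WireGuard / IPsec), and observability (Hubble) are all built on eBPF programs. Layer-7 policy (allow / deny per HTTP · gRPC method) rides on the same data plane: eBPF selects the matching traffic and hands it to a node-local Envoy proxy for the layer-7 decision.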
[each node]
cilium-agent (DaemonSet)
├─ compiles · loads eBPF programs
├─ fills Service / Endpoint / NetworkPolicy into eBPF maps
└─ attaches veth + eBPF hooks when a Pod is created
Hubble (optional)
└─ exposes flows · metrics collected from eBPF
Three points distinguish Cilium.
- kube-proxy replacement: because Cilium alone handles Service load balancing, you can turn kube-proxy off. The cluster’s component count drops, and because kube-proxy’s iptables rules disappear, the node’s packet-processing path gets shorter.
- Layer-7 policy: the standard spec of NetworkPolicy is limited to layer 3 / layer 4 (IP / port), but Cilium can express policies per HTTP method · path, gRPC service / method, and Kafka topic with its own CRD (CiliumNetworkPolicy). eBPF redirects the traffic these policies select to a node-local proxy that makes the layer-7 decision.
- Hubble (eBPF-based observability): when a packet passes an eBPF hook, Cilium collects metadata to provide flow-level visibility. You can see in real time “which Pod calls which port of which Pod.” Observability is covered separately in Chapter 19 Observability, but the fact that Hubble follows naturally as a byproduct of eBPF is one of Cilium’s attractions.
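To make the layer-7 point concrete, here is what such a policy might look like, written as a Python dictionary so it can be both printed and evaluated. The field names follow the CiliumNetworkPolicy CRD as I understand it. The labels, port, and path are hypothetical, and `http_allowed` is a toy evaluator that approximates the rule matching with Python regular expressions rather than reproducing Cilium's behavior.

```python
import json
import re

# Labels, port, and path below are hypothetical examples.
policy = {
    "apiVersion": "cilium.io/v2",
    "kind": "CiliumNetworkPolicy",
    "metadata": {"name": "allow-get-public"},
    "spec": {
        "endpointSelector": {"matchLabels": {"app": "web"}},
        "ingress": [{
            "fromEndpoints": [{"matchLabels": {"app": "frontend"}}],
            "toPorts": [{
                "ports": [{"port": "80", "protocol": "TCP"}],
                "rules": {"http": [{"method": "GET", "path": "/public/.*"}]},
            }],
        }],
    },
}


def http_allowed(policy, method, path):
    """Toy evaluator: allow if any http rule matches both method and path."""
    for ingress in policy["spec"]["ingress"]:
        for to_port in ingress["toPorts"]:
            for rule in to_port["rules"]["http"]:
                if (re.fullmatch(rule["method"], method)
                        and re.fullmatch(rule["path"], path)):
                    return True
    return False


print(json.dumps(policy, indent=2))  # kubectl apply -f also accepts JSON
print(http_allowed(policy, "GET", "/public/index.html"))
print(http_allowed(policy, "POST", "/public/index.html"))
```

A standard NetworkPolicy could only say "frontend may reach web on TCP 80". The http rule narrows that to GET requests under /public/, so the POST in the last line is rejected even though it uses the same port.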
A comparison at a glance #
| Dimension | Calico (default) | Calico (eBPF mode) | Cilium |
|---|---|---|---|
| Pod routing | BGP / IP-in-IP / VXLAN | BGP / IP-in-IP / VXLAN | VXLAN / Geneve / native routing |
| Service load balancing | kube-proxy (iptables / IPVS) | eBPF | eBPF (can replace kube-proxy) |
| NetworkPolicy execution | iptables | eBPF | eBPF |
| Layer-7 policy | not supported (standard spec) | not supported | supported via CiliumNetworkPolicy |
| Observability | needs external tools | needs external tools | Hubble built in |
| Operational-tool familiarity | standard iptables tools | eBPF debugging needed | eBPF debugging needed |
| Barrier to first adoption | low | medium | medium |
The same K8s manifest works in any of the three columns. But what shape that manifest takes inside the node differs by column. This difference is reflected directly in the choice of performance · observability · operational tools.
What eBPF changes #
Since eBPF comes up often when discussing CNI choice, let’s pin down the changes eBPF has brought to the K8s network. eBPF itself isn’t a K8s component but a Linux kernel feature; still, as the K8s network’s data plane uses this feature aggressively, the operational model has changed shape.
kube-proxy’s role disappears #
For a long time, kube-proxy was an essential component of a K8s cluster. Distributing a ClusterIP’s virtual IP across the actual Pod IPs was this component’s responsibility, and it resolved that work with iptables or IPVS.
Cilium can fully replace kube-proxy with the kubeProxyReplacement: true option, and Calico’s eBPF mode does the same thing. When one component drops out, the operational surface area shrinks by that much — there’s one fewer thing to monitor, one fewer candidate to suspect for sync delays, and one fewer cause of rule explosion.
NetworkPolicy’s cost model changes #
iptables-based NetworkPolicy grows rules in proportion to the number of policies and the number of Pods. In the eBPF-based one, policies are expressed as maps, so the lookup cost stays roughly constant. In a multi-tenant cluster where the policy count grows to hundreds or thousands, this difference shows up as per-packet latency.
Observability becomes a byproduct of the data plane #
In the traditional model, traffic visibility was a separate task — attach a sidecar to the Pod, hang a tcpdump on the NodePort, or install a separate monitoring agent. In the eBPF data plane, since packets pass through eBPF hooks anyway, collecting metadata (source Pod / destination Pod / policy-check result / latency) along that path naturally produces flow-level visibility. Cilium’s Hubble is the direct product of this model.
Choosing a CNI — practical criteria #
In theory, any CNI satisfies the four conditions K8s requires. But a production cluster’s CNI decision is a decision where the following five dimensions converge.
| Dimension | Question |
|---|---|
| Cluster scale | How many digits is the count of Services / Pods / NetworkPolicies |
| Network environment | Is it cloud-managed, or on-prem BGP infrastructure |
| Whether layer-7 policy is needed | Are HTTP / gRPC-level policies in the operational requirements |
| The ops team’s familiarity | Comfortable with iptables debugging, or willing to move to eBPF tools |
| The managed K8s default | Will you use EKS / GKE / AKS’s default CNI unchanged, or swap it |
Managed K8s has a default CNI the cloud provider pushes. EKS has aws-vpc-cni (a model that assigns a VPC IP directly to a Pod), GKE has its own CNI (VPC-native mode), and AKS has Azure CNI or kubenet. This default CNI ties most smoothly into that cloud’s networking, but its NetworkPolicy support or eBPF features may fall short, so depending on operational requirements, swapping to Calico / Cilium is common. On EKS, a common pattern is to keep aws-vpc-cni while layering only NetworkPolicy on with Calico or Cilium’s chained mode. The decision of a full setup on EKS is covered once more in Chapter 21 EKS Cluster Setup.
On small, simple clusters, using the managed provider’s default CNI unchanged is the most reasonable decision. The operational burden is smallest, and the rapport with cloud support is best. Once NetworkPolicy requirements become serious, or once multi-tenant isolation · layer-7 policy · fine-grained flow visibility become operational requirements, the review of adopting Calico / Cilium naturally begins.
Let’s break down the tradeoffs, one line each.
- Calico (default mode) — the path a team that already has BGP infrastructure and is comfortable with iptables debugging can adopt fastest. The least burdensome choice on small-to-medium clusters.
- Calico (eBPF mode) — the mode to use when you want to leave routing unchanged and only improve data plane performance. A compromise that keeps the BGP assets alive while bringing in the benefits of eBPF.
- Cilium — the right choice when you want to take layer-7 policy · Hubble observability · kube-proxy removal together. A choice that bets seriously on eBPF.
This decision, once made, isn’t easy to change. Replacing the CNI is close to reconfiguring the entire cluster’s network, so it’s usually pinned at the time of a fresh cluster setup. So when deciding for the first time, it’s better to look at the operational picture for the next 1 ~ 2 years together.
Exercises #
- Check what the CNI of a cluster you operate or are studying is (kubectl get pods -n kube-system | grep -E 'calico|cilium|flannel|aws-node'). Organize that CNI’s data plane model (iptables / IPVS / eBPF) into one line against §“The three models of the data plane,” and write a paragraph in your own words on how the same K8s manifest resolves on top of it.
- On an iptables-based cluster, count the rules with sudo iptables -t nat -L KUBE-SERVICES -n | wc -l and note by what ratio it multiplies against the same cluster’s Service count (kubectl get svc -A | wc -l). Reason about how the linear-scan cost of §“iptables-based” accumulates when the Service count grows to thousands, and compare how it differs from the eBPF-based map lookup (O(1)).
- Look at the §“A comparison at a glance” table of Calico and Cilium, and against the five decision dimensions of the cluster you’re about to operate (scale, network environment, whether layer-7 policy is needed, team familiarity, managed default), decide in one paragraph which side you’d pick and write the reason. If it’s an EKS environment, also compare which of aws-vpc-cni + Calico/Cilium chained mode and a single Cilium fits your requirements.
In one line: the responsibility for satisfying the four conditions of the K8s network model (NAT-free Pod-to-Pod, Node-to-Pod, Pod self IP, Service abstraction) lies with the CNI plugin. The same manifest resolves differently on Calico (BGP + iptables) · Calico eBPF · Cilium (eBPF from the start). The eBPF data plane takes over kube-proxy’s role, changes NetworkPolicy’s cost model from O(N) to O(1), and makes observability (Hubble) a byproduct of the data plane. A CNI choice, once made, is a hard decision to change, so look at the 1 ~ 2 year operational picture together at cluster setup time.
Next chapter #
Having unpacked the CNI’s data plane in this chapter, the next chapter handles, in the same direction, the part we cut off in one line in the RBAC section of Chapter 14 — Aggregated ClusterRole, impersonation, and the paths by which the K8s permission model ties into external IAM, like EKS’s IRSA and GKE’s Workload Identity.
Chapter 16 RBAC / ServiceAccount in depth organizes the aggregation mechanism of ClusterRole, the impersonation flow of the --as option, the lifecycle of a ServiceAccount token (projected token), and the mapping to cloud IAM — EKS’s IRSA · Pod Identity, GKE’s Workload Identity. Like this chapter’s CNI, it’s another cross-section of how a standard K8s object connects to cloud infrastructure.