Certified Kubernetes Administrator (CKA) #14 Scheduling 2: Taints/tolerations, Priority/PriorityClass, preemption
In #13 Scheduling 1 we covered the mechanism by which a Pod chooses a node through nodeSelector and affinity. This post runs in the opposite direction. We cover taints, where a node pushes Pods away, plus PriorityClass and preemption, which decide who gets saved first when resources run short.
If affinity is a Pod’s pull, a taint is a node’s push. The two face opposite directions, so keeping them together lets you control node placement from both sides. Add priority on top, and when a node is full the scheduler will evict lower-priority Pods to admit higher-priority ones — that’s preemption. From an operator’s point of view, we’ll work through each one hands-on with YAML and kubectl.
Taint and Toleration: the node rejects and the Pod accepts #
A taint and a toleration work as a pair. A taint is a rejection mark placed on a node, and a toleration is a Pod’s declaration that it will endure that rejection. When a node carries a taint, only Pods with a toleration that endures that taint may enter the node.
Node: "I've placed a rejection mark called key=value:NoSchedule"
Pod A: no toleration → cannot enter this node
Pod B: endures the same taint → can enter this node
The key point here is that a toleration is not permission but an exemption. A Pod with a toleration doesn’t necessarily go to that node; it is merely exempted from the node’s rejection. To actually send a Pod to that node, you also need the nodeAffinity or nodeSelector from #13. This distinction is a frequent point of confusion in the exam.
The roles of taint and affinity split as follows.
| Tool | Direction | Subject | Effect |
|---|---|---|---|
| nodeAffinity / nodeSelector | Pull | Pod | The Pod chooses a specific node |
| taint / toleration | Push | Node | The node rejects Pods, and a toleration exempts from that rejection |
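To make the push/pull split concrete, here is a minimal sketch of a Pod that uses both sides at once. It assumes a node that is tainted gpu=true:NoSchedule and also carries the label gpu=true; the label, the taint, and the Pod name are illustrative assumptions, not something a cluster provides by default.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pinned            # hypothetical name
spec:
  nodeSelector:
    gpu: "true"               # pull: assumes the node is labeled gpu=true
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"      # exemption: lets the Pod past the taint gpu=true:NoSchedule
  containers:
  - name: app
    image: nginx
```

With only the toleration, the Pod could still land on any untainted node. With only the nodeSelector, it would stay Pending because the taint rejects it.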
How to apply a taint #
You apply a taint with kubectl taint node. The format is key=value:effect.
# Apply a taint to a node (key=value:effect)
k taint node node01 gpu=true:NoSchedule
# Remove a taint (append - at the end)
k taint node node01 gpu=true:NoSchedule-
# Check the taints on a node
k describe node node01 | grep -i taint
# Taints: gpu=true:NoSchedule
The value can be omitted, in which case the taint holds with just the key and effect.
# A taint without a value
k taint node node01 dedicated:NoSchedule
The three effects: NoSchedule / PreferNoSchedule / NoExecute #
The strength of a taint is determined by its effect. There are three, and they differ in what they block.
| effect | New Pod scheduling | Pods already running |
|---|---|---|
| NoSchedule | placement rejected without a toleration | left alone |
| PreferNoSchedule | avoided if possible, but not forced | left alone |
| NoExecute | placement rejected without a toleration | immediately evicted without a toleration |
NoSchedule is the most common effect, and it blocks only Pods that arrive going forward. PreferNoSchedule is a weaker version, so if no other node is available, the Pod ends up placed anyway. NoExecute is the strongest. On top of blocking new Pods, it also drives out Pods without a toleration that are already running on the node. It’s the effect you reach for when you need to empty a node hard.
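It can help to see where a taint ends up once it is applied. The fragment below sketches the spec.taints list of a Node object, roughly what kubectl get node node01 -o yaml would show. The particular combination of three taints on one node is an illustrative assumption, not output from a real cluster.

```yaml
# Fragment of a Node object (illustrative; other fields omitted)
spec:
  taints:
  - key: gpu
    value: "true"
    effect: NoSchedule         # blocks new Pods that lack a matching toleration
  - key: dedicated             # a taint without a value
    effect: PreferNoSchedule   # soft: the scheduler avoids the node when it can
  - key: maintenance
    value: "true"
    effect: NoExecute          # also evicts running Pods that lack a toleration
```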
# NoExecute: evicts even already-running Pods without a toleration
k taint node node01 maintenance=true:NoExecute
How to add a toleration #
For a Pod to endure a taint, you write a toleration under spec.tolerations. It must match the taint’s key, value, and effect.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"   # endures a taint with the same key=value
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx
There are two operators.
| operator | Meaning |
|---|---|
| Equal | endures a taint where key, value, and effect all match (value required) |
| Exists | endures as long as the key (and effect) match, regardless of value (value omitted) |
Exists doesn’t look at the value, so it’s convenient for enduring all taints of a particular key at once. Omitting the key as well makes a toleration that endures every taint — typically used for workloads that must run anywhere, like a DaemonSet.
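As a sketch of the keyed form of Exists, the Pod below endures a value-less taint such as dedicated:NoSchedule. The Pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dedicated-pod         # hypothetical name
spec:
  tolerations:
  - key: "dedicated"
    operator: "Exists"        # no value field: any value (or none) on this key matches
    effect: "NoSchedule"      # leaving effect out would match every effect for this key
  containers:
  - name: app
    image: nginx
```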
# endures every taint (key omitted + Exists)
tolerations:
- operator: "Exists"
NoExecute and tolerationSeconds #
You can pair tolerationSeconds with a NoExecute taint. Giving it a value means that even a Pod with a toleration won’t stay indefinitely — it stays only for the specified number of seconds, then gets evicted.
tolerations:
- key: "maintenance"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # endures for 300 seconds, then evicted
You can see this behavior directly in how Kubernetes handles node failures. When a node goes NotReady, the control plane automatically applies the node.kubernetes.io/not-ready:NoExecute taint. A Pod that doesn’t declare its own toleration for this taint gets one injected automatically with tolerationSeconds: 300, so if a node briefly drops out and comes back, its Pods survive, and if it stays out past 5 minutes, the Pods are evicted so their controllers can recreate them on other nodes. tolerationSeconds is, in effect, the grace period between failure detection and rescheduling.
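If five minutes is too long for a workload, a Pod can declare these tolerations itself with a shorter window. The sketch below uses the same two keys the automatic tolerations cover (not-ready and unreachable); the 30-second value and the Pod name are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fast-failover         # hypothetical name
spec:
  tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30     # assumed value; the injected default would be 300
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
  containers:
  - name: app
    image: nginx
```

A shorter window speeds up failover but also means a brief network blip is more likely to trigger an eviction.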
The default taint on control plane nodes #
Control plane nodes in a cluster built with kubeadm carry a default taint. That’s why ordinary workload Pods don’t land on the control plane.
k describe node controlplane | grep -i taint
# Taints: node-role.kubernetes.io/control-plane:NoSchedule
Thanks to this taint, control plane components don’t compete with user workloads for resources. In a situation where you do need to run Pods on the control plane, such as a single-node cluster, you simply remove this taint.
# Remove the default taint on the control plane node (- at the end)
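k taint node controlplane node-role.kubernetes.io/control-plane:NoSchedule-
Conversely, control plane component Pods (kube-apiserver and so on) still run on the control plane despite the taint. They are static Pods that the kubelet starts directly from manifest files without going through the scheduler, so a NoSchedule taint never applies to them. A Pod that does go through the scheduler needs a toleration that endures this taint to land there.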
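Removing the taint opens the control plane to every workload. A narrower alternative is to leave the taint in place and give only selected Pods a toleration for it. The sketch below assumes a kubeadm cluster, where control plane nodes carry the label node-role.kubernetes.io/control-plane with an empty value; the Pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cp-tools              # hypothetical name
spec:
  nodeSelector:
    node-role.kubernetes.io/control-plane: ""   # pull: pin to control plane nodes
  tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"      # exemption from the default control plane taint
  containers:
  - name: app
    image: nginx
```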
Priority and PriorityClass #
So far we’ve dealt with where to place things. PriorityClass is a different dimension. When resources are short and you can’t run everyone, it decides who gets saved first.
PriorityClass is a cluster-scoped (non-namespaced) resource that defines an integer priority value. A Pod references this class with spec.priorityClassName, and that value becomes the Pod’s priority. The larger the value, the higher the priority.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Priority for critical workloads like payments"
| Field | Meaning |
|---|---|
| value | The priority integer. Larger is higher. User-defined values can be at most one billion (1,000,000,000); larger values are reserved for system classes |
| globalDefault | If true, the default priority for Pods without a priorityClassName. A cluster can have only one |
| preemptionPolicy | PreemptLowerPriority (default) allows preemption. Never keeps the Pod at the front of the scheduling line but does not evict others |
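As a sketch of the two fields that the earlier manifest left at their defaults, the pair of classes below sets globalDefault: true on one and preemptionPolicy: Never on the other. The names and values are hypothetical.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority      # hypothetical name
value: 1000
globalDefault: true           # applies to Pods created without a priorityClassName
description: "Baseline priority for ordinary workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-no-preempt      # hypothetical name
value: 500000
globalDefault: false
preemptionPolicy: Never       # queues ahead of lower priorities but never evicts them
description: "High priority for batch jobs that must not evict others"
```

Without any globalDefault class, a Pod that names no class gets priority 0. Adding a globalDefault class affects only Pods created afterwards; existing Pods keep the priority they were admitted with.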
On the Pod side, you reference it by name.
apiVersion: v1
kind: Pod
metadata:
  name: payment
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx
# List PriorityClasses and check their values
k get priorityclass
# NAME VALUE GLOBAL-DEFAULT
# high-priority 1000000 false
# system-cluster-critical 2000000000 false
# system-node-critical 2000001000 false
system-node-critical and system-cluster-critical are system PriorityClasses that Kubernetes creates in advance. Node-essential components like kube-proxy and the CNI typically use these classes so that, even when resources are short, they survive until the very last.
Preemption: evicting lower-priority Pods #
The moment priority truly flexes its power is preemption. When a higher-priority Pod is Pending and the cluster has no free room, the scheduler evicts lower-priority Pods to make room and places the higher-priority Pod there.
1. high-priority Pod is Pending. No node has room
2. scheduler: "Clearing this node's low-priority Pod opens up room"
3. low-priority Pod is evicted (graceful termination)
4. high-priority Pod is placed in that spot
An evicted lower-priority Pod is deleted from its node. If a controller such as a Deployment manages it, the replacement lands elsewhere if another node has room, and stays Pending if none does. In other words, preemption is a device that enforces priority order when there’s competition over resources.
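To try the four steps on a practice cluster, something like the following could serve as a starting point. It reuses the high-priority class defined earlier and adds a hypothetical low-priority class. The CPU requests are assumptions: they only trigger preemption if one filler Pod plus one urgent Pod do not fit on the node together, so adjust them to the node's allocatable CPU.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority          # hypothetical name
value: 1000
---
apiVersion: v1
kind: Pod
metadata:
  name: filler                # create first so it occupies the node
spec:
  priorityClassName: low-priority
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "1"              # assumed to take most of the node's free CPU
---
apiVersion: v1
kind: Pod
metadata:
  name: payment-urgent        # create second; it should go Pending, then preempt
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "1"
```

While the victim is terminating, k get pod payment-urgent -o wide shows the chosen node in the NOMINATED NODE column.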
Setting preemptionPolicy to Never changes the behavior. This Pod still stands at the front of the scheduling line with its high priority, but it does not evict other Pods. It fits workloads — like batch jobs — that should wait until room opens up but must not drive others out.
# stand at the front of the line but don't evict
preemptionPolicy: Never
Preemption and PodDisruptionBudget #
Preemption proceeds gracefully: the targeted Pods get their normal graceful termination period (terminationGracePeriodSeconds) before they are removed. That said, a PodDisruptionBudget (PDB) cannot fully block preemption. The scheduler tries to pick victims whose eviction does not violate a PDB, but if there’s no other way to run a higher-priority Pod, it may evict even in violation of the PDB. We’ll meet the PDB again in the resource management flow from #15 onward.
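For reference, a PDB of the kind the scheduler tries to respect might look like the sketch below. The name, the app=web label, and the minAvailable value are all hypothetical.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb               # hypothetical name
spec:
  minAvailable: 2             # assumed floor: keep at least 2 matching Pods running
  selector:
    matchLabels:
      app: web                # assumes the protected Pods carry the label app=web
```

During preemption this budget is honored on a best-effort basis only, so it reduces the chance of these Pods being chosen as victims but does not rule it out.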
How is this different from affinity #
The affinity from #13 and the tools in this post often get compared in the same breath. Summed up in a single paragraph: affinity is a Pod’s preference that pulls it toward a node, taint/toleration is a node’s rejection that pushes Pods away, and PriorityClass/preemption is the survival order when resources run short. The first two are placement problems — “which node can it go to” — and the last is a competition problem — “who stays when there isn’t enough room.” In practice you use all three together: protect GPU nodes from ordinary Pods with a taint, steer GPU workloads onto those nodes with nodeAffinity, and save the most important jobs among them first with a PriorityClass.
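The GPU scenario in the paragraph above can be sketched as a single Pod that uses all three tools. It assumes a node tainted gpu=true:NoSchedule and labeled gpu=true, plus the high-priority class from earlier in this post; the Pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-critical          # hypothetical name
spec:
  priorityClassName: high-priority    # survival order when resources run short
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"      # push side: exempt from the GPU node's taint
  affinity:
    nodeAffinity:             # pull side: require a node labeled gpu=true
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values: ["true"]
  containers:
  - name: app
    image: nginx
```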
Exam points #
- A toleration is not permission but an exemption. Even with a toleration there’s no guarantee the Pod goes to that node; to send it there, use affinity or nodeSelector alongside.
- Distinguish the three effects: NoSchedule (new Pods only), PreferNoSchedule (weak avoidance), NoExecute (evicts even already-running Pods).
- A taint is k taint node <node> key=value:effect; to remove, append - at the end.
- The control plane’s default taint is node-role.kubernetes.io/control-plane:NoSchedule. On a single node, you must remove this taint for workloads to land.
- tolerationSeconds is meaningful only with NoExecute, and the default grace for the not-ready/unreachable automatic taints is 300 seconds.
- A PriorityClass value is higher the larger it is, and globalDefault: true is allowed on only one in the cluster.
- Preemption evicts lower-priority Pods when a higher-priority Pod is Pending. With preemptionPolicy: Never, the Pod stands at the front of the line but does not evict.
Wrap-up #
What this post locked in:
- The paired behavior where a taint is a node’s rejection and a toleration is the exemption from that rejection, and the core point that a toleration does not guarantee placement
- The three effects (NoSchedule/PreferNoSchedule/NoExecute), the eviction under NoExecute, and the grace meaning of tolerationSeconds
- The default taint on control plane nodes and its removal, plus how the not-ready automatic taint behaves
- How to enforce the survival order in resource competition with PriorityClass (value/globalDefault/preemptionPolicy) and preemption
- The role difference among affinity (pull), taint (push), and priority (survival order)
With scheduling covered, next we deal with the resources that placement depends on.
Next — Resource management #
Once you’ve decided which node a Pod goes on, the next problem is how to share that node’s resources. Set resources too low and the node gets overcrowded; set them too high and other Pods can’t enter even when the node is empty.
In #15 Resource management: requests/limits, QoS, LimitRange, ResourceQuota, we’ll cover the CPU and memory a container requests and limits (requests/limits), the QoS class determined by them (Guaranteed/Burstable/BestEffort), and the LimitRange and ResourceQuota that enforce defaults and caps at the namespace level. We’ll also sort out how resource settings connect to both scheduling and eviction.