Certified Kubernetes Administrator (CKA) #14 Scheduling 2: Taints/tolerations, Priority/PriorityClass, preemption

In #13 Scheduling 1 we covered the mechanism by which a Pod chooses a node through nodeSelector and affinity. This post runs in the opposite direction. We cover taints, where a node pushes Pods away, plus PriorityClass and preemption, which decide who gets saved first when resources run short.

If affinity is a Pod’s pull, a taint is a node’s push. The two face opposite directions, so keeping them together lets you control node placement from both sides. Add priority on top, and when a node is full the scheduler will evict lower-priority Pods to admit higher-priority ones — that’s preemption. From an operator’s point of view, we’ll work through each one hands-on with YAML and kubectl.

Taint and Toleration: the node rejects and the Pod accepts

A taint and a toleration work as a pair. A taint is a rejection mark placed on a node, and a toleration is a Pod’s declaration that it will endure that rejection. When a node carries a taint, only Pods with a toleration that endures that taint may enter the node.

Node:  "I've placed a rejection mark called key=value:NoSchedule"
Pod A: no toleration         → cannot enter this node
Pod B: endures the same taint → can enter this node

The key point here is that a toleration is an exemption, not a placement instruction. A Pod with a toleration doesn’t necessarily go to that node; it is merely exempted from the node’s rejection and can still be scheduled onto any other node that accepts it. To actually send a Pod to that node, you also need the nodeAffinity or nodeSelector from #13. This distinction is a frequent point of confusion in the exam.

The roles of taint and affinity split as follows.

| Tool | Direction | Subject | Effect |
| --- | --- | --- | --- |
| nodeAffinity / nodeSelector | Pull | Pod | The Pod chooses a specific node |
| taint / toleration | Push | Node | The node rejects Pods, and a toleration exempts a Pod from that rejection |
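To make the exemption-versus-placement distinction concrete, here is a minimal sketch of a Pod that uses both sides at once (the toleration syntax itself is explained further down). It assumes node01 carries the gpu=true:NoSchedule taint and has also been labeled gpu=true, for example with k label node node01 gpu=true. The label and the Pod name are illustrative assumptions, not part of the original setup.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pinned          # hypothetical name
spec:
  nodeSelector:
    gpu: "true"             # pull: only nodes labeled gpu=true (assumed label)
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"    # exemption from the gpu=true:NoSchedule taint
  containers:
  - name: app
    image: nginx
```

With only the tolerations block, this Pod could land on node01 or on any untainted node. With only the nodeSelector, it would stay Pending because the tainted node rejects it.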

How to apply a taint

You apply a taint with kubectl taint node. The format is key=value:effect. The examples below use k as the shell alias for kubectl.

# Apply a taint to a node (key=value:effect)
k taint node node01 gpu=true:NoSchedule

# Remove a taint (append - at the end)
k taint node node01 gpu=true:NoSchedule-

# Check the taints on a node
k describe node node01 | grep -i taint
# Taints:  gpu=true:NoSchedule

The value can be omitted, in which case the taint holds with just the key and effect.

# A taint without a value
k taint node node01 dedicated:NoSchedule
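A Pod that should be exempt from a valueless taint like this one has nothing to compare a value against, so one reasonable way to write the matching toleration is with the Exists operator. This is a sketch of just the tolerations fragment, assuming the dedicated:NoSchedule taint from the command above.

```yaml
tolerations:
- key: "dedicated"
  operator: "Exists"      # match on the key alone; no value is given
  effect: "NoSchedule"
```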

The three effects: NoSchedule / PreferNoSchedule / NoExecute

The strength of a taint is determined by its effect. There are three, and they differ in what they block.

| effect | New Pod scheduling | Pods already running |
| --- | --- | --- |
| NoSchedule | placement rejected without a toleration | left alone |
| PreferNoSchedule | avoided if possible, but not forced | left alone |
| NoExecute | placement rejected without a toleration | evicted immediately without a toleration |

NoSchedule is the most common effect, and it blocks only Pods that arrive from now on. PreferNoSchedule is a soft version: the scheduler tries to avoid the node, but if no other node fits, the Pod is placed there anyway. NoExecute is the strongest. On top of blocking new Pods, it also evicts Pods without a toleration that are already running on the node. It’s the effect to reach for when a node must be emptied forcibly.

# NoExecute: evicts even already-running Pods without a toleration
k taint node node01 maintenance=true:NoExecute
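Taints are stored on the Node object under spec.taints, so the three effects can be compared side by side there. The fragment below is an illustrative sketch of what a trimmed k get node node01 -o yaml might show, not captured output. The spot key is a hypothetical example added to show PreferNoSchedule.

```yaml
# Trimmed, illustrative fragment of a Node object
spec:
  taints:
  - key: gpu
    value: "true"
    effect: NoSchedule          # blocks new Pods without a toleration
  - key: spot                   # hypothetical key
    value: "true"
    effect: PreferNoSchedule    # scheduler avoids the node when it can
  - key: maintenance
    value: "true"
    effect: NoExecute           # also evicts running Pods without a toleration
```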

How to add a toleration

For a Pod to tolerate a taint, you write a toleration under spec.tolerations. It must match the taint’s key and effect, and with the Equal operator its value as well.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"     # endures a taint with the same key=value
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx

There are two operators.

| operator | Meaning |
| --- | --- |
| Equal | tolerates a taint whose key, value, and effect all match (value required). This is the default when operator is omitted |
| Exists | tolerates a taint as long as the key (and effect) match, regardless of value (value omitted) |

Exists doesn’t look at the value, so it’s convenient for enduring all taints of a particular key at once. Omitting the key as well makes a toleration that endures every taint — typically used for workloads that must run anywhere, like a DaemonSet.

# endures every taint (key omitted + Exists)
tolerations:
- operator: "Exists"
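Between matching one exact taint and matching every taint there is a middle ground: keep the key and loosen the rest. The sketch below reuses the gpu and maintenance keys from earlier examples. The second entry relies on the rule that an omitted effect matches every effect for that key.

```yaml
tolerations:
# Any value of gpu, but only the NoSchedule effect
- key: "gpu"
  operator: "Exists"
  effect: "NoSchedule"
# Any value of maintenance and any effect (effect omitted)
- key: "maintenance"
  operator: "Exists"
```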

NoExecute and tolerationSeconds

You can pair tolerationSeconds with a NoExecute taint. Giving it a value means that even a Pod with a toleration won’t stay indefinitely — it stays only for the specified number of seconds, then gets evicted.

tolerations:
- key: "maintenance"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # endures for 300 seconds, then evicted

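You can see this behavior directly in how Kubernetes handles node failures. When a node goes NotReady, the node controller automatically applies the node.kubernetes.io/not-ready:NoExecute taint (or node.kubernetes.io/unreachable:NoExecute when the node stops reporting altogether). Unless a Pod declares its own toleration for these taints, the DefaultTolerationSeconds admission controller injects one with tolerationSeconds: 300. DaemonSet Pods are the exception: they tolerate both taints with no time limit. So if a node briefly drops out and comes back, its Pods survive; if it stays out past 5 minutes, the Pods are evicted and their controllers (a Deployment, for example) create replacements on other nodes. tolerationSeconds is, in effect, the grace period between failure detection and rescheduling.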
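Because the 300-second default is only filled in when a Pod has no toleration of its own for these taints, a workload that needs faster failover can declare shorter ones. This is a sketch under that assumption. The Pod name and the 30-second value are illustrative choices.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fast-failover       # hypothetical name
spec:
  tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30   # evicted 30s after the taint appears, not 300s
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
  containers:
  - name: app
    image: nginx
```

A very short value trades stability for speed: a brief network blip on the node is then enough to trigger eviction.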

The default taint on control plane nodes

Control plane nodes in a cluster built with kubeadm carry a default taint. That’s why ordinary workload Pods don’t land on the control plane.

k describe node controlplane | grep -i taint
# Taints:  node-role.kubernetes.io/control-plane:NoSchedule

Thanks to this taint, control plane components don’t compete with user workloads for resources. In a situation where you do need to run Pods on the control plane — like a single-node cluster — you simply remove this taint.

# Remove the default taint on the control plane node (- at the end)
k taint node controlplane node-role.kubernetes.io/control-plane:NoSchedule-

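Conversely, control plane component Pods (kube-apiserver and so on) still run on the control plane despite the taint because they are static Pods: the kubelet starts them directly from manifests on the node, so they never pass through the scheduler and a NoSchedule taint does not apply to them. Scheduled Pods that must run there, such as the kube-proxy DaemonSet Pods, carry a toleration for the taint instead.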
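Removing the taint opens the control plane to every Pod. A narrower alternative is to leave the taint in place and give one workload a toleration for it. The sketch below assumes a kubeadm cluster, where control plane nodes also carry the node-role.kubernetes.io/control-plane label with an empty value. The Pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: on-controlplane     # hypothetical name
spec:
  nodeSelector:
    node-role.kubernetes.io/control-plane: ""   # kubeadm's label (assumed present)
  tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"    # exemption from the default control plane taint
  containers:
  - name: app
    image: nginx
```

The nodeSelector is what actually steers the Pod there; the toleration only lifts the rejection.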

Priority and PriorityClass

So far we’ve dealt with where to place things. PriorityClass is a different dimension. When resources are short and you can’t run everyone, it decides who gets saved first.

PriorityClass is a cluster-scoped (non-namespaced) resource that defines an integer priority value. A Pod references this class with spec.priorityClassName, and that value becomes the Pod’s priority. The larger the value, the higher the priority.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Priority for critical workloads like payments"

| Field | Meaning |
| --- | --- |
| value | The priority integer. Larger is higher. User-defined classes may use values up to 1,000,000,000 (one billion); larger values are reserved for system classes |
| globalDefault | If true, this class supplies the default priority for Pods without a priorityClassName. Only one class in a cluster can set it |
| preemptionPolicy | PreemptLowerPriority (default) performs preemption. Never still places the Pod ahead of lower-priority Pods in the scheduling queue but does not evict |
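For contrast with the high-priority class, here is a minimal sketch of a class that acts as the cluster-wide default. The name and the value 1000 are illustrative assumptions. Without any globalDefault class, Pods that name no class get priority 0.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority    # hypothetical name
value: 1000
globalDefault: true         # used by Pods created without priorityClassName
description: "Cluster-wide default for Pods that do not name a class"
```

A new default affects only Pods created after the class exists; Pods that are already running keep the priority they were admitted with.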

On the Pod side, you reference it by name.

apiVersion: v1
kind: Pod
metadata:
  name: payment
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx

# List PriorityClasses and check their values
k get priorityclass
# NAME                      VALUE        GLOBAL-DEFAULT
# high-priority             1000000      false
# system-cluster-critical   2000000000   false
# system-node-critical      2000001000   false

system-node-critical and system-cluster-critical are system PriorityClasses that Kubernetes creates in advance. Node-essential components like kube-proxy and the CNI typically use these classes so that, when resources run short, they are the last to be preempted.

Preemption: evicting lower-priority Pods

The moment priority truly flexes its power is preemption. When a higher-priority Pod is Pending and the cluster has no free room, the scheduler evicts lower-priority Pods to make room and places the higher-priority Pod there.

1. high-priority Pod is Pending. No node has room
2. scheduler: "Clearing this node's low-priority Pod opens up room"
3. low-priority Pod is evicted (graceful termination)
4. high-priority Pod is placed in that spot

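An evicted lower-priority Pod is deleted, not moved. If it belongs to a controller such as a Deployment, the controller creates a replacement, which lands on another node if one has room and stays Pending if none does; a bare Pod is simply gone. In other words, preemption is a device that enforces priority order when there’s competition over resources.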
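While the victims from step 3 are still terminating, the waiting Pod records which node the scheduler picked for it. The fragment below is an illustrative sketch of the relevant fields in k get pod payment -o yaml, not captured output. It reuses the payment Pod and node01 from earlier examples.

```yaml
# Trimmed, illustrative fragment of the pending high-priority Pod
spec:
  priorityClassName: high-priority
  priority: 1000000             # resolved from the PriorityClass at admission
status:
  phase: Pending
  nominatedNodeName: node01     # node where lower-priority Pods are being cleared
```

The nomination is a hint rather than a reservation: if room opens on another node first, the scheduler may place the Pod there instead.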

Setting preemptionPolicy to Never changes the behavior. This Pod still stands at the front of the scheduling line with its high priority, but it does not evict other Pods. It fits workloads — like batch jobs — that should wait until room opens up but must not drive others out.

# stand at the front of the line but don't evict
preemptionPolicy: Never
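In context, that one line belongs to a PriorityClass of its own. A minimal sketch follows; the class name and description are illustrative, and the value simply mirrors high-priority from earlier.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # hypothetical name
value: 1000000
preemptionPolicy: Never               # queue ahead of lower priorities, never evict
globalDefault: false
description: "High priority for batch jobs that must wait for free room"
```

Pods that reference this class can still be preempted themselves by Pods with a higher priority.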

Preemption and PodDisruptionBudget

Preemption proceeds gracefully: each targeted Pod gets its terminationGracePeriodSeconds to shut down. That said, a PodDisruptionBudget (PDB) cannot fully block preemption. The scheduler honors PDBs only on a best-effort basis: it looks for victims whose removal does not violate a PDB, but if there’s no other way to run a higher-priority Pod, it may evict even in violation of the PDB. We’ll meet the PDB again in the resource management flow from #15 onward.
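For reference, a PDB is a small object of its own. The sketch below assumes a workload whose Pods are labeled app: web. The name, the label, and the minAvailable value are hypothetical.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb             # hypothetical name
spec:
  minAvailable: 2           # keep at least 2 matching Pods running
  selector:
    matchLabels:
      app: web              # hypothetical label
```

During preemption the scheduler prefers victims that leave this budget intact, but as described above the budget is not a hard guarantee.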

How is this different from affinity

The affinity from #13 and the tools in this post often get compared in the same breath. Summed up in a single paragraph: affinity is a Pod’s preference that pulls it toward a node, taint/toleration is a node’s rejection that pushes Pods away, and PriorityClass/preemption is the survival order when resources run short. The first two are placement problems — “which node can it go to” — and the last is a competition problem — “who stays when there isn’t enough room.” In practice you use all three together: protect GPU nodes from ordinary Pods with a taint, steer GPU workloads onto those nodes with nodeAffinity, and save the most important jobs among them first with a PriorityClass.
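The three-tool combination described above can be sketched in a single Pod spec. It assumes the GPU nodes carry both the gpu=true:NoSchedule taint and a gpu=true label, and that the high-priority PriorityClass from earlier exists. The Pod name and the label are illustrative assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-critical        # hypothetical name
spec:
  priorityClassName: high-priority      # survival order under contention
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"                # push: exempt from the GPU node taint
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu                    # assumed node label
            operator: In
            values: ["true"]            # pull: only GPU-labeled nodes
  containers:
  - name: app
    image: nginx
```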

Exam points

  • A toleration is an exemption, not a placement instruction. Even with a toleration there’s no guarantee the Pod goes to that node; to send it there, use affinity or nodeSelector alongside.
  • Distinguish the three effects: NoSchedule (new Pods only), PreferNoSchedule (weak avoidance), NoExecute (evicts even already-running Pods).
  • A taint is k taint node <node> key=value:effect; to remove, append - at the end.
  • The control plane’s default taint is node-role.kubernetes.io/control-plane:NoSchedule. On a single-node cluster, workloads land only if you remove this taint or give them a matching toleration.
  • tolerationSeconds is meaningful only with NoExecute, and the default grace for the not-ready/unreachable automatic taints is 300 seconds.
  • A PriorityClass value is higher the larger it is, and globalDefault: true is allowed on only one in the cluster.
  • Preemption evicts lower-priority Pods when a higher-priority Pod is Pending. With preemptionPolicy: Never, the Pod stands at the front of the line but does not evict.

Wrap-up

What this post locked in:

  • The paired behavior where a taint is a node’s rejection and a toleration is the exemption from that rejection, and the core point that a toleration does not guarantee placement
  • The three effects (NoSchedule/PreferNoSchedule/NoExecute), the eviction under NoExecute, and the role of tolerationSeconds as a grace period
  • The default taint on control plane nodes and its removal, plus how the not-ready automatic taint behaves
  • How to enforce the survival order in resource competition with PriorityClass (value/globalDefault/preemptionPolicy) and preemption
  • The role difference among affinity (pull), taint (push), and priority (survival order)

With scheduling covered, next we deal with the resources that placement depends on.

Next: Resource management

Once you’ve decided which node a Pod goes on, the next problem is how to share that node’s resources. Set requests too low and the node gets overcrowded; set them too high and other Pods can’t be scheduled even when the node is mostly idle.

In #15 Resource management: requests/limits, QoS, LimitRange, ResourceQuota, we’ll cover the CPU and memory a container requests and limits (requests/limits), the QoS class determined by them (Guaranteed/Burstable/BestEffort), and the LimitRange and ResourceQuota that enforce defaults and caps at the namespace level. We’ll also sort out how resource settings connect to both scheduling and eviction.
