NUMA effects are one of those problems that don’t show up in dashboards, but will happily show up in your p99 latency and in “why is this box slower than the identical box next to it?”
Kubernetes can help—but only if you enable the right node-level managers and verify the result from inside the workload.
The Problem — The “Cross-NUMA” tax
On multi-socket or multi-NUMA machines, not all CPU cores are equally “close” to all memory and PCIe devices. If a workload ends up with CPUs on one NUMA node and memory (or NIC / GPU) on another, you can pay a real latency / throughput penalty.
Kubernetes won’t automatically align these resources by default; alignment is handled by kubelet’s node resource managers and coordinated by Topology Manager.
A concrete example: Intel’s Topology Manager guide shows large throughput gaps for some DPDK packet sizes when resources are NUMA-aligned vs not aligned.
The Solution — Topology Manager (plus CPU + Memory Managers)
Kubernetes’ approach is “node admission + allocation hints”:
- Hint providers (CPU Manager, Device Manager, Memory Manager) generate topology hints (NUMA bitmasks + preference).
- Topology Manager merges hints, picks the best one, and checks it against a policy (best-effort / restricted / single-numa-node).
- If the policy can’t be satisfied, the pod can be rejected and show up as `TopologyAffinityError`.
Key knobs you’ll actually use
CPU Manager:
`static` → allows Guaranteed pods with integer CPU requests to get exclusive CPUs (cpuset). (For a deep dive on this, check out my CPU Pinning guide.)
Topology Manager policy:
- `best-effort`: try to align, never reject.
- `restricted`: reject if not “preferred”.
- `single-numa-node`: reject unless all required resources can fit on one NUMA node.
Topology Manager scope:
- `container` (default): aligns container-by-container.
- `pod`: groups all containers in a pod to a common NUMA set — especially useful with `single-numa-node`.
Memory Manager:
`Static` (Linux) → provides NUMA hints to Topology Manager and enforces memory locality via `cpuset.mems` for Guaranteed pods.
1 — Configuration (kubelet)
1.1 — Minimal (CPU + Topology)
This gives you CPU/device locality alignment (and NUMA-aware admission decisions), but does not guarantee memory pages are constrained unless you also enable Memory Manager.
In kubelet config (KubeletConfiguration v1beta1), the relevant fields are documented explicitly.
Why `cpuManagerPolicy: static` matters: Kubernetes’ docs note that aligning CPU resources with other resources requires CPU Manager to be enabled with a suitable policy.
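As a sketch, the minimal `KubeletConfiguration` for this looks like the following (the reserved-CPU value is illustrative; `static` requires a non-empty CPU reservation):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Exclusive CPUs for Guaranteed pods with integer CPU requests
cpuManagerPolicy: static
# static requires a non-empty reservation (illustrative value)
reservedSystemCPUs: "0"
# Align CPU/device allocations to NUMA nodes, pod-wide
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
```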
1.2 — “Full NUMA affinity” (CPU + Topology + Memory)
If you want kubelet to constrain memory NUMA nodes for Guaranteed pods, enable Memory Manager Static. It’s documented as stable (enabled by default) in Kubernetes v1.32+, and it feeds hints into Topology Manager.
Memory Manager Static requires reserved memory configuration, and the docs call out constraints (including eviction thresholds).
Note: The default hard eviction threshold is 100MiB. Remember to increase the quantity of memory you reserve via `reservedMemory` by that hard eviction threshold; otherwise the kubelet will not start Memory Manager and will display an error.
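Putting it together, a sketch of the full configuration (reservation sizes are illustrative; the point is that the `reservedMemory` total must cover the system reservation plus the eviction threshold):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
reservedSystemCPUs: "0"
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
memoryManagerPolicy: Static
systemReserved:
  memory: "1Gi"
evictionHard:
  memory.available: "100Mi"
# Must sum to systemReserved + kubeReserved + hard eviction threshold
reservedMemory:
  - numaNode: 0
    limits:
      memory: "1124Mi"  # 1Gi + 100Mi eviction threshold (illustrative)
```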
1.3 — Optional but useful Topology Manager policy options
- `prefer-closest-numa-nodes=true` helps `best-effort` / `restricted` prefer closer NUMA sets; it’s GA (with some visibility rules depending on feature gates / version).
- `max-allowable-numa-nodes` (GA since Kubernetes 1.35) takes a node count; it exists because Topology Manager is disabled by default on nodes with >8 NUMA nodes, and raising the limit is explicitly “at your own risk” per the docs.
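In `KubeletConfiguration`, these land under `topologyManagerPolicyOptions`, which is a map of string values (the node-count value below is illustrative):

```yaml
topologyManagerPolicyOptions:
  prefer-closest-numa-nodes: "true"
  # Raise the default cap of 8 NUMA nodes (illustrative value)
  max-allowable-numa-nodes: "12"
```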
2 — Hands-On: Configuration examples by environment
2.1 — Local (Minikube)
Kubelet exposes flags for Topology Manager policy/scope and Memory Manager policy.
Example (CPU + Topology):
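One way to wire this into Minikube is via `--extra-config` kubelet overrides; a sketch (the `kube-reserved` value is illustrative, and `static` needs a non-empty CPU reservation):

```shell
# Start clean: an existing node keeps its old cpu_manager_state file
minikube delete
minikube start \
  --extra-config=kubelet.cpu-manager-policy=static \
  --extra-config=kubelet.kube-reserved=cpu=500m \
  --extra-config=kubelet.topology-manager-policy=single-numa-node \
  --extra-config=kubelet.topology-manager-scope=pod
```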
Note: many laptops/VMs report `NUMA node(s): 1`, so you won’t see meaningful NUMA placement differences locally. Use `lscpu | grep "NUMA node(s)"` to confirm.
2.2 — EKS (high-level patterns)
EKS specifics vary by node type and provisioning method, but the AWS docs are clear that launch templates can pass bootstrap arguments (including extra kubelet args) for managed nodes.
A common pattern is passing kubelet flags like:
`--cpu-manager-policy=static`, `--topology-manager-policy=single-numa-node`, `--topology-manager-scope=pod`
Those flags and allowed values are in kubelet reference.
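For AL2-based AMIs, the launch-template user data might look like this sketch (cluster name is a placeholder; AL2023 AMIs configure the kubelet via a `NodeConfig` document instead of `bootstrap.sh`):

```shell
#!/bin/bash
set -o errexit
# Pass kubelet flags through the EKS bootstrap script (AL2 AMIs)
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--cpu-manager-policy=static --topology-manager-policy=single-numa-node --topology-manager-scope=pod'
```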
2.3 — Bottlerocket nodes
Bottlerocket documents first-class settings for:
- `settings.kubernetes.topology-manager-policy`
- `settings.kubernetes.topology-manager-scope`
- `settings.kubernetes.memory-manager-policy` (+ reserved memory)
Example TOML:
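A sketch of the user-data TOML (the reserved-memory keys and values here are illustrative; check the Bottlerocket settings reference for the exact shape on your version):

```toml
[settings.kubernetes]
topology-manager-policy = "single-numa-node"
topology-manager-scope = "pod"
memory-manager-policy = "Static"

# Reserved memory per NUMA node (keys/values illustrative)
[settings.kubernetes.memory-manager-reserved-memory."0"]
enabled = true
memory = "1124Mi"
```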
Bottlerocket explicitly warns misconfiguring memory reservations can prevent kubelet from starting.
3 — Hands-on verification
3.1 — Confirm your node really has NUMA domains
On the node (or a privileged debug pod), check the NUMA node count and how CPUs map to NUMA nodes:
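For example (standard `util-linux` / `numactl` tooling; `numactl` may need installing):

```shell
# Number of NUMA nodes (1 means no cross-NUMA placement to observe)
lscpu | grep "NUMA node(s)"

# CPU -> NUMA node mapping, one row per logical CPU
lscpu --extended=CPU,NODE

# Per-node memory sizes and inter-node distances, if numactl is installed
command -v numactl >/dev/null && numactl --hardware || true
```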
3.2 — Deploy a “regular” pod vs a “NUMA-eligible” pod
Regular pod (shared pool / not eligible for exclusive CPU):
NUMA-eligible pod (Guaranteed + integer CPU, and (optionally) memory that Memory Manager can act on):
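As a sketch (pod names and image are placeholders), the two specs might be:

```yaml
# Regular pod: Burstable QoS (no limits), stays in the shared CPU pool
apiVersion: v1
kind: Pod
metadata:
  name: regular-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
---
# NUMA-eligible pod: Guaranteed QoS (requests == limits), integer CPUs
apiVersion: v1
kind: Pod
metadata:
  name: numa-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "2"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "1Gi"
```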
Why this matters:
- CPU Manager static only gives exclusive CPUs to Guaranteed pods with integer CPU requests. (See the CPU Pinning guide for the exact mechanics).
- Memory Manager Static only guarantees/enforces memory placement for Guaranteed pods.
3.3 — Verify CPU + memory NUMA constraints from inside the container
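Both values are visible in `/proc/self/status`; for example, from a shell inside the container (e.g. via `kubectl exec -it numa-pod -- sh`, pod name illustrative):

```shell
# Narrow Cpus_allowed_list => exclusive CPUs; narrow Mems_allowed_list => NUMA-pinned memory
grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/self/status
```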
Interpretation:
- CPU pinning shows up as a narrower `Cpus_allowed_list` when CPU Manager allocates exclusive CPUs.
- Memory NUMA pinning shows up as a narrower `Mems_allowed_list` only if Memory Manager (or another mechanism) constrained `cpuset.mems`. Memory Manager docs explicitly mention it enforces `cpuset.mems` on Linux.
4 — Troubleshooting (what failure looks like)
4.1 — TopologyAffinityError
Kubernetes documents this exact failure mode and how it appears:
A pod can show STATUS: TopologyAffinityError, and describe / events will mention topology locality allocation failures.
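To spot it, the usual commands suffice:

```shell
# Rejected pods surface the error in the STATUS column
kubectl get pods

# Events name the resources that could not be NUMA-aligned
kubectl describe pod <pod-name>
```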
4.2 — Known limitation: scheduler is not topology-aware
Topology Manager can reject a pod after it’s scheduled to a node; Kubernetes docs call this out as a limitation.
4.3 — Policy options and “distance”
Topology Manager is not NUMA-distance-aware unless you enable prefer-closest-numa-nodes.
5 — Operational footgun: stateful managers
Just like CPU Manager (which requires deleting /var/lib/kubelet/cpu_manager_state when changing policies, as discussed in the CPU Pinning guide), Memory Manager is also stateful.
For Memory Manager, Kubernetes troubleshooting docs point to /var/lib/kubelet/memory_manager_state as the internal state dump you can inspect while debugging topology management. If you change memory manager policies, you must drain the node, stop kubelet, and clear this state file.
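The reset sequence is roughly the following sketch (node name is a placeholder; the `rm` runs on the node itself):

```shell
kubectl drain my-node --ignore-daemonsets --delete-emptydir-data
systemctl stop kubelet
rm -f /var/lib/kubelet/memory_manager_state  # cpu_manager_state for CPU Manager
systemctl start kubelet
kubectl uncordon my-node
```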
Conclusion
If you want real NUMA affinity characteristics in Kubernetes, the most reliable stack is:
- CPU Manager static (exclusive CPUs for eligible Guaranteed pods)
- Topology Manager `single-numa-node` + `pod` scope (strict NUMA admission, pod-wide grouping)
- Memory Manager `Static` (enforce memory locality via `cpuset.mems` for Guaranteed pods)
And then: verify from inside the container (`Cpus_allowed_list`, `Mems_allowed_list`) and watch for `TopologyAffinityError` when policies can’t be satisfied.
