NUMA effects are one of those problems that don’t show up in dashboards, but will happily show up in your p99 latency and in “why is this box slower than the identical box next to it?”
Kubernetes can help—but only if you enable the right node-level managers and verify the result from inside the workload.
What is NUMA?
Modern multi-socket servers split memory banks across CPU sockets. Each socket and its directly attached memory form a NUMA node (Non-Uniform Memory Access). Accessing memory on your own socket is fast (local); crossing the interconnect to another socket’s memory is slower (remote).
The distance ratio is typically 1.2–1.5× on 2-socket x86 servers, but can exceed 2× on 4-socket systems or cross-die AMD EPYC. For most stateless web services this is invisible. For latency-sensitive workloads — DPDK packet processing, ML inference, HPC simulations — it’s the difference between meeting your SLA and not.
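To see what your own hardware reports, the kernel exposes the firmware's relative distance table under /sys/devices/system/node/nodeN/distance (one row per node, with the local distance normalized to 10). Here is a minimal Python sketch that turns those rows into a worst-case remote/local ratio. The fallback sample rows are made up for illustration, and these numbers are firmware-declared relative costs, not measured latencies:

```python
from pathlib import Path


def parse_distance_rows(rows):
    """Turn rows like '10 21' (one per NUMA node) into a matrix of ints."""
    return [[int(value) for value in row.split()] for row in rows]


def worst_remote_ratio(matrix):
    """Largest remote/local ratio; the local distance is the row minimum."""
    return max(max(row) / min(row) for row in matrix)


def read_distance_rows(sysfs_root="/sys/devices/system/node"):
    """Read each node's distance row from sysfs, ordered by node id."""
    files = Path(sysfs_root).glob("node[0-9]*/distance")
    ordered = sorted(files, key=lambda p: int(p.parent.name[4:]))
    return [p.read_text().strip() for p in ordered]


if __name__ == "__main__":
    # Fall back to a made-up 2-node table when sysfs has no NUMA info.
    rows = read_distance_rows() or ["10 21", "21 10"]
    print(worst_remote_ratio(parse_distance_rows(rows)))
```

A single-node machine reports one row with one value, so the ratio is 1.0 and there is nothing to align.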
The Problem — The “Cross-NUMA” tax
On multi-socket or multi-NUMA machines, not all CPU cores are equally “close” to all memory and PCIe devices. If a workload ends up with CPUs on one NUMA node and memory (or NIC / GPU) on another, you can pay a real latency / throughput penalty.
Kubernetes doesn’t align these resources by default; alignment is handled by the kubelet’s node resource managers and coordinated by Topology Manager.
A concrete example: Intel’s Topology Manager guide shows large throughput gaps for some DPDK packet sizes when resources are NUMA-aligned vs not aligned.
The Solution — Topology Manager (plus CPU + Memory Managers)
Kubernetes’ approach is “node admission + allocation hints”:
- Hint providers (CPU Manager, Device Manager, Memory Manager) generate topology hints (NUMA bitmasks + preference).
- Topology Manager merges hints, picks the best one, and checks it against a policy (best-effort / restricted / single-numa-node).
- If the policy can’t be satisfied, the pod can be rejected and show up as TopologyAffinityError.
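The merge step is easier to reason about with a toy model. The sketch below is a simplified illustration in Python, not the kubelet's actual code: the Hint type, merge_hints, and admit are hypothetical names, a hint is a NUMA bitmask plus a preferred flag, and returning None stands in for "no common NUMA set was found".

```python
from itertools import product
from typing import NamedTuple, Optional


class Hint(NamedTuple):
    mask: int        # bit i set => NUMA node i is acceptable
    preferred: bool  # provider considers this placement ideal


def _bits(mask):
    return bin(mask).count("1")


def merge_hints(providers):
    """AND one hint per provider together and keep the best merged result."""
    if not providers:
        return None
    best: Optional[Hint] = None
    for combo in product(*providers):
        mask = combo[0].mask
        for hint in combo[1:]:
            mask &= hint.mask
        if mask == 0:
            continue  # these hints share no NUMA node
        merged = Hint(mask, all(h.preferred for h in combo))
        if best is None:
            best = merged
        elif merged.preferred != best.preferred:
            best = merged if merged.preferred else best
        elif _bits(merged.mask) < _bits(best.mask):
            best = merged  # narrower mask wins among equals
    return best


def admit(policy, best):
    """Apply the admission rule for each policy to the merged hint."""
    if policy == "best-effort":
        return True
    if best is None:
        return False
    if policy == "restricted":
        return best.preferred
    if policy == "single-numa-node":
        return best.preferred and _bits(best.mask) == 1
    raise ValueError(f"unknown policy: {policy}")
```

For example, if the CPU provider can place a container on node 0 or node 1 but the device provider only has a free device on node 1, the merged hint is node 1 and every policy admits the pod. If the two providers have no node in common, only best-effort admits it.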
Key knobs you’ll actually use
CPU Manager:
- static → allows Guaranteed pods with integer CPU requests to get exclusive CPUs (cpuset). (For a deep dive on this, check out my CPU Pinning guide.)
Topology Manager policy:
- best-effort: try to align, never reject.
- restricted: reject if not “preferred”.
- single-numa-node: reject unless all required resources can fit on one NUMA node.
Topology Manager scope:
- container (default) aligns container-by-container.
- pod groups all containers in a pod to a common NUMA set, which is especially useful with single-numa-node.
Memory Manager:
- Static (Linux) → provides NUMA hints to Topology Manager and enforces memory locality via cpuset.mems for Guaranteed pods.
1 — Configuration (kubelet)
1.1 — Minimal (CPU + Topology)
This gives you CPU/device locality alignment (and NUMA-aware admission decisions), but does not guarantee memory pages are constrained unless you also enable Memory Manager.
In kubelet config (KubeletConfiguration v1beta1), the relevant fields are documented explicitly.
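As a rough sketch of what that minimal configuration contains, the Python below builds the fragment as a dict and prints it as JSON (the kubelet config file accepts JSON as well as YAML). The helper name and the reserved CPU value "0" are placeholders I chose for illustration; the static CPU Manager policy needs some non-zero CPU reservation, so adjust that to your node.

```python
import json

POLICIES = {"none", "best-effort", "restricted", "single-numa-node"}
SCOPES = {"container", "pod"}


def kubelet_numa_config(policy="best-effort", scope="container",
                        reserved_cpus="0"):
    """Build a KubeletConfiguration fragment for CPU + Topology alignment."""
    if policy not in POLICIES:
        raise ValueError(f"unknown topology manager policy: {policy}")
    if scope not in SCOPES:
        raise ValueError(f"unknown topology manager scope: {scope}")
    return {
        "apiVersion": "kubelet.config.k8s.io/v1beta1",
        "kind": "KubeletConfiguration",
        # Exclusive CPUs for Guaranteed pods with integer CPU requests.
        "cpuManagerPolicy": "static",
        # Placeholder: CPUs kept back for system daemons and the kubelet.
        "reservedSystemCPUs": reserved_cpus,
        "topologyManagerPolicy": policy,
        "topologyManagerScope": scope,
    }


if __name__ == "__main__":
    print(json.dumps(kubelet_numa_config("single-numa-node", "pod"), indent=2))
```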
Why cpuManagerPolicy: static matters: Kubernetes’ docs note that aligning CPU resources with other resources requires CPU Manager to be enabled with a suitable policy.
1.2 — “Full NUMA affinity” (CPU + Topology + Memory)
If you want kubelet to constrain memory NUMA nodes for Guaranteed pods, enable Memory Manager Static. It’s documented as stable (enabled by default) in Kubernetes v1.32+, and it feeds hints into Topology Manager.
Memory Manager Static requires reserved memory configuration, and the docs call out constraints (including eviction thresholds).
Note: The default hard eviction threshold is 100MiB. Remember to increase the quantity of memory that you reserve by setting reservedMemory by that hard eviction threshold. Otherwise, the kubelet will not start Memory Manager and display an error.
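The arithmetic behind that note: the memory you reserve across all NUMA nodes has to add up to kube-reserved + system-reserved + the hard eviction threshold. A small Python sketch of the check follows; the function names are mine, and the quantity parser only handles the Ki/Mi/Gi suffixes, which is a simplification of real Kubernetes quantities.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity):
    """Convert '100Mi' or '4Gi' style quantities to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already bytes


def required_reserved_bytes(kube_reserved, system_reserved,
                            eviction_hard="100Mi"):
    """Total memory that reservedMemory must cover across all NUMA nodes."""
    return (to_bytes(kube_reserved) + to_bytes(system_reserved)
            + to_bytes(eviction_hard))


def reserved_memory_matches(reserved_per_node, kube_reserved,
                            system_reserved, eviction_hard="100Mi"):
    """True if the per-NUMA-node reservations sum to the required total."""
    total = sum(to_bytes(q) for q in reserved_per_node.values())
    return total == required_reserved_bytes(
        kube_reserved, system_reserved, eviction_hard)


if __name__ == "__main__":
    # 4Gi + 1Gi + 100Mi = 5220Mi, split here as 3Gi on node 0, 2148Mi on node 1.
    print(reserved_memory_matches({0: "3Gi", 1: "2148Mi"}, "4Gi", "1Gi"))
```

Forgetting the eviction threshold (for example reserving 3Gi + 2Gi against the same kube/system reservations) makes the check fail, which is the case the note warns about.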
1.3 — Optional but useful Topology Manager policy options
- prefer-closest-numa-nodes=true: for best-effort and restricted policies, this makes Topology Manager take NUMA distance into account and favor sets of NUMA nodes that are closer together when more than one node is needed. Reduces cross-NUMA traffic without hard rejection. Introduced in Kubernetes v1.26 and GA since v1.32.
- max-allowable-numa-nodes: by default Topology Manager supports at most 8 NUMA nodes (a safeguard against combinatorial explosion in hint calculation). Set this policy option to raise the limit. Per the docs, enabling it on high-NUMA-count nodes is “at your own risk.”
- full-pcpus-only=true (CPU Manager option): allocates whole physical cores rather than individual hyperthreads. Useful when you want to avoid hyperthreading interference between tenants.
- align-by-socket=true (CPU Manager option): aligns CPU allocations to socket boundaries rather than just NUMA node boundaries. Relevant on systems where a socket contains multiple NUMA nodes (common on AMD EPYC).
1.4 — Which policy to pick
| Situation | Recommended policy |
|---|---|
| Latency-sensitive, fits on one NUMA node | single-numa-node + pod scope |
| Large workload, NUMA alignment preferred but not mandatory | restricted + prefer-closest-numa-nodes=true |
| Dev/test or you just want observability | best-effort |
| ML/GPU workload (GPU on a specific NUMA node) | single-numa-node + Device Manager |
Start with best-effort to observe what Topology Manager would do, then tighten to restricted or single-numa-node once you’ve confirmed nodes have enough NUMA-local resources for your pod sizes.
2 — Hands-On: Configuration examples by environment
2.1 — Local (Minikube)
Kubelet exposes flags for Topology Manager policy/scope and Memory Manager policy.
Example (CPU + Topology):
Note: many laptops/VMs report NUMA node(s): 1, so you won’t see meaningful NUMA placement differences locally. Use lscpu | grep "NUMA node(s)" to confirm.
2.2 — EKS (managed node groups via launch template)
EKS managed nodes run AL2 or AL2023. You inject extra kubelet flags via the bootstrap script in your launch template user data.
AL2 (amazon-eks-node-*):
AL2023 (nodeadm config):
Note: Topology Manager + Memory Manager only make sense on multi-NUMA instance types. Relevant EKS instance families: c5n, m5n, r5n (dual-socket), and the larger bare-metal instances like c5.metal or m6i.metal. Single-socket instances (most t3, m5, etc.) will still have a single NUMA node.
2.3 — EKS with Karpenter
If you use Karpenter for node provisioning, inject kubelet configuration via the NodeClass:
Pair this with a NodePool that selects multi-NUMA instance types (e.g., instance-family: [c5n, m5n, r5n]). Running single-numa-node on a single-NUMA instance wastes nothing, but it will still produce TopologyAffinityError if a pod requests more CPUs than are available in the shared pool.
2.4 — Bottlerocket nodes
Bottlerocket documents first-class settings for:
- settings.kubernetes.topology-manager-policy
- settings.kubernetes.topology-manager-scope
- settings.kubernetes.memory-manager-policy (+ reserved memory)
Example TOML:
Bottlerocket explicitly warns that misconfiguring memory reservations can prevent kubelet from starting.
3 — Hands-on verification
3.1 — Confirm your node really has NUMA domains
On the node (or a privileged debug pod), check:
Also useful to see how CPUs map to NUMA nodes:
To check memory allocation per NUMA node (whether memory is actually local):
If numastat -p <pid> shows most of your workload’s memory sitting on a node other than the one its CPUs run on (or the system-wide numa_miss / numa_foreign counters keep climbing), you have a cross-NUMA memory placement problem: either Memory Manager isn’t configured, or the pod isn’t Guaranteed QoS.
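If you would rather script the "how many NUMA nodes, and which CPUs belong to each" check, the same information lives in sysfs. This is a minimal Python sketch that reads /sys/devices/system/node/nodeN/cpulist; the function name is mine, and it returns an empty dict when the directory does not exist (for example on non-Linux machines).

```python
import re
from pathlib import Path


def numa_cpulists(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node id -> CPU list string, e.g. {0: '0-15', 1: '16-31'}."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    nodes = {}
    for entry in root.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        cpulist = entry / "cpulist"
        if match and cpulist.is_file():
            nodes[int(match.group(1))] = cpulist.read_text().strip()
    return dict(sorted(nodes.items()))


if __name__ == "__main__":
    topology = numa_cpulists()
    print(f"NUMA nodes: {len(topology)}")
    for node_id, cpus in topology.items():
        print(f"  node{node_id}: CPUs {cpus}")
```

If this prints a single node, none of the Topology Manager policies will change placement on that machine.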
3.2 — Deploy a “regular” pod vs a “NUMA-eligible” pod
Regular pod (shared pool / not eligible for exclusive CPU):
NUMA-eligible pod (Guaranteed + integer CPU, and (optionally) memory that Memory Manager can act on):
Why this matters:
- CPU Manager static only gives exclusive CPUs to Guaranteed pods with integer CPU requests. (See the CPU Pinning guide for the exact mechanics).
- Memory Manager Static only guarantees/enforces memory placement for Guaranteed pods.
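Those two eligibility rules can be written down as a quick pre-flight check on a pod spec. The Python sketch below is a simplification under stated assumptions: it ignores init containers, compares memory quantities as plain strings, and the function names are hypothetical rather than part of any Kubernetes client.

```python
def cpu_millis(quantity):
    """Convert CPU quantities like '2', '1.5' or '1500m' to millicores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def is_guaranteed(pod_spec):
    """Every container sets cpu+memory limits, and requests equal limits."""
    containers = pod_spec.get("containers", [])
    if not containers:
        return False
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        if "cpu" not in limits or "memory" not in limits:
            return False
        # Omitted requests default to the limits.
        if cpu_millis(requests.get("cpu", limits["cpu"])) != cpu_millis(limits["cpu"]):
            return False
        if str(requests.get("memory", limits["memory"])) != str(limits["memory"]):
            return False
    return True


def exclusive_cpu_containers(pod_spec):
    """Names of containers that qualify for exclusive CPUs under static policy."""
    if not is_guaranteed(pod_spec):
        return []
    names = []
    for container in pod_spec["containers"]:
        millis = cpu_millis(container["resources"]["limits"]["cpu"])
        if millis > 0 and millis % 1000 == 0:  # whole CPUs only
            names.append(container["name"])
    return names
```

A container with cpu: "2" in both requests and limits qualifies; one with cpu: "1500m" is still Guaranteed but stays in the shared pool.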
HugePages and NUMA: For HPC and DPDK workloads, hugepages are almost always paired with NUMA affinity. Hugepages can be pre-allocated per NUMA node (/sys/devices/system/node/node0/hugepages/), and Memory Manager is aware of hugepage NUMA placement when hugepages-<size> resources are requested in the pod spec alongside regular memory. If your workload uses hugepages-1Gi or hugepages-2Mi, add them to both requests and limits (required for Guaranteed QoS) and they’ll be included in Topology Manager’s hint calculation.
3.3 — Verify CPU + memory NUMA constraints from inside the container
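One way to do this check from inside the container is to read the two allow-lists out of /proc/self/status. The Python sketch below parses them into plain lists of ids; the helper names are mine, and the sample text in the usage note is illustrative rather than captured from a real pod.

```python
from pathlib import Path


def parse_id_list(text):
    """Expand a kernel list such as '0-3,8' into [0, 1, 2, 3, 8]."""
    ids = []
    for part in text.strip().split(","):
        if not part:
            continue
        low, _, high = part.partition("-")
        ids.extend(range(int(low), int(high or low) + 1))
    return ids


def allowed_from_status(status_text):
    """Extract Cpus_allowed_list and Mems_allowed_list from /proc/<pid>/status."""
    wanted = ("Cpus_allowed_list", "Mems_allowed_list")
    result = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            result[key] = parse_id_list(value)
    return result


if __name__ == "__main__":
    status_file = Path("/proc/self/status")
    if status_file.is_file():
        for name, ids in allowed_from_status(status_file.read_text()).items():
            print(f"{name}: {ids}")
```

Fed a status block containing Cpus_allowed_list: 2-5 and Mems_allowed_list: 0, it returns CPUs [2, 3, 4, 5] and memory node [0], which is the shape you want to see for a pinned pod.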
What success looks like:
On a 2-socket node where each NUMA node has CPUs 0–15 and 16–31:
If Mems_allowed_list still shows 0-1 for the numa-test pod, Memory Manager is either not enabled or the pod doesn’t meet Guaranteed QoS requirements.
Interpretation:
- CPU pinning shows up as a narrower Cpus_allowed_list when CPU Manager allocates exclusive CPUs.
- Memory NUMA pinning shows up as a narrower Mems_allowed_list only if Memory Manager (or another mechanism) constrained cpuset.mems. Memory Manager docs explicitly mention it enforces cpuset.mems on Linux.
4 — Troubleshooting (what failure looks like)
4.1 — TopologyAffinityError
Kubernetes documents this exact failure mode and how it appears:
A pod can show STATUS: TopologyAffinityError, and describe / events will mention topology locality allocation failures.
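To find these pods across a cluster without eyeballing kubectl output, you can filter the JSON form of the pod list. This Python sketch assumes the rejection reason is surfaced in each pod's status.reason field (which is what the STATUS column reflects); treat that as an assumption to confirm on your cluster. The function name is hypothetical.

```python
import json


def rejected_by_topology(pod_list_json):
    """Return 'namespace/name' for pods rejected with TopologyAffinityError."""
    data = json.loads(pod_list_json)
    rejected = []
    for pod in data.get("items", []):
        status = pod.get("status", {})
        if status.get("reason") == "TopologyAffinityError":
            meta = pod.get("metadata", {})
            namespace = meta.get("namespace", "default")
            rejected.append(f"{namespace}/{meta.get('name')}")
    return rejected


if __name__ == "__main__":
    import sys
    # Usage: kubectl get pods -A -o json | python this_script.py
    if not sys.stdin.isatty():
        for pod_name in rejected_by_topology(sys.stdin.read() or "{}"):
            print(pod_name)
```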
4.2 — Known limitation: scheduler is not topology-aware
Topology Manager can reject a pod after it’s scheduled to a node; Kubernetes docs call this out as a limitation. The pod lands on a node, kubelet tries to admit it, and only then discovers the NUMA topology can’t be satisfied — resulting in TopologyAffinityError without the scheduler having any prior knowledge.
What to do about it:
- Node labels + nodeSelector/nodeAffinity: label your NUMA-capable nodes (e.g., topology.kubernetes.io/numa-capable=true) and add a nodeSelector to workloads that require NUMA alignment. This keeps NUMA-sensitive pods off single-NUMA nodes but doesn’t guarantee a specific NUMA node within the machine.
- Resource capacity: if your Guaranteed pod requests 32 CPUs but only 32 CPUs exist on the entire node (16 per NUMA node), single-numa-node will always reject. Right-size pod CPU/memory requests to fit within a single NUMA node’s resources.
- Node Resource Topology (NRT) plugin: the scheduler-plugins project has a NodeResourceTopology plugin that makes the scheduler NUMA-aware by reading per-NUMA-node resource availability. This eliminates the schedule-then-reject cycle but requires deploying the secondary scheduler and populating NRT objects.
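The resource-capacity point is simple arithmetic, and it is worth checking before you roll out single-numa-node. A minimal Python sketch, assuming you already know the free CPUs and memory per NUMA node (the input dict and function name are illustrative, and real free capacity is further reduced by reserved CPUs and memory):

```python
def nodes_that_fit(cpu_request, memory_request_bytes, free_per_numa_node):
    """Return NUMA node ids that can hold the whole request on their own.

    free_per_numa_node maps node id -> (free_cpus, free_memory_bytes).
    """
    return [
        node_id
        for node_id, (free_cpus, free_memory) in sorted(free_per_numa_node.items())
        if free_cpus >= cpu_request and free_memory >= memory_request_bytes
    ]


if __name__ == "__main__":
    gib = 2**30
    # Illustrative 2-socket node: 16 CPUs and 64 GiB free per NUMA node.
    free = {0: (16, 64 * gib), 1: (16, 64 * gib)}
    print(nodes_that_fit(32, 8 * gib, free))  # 32 CPUs cannot fit on one node
    print(nodes_that_fit(8, 8 * gib, free))   # 8 CPUs fit on either node
```

An empty result means single-numa-node would reject the pod on that machine no matter how idle it is.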
4.3 — Policy options and “distance”
Topology Manager is not NUMA-distance-aware unless you enable prefer-closest-numa-nodes.
5 — Operational footgun: stateful managers
Just like CPU Manager (which requires deleting /var/lib/kubelet/cpu_manager_state when changing policies, as discussed in the CPU Pinning guide), Memory Manager is also stateful.
For Memory Manager, Kubernetes troubleshooting docs point to /var/lib/kubelet/memory_manager_state as the internal state dump you can inspect while debugging topology management. If you change memory manager policies, you must drain the node, stop kubelet, and clear this state file.
Conclusion
If you want real NUMA affinity characteristics in Kubernetes, the most reliable stack is:
- CPU Manager static (exclusive CPUs for eligible Guaranteed pods)
- Topology Manager single-numa-node + pod scope (strict NUMA admission, pod-wide grouping)
- Memory Manager Static (enforce memory locality via cpuset.mems for Guaranteed pods)
And then: verify from inside the container (Cpus_allowed_list, Mems_allowed_list) and watch for TopologyAffinityError when policies can’t be satisfied.
References & Further Reading
- Control Memory Management Policies on a Node
- Control Topology Management Policies on a Node
- Control CPU Management Policies on the Node
- scheduler-plugins: NodeResourceTopology — NUMA-aware scheduling plugin
- Karpenter EC2NodeClass spec — kubelet configuration via NodeClass
- EKS: Customizing kubelet configuration (AL2023 nodeadm)
