A Practical Guide to NUMA Affinity in Kubernetes

NUMA effects are one of those problems that don’t show up in dashboards, but will happily show up in your p99 latency and in “why is this box slower than the identical box next to it?”

Kubernetes can help—but only if you enable the right node-level managers and verify the result from inside the workload.

What is NUMA?

Modern multi-socket servers split memory banks across CPU sockets. Each socket and its directly attached memory form a NUMA node (Non-Uniform Memory Access). Accessing memory on your own socket is fast (local); crossing the interconnect to another socket’s memory is slower (remote).

# On a 2-socket server, numactl --hardware shows the cost matrix:
numactl --hardware
# node distances:
# node   0   1
#   0:  10  21    ← from node 0, node 1’s memory is 2.1× the local distance (a relative, firmware-reported value)
#   1:  21  10

These distances are relative values reported by firmware, not measured latencies. Measured remote-access latency is typically around 1.2–1.5× local on 2-socket x86 servers, but can exceed 2× on 4-socket systems or across dies on AMD EPYC. For most stateless web services this is invisible. For latency-sensitive workloads (DPDK packet processing, ML inference, HPC simulations), it can be the difference between meeting your SLA and not.
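To make the ratio concrete, here is a minimal sketch that turns a distance matrix into remote-to-local ratios. It assumes the text layout printed by numactl --hardware; the helper names parse_distances and distance_ratios are made up for illustration, and the result is a ratio of reported distances, not a latency measurement.

```python
def parse_distances(text):
    """Parse the 'node distances' rows of `numactl --hardware` into a matrix."""
    matrix = []
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        # Data rows look like "  0:  10  21"; other lines have no "N:" prefix.
        if sep and head.strip().isdigit():
            matrix.append([int(value) for value in rest.split()])
    return matrix


def distance_ratios(matrix):
    """Return {(src, dst): remote distance / local distance} for src != dst."""
    ratios = {}
    for src, row in enumerate(matrix):
        local = row[src]
        for dst, dist in enumerate(row):
            if dst != src:
                ratios[(src, dst)] = dist / local
    return ratios


SAMPLE = """node distances:
node   0   1
  0:  10  21
  1:  21  10
"""

if __name__ == "__main__":
    for (src, dst), ratio in distance_ratios(parse_distances(SAMPLE)).items():
        print(f"node {src} -> node {dst}: {ratio:.1f}x")
```

On a real node you would feed it the output of numactl --hardware instead of SAMPLE.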

The Problem — The “Cross-NUMA” tax

On multi-socket or multi-NUMA machines, not all CPU cores are equally “close” to all memory and PCIe devices. If a workload ends up with CPUs on one NUMA node and memory (or NIC / GPU) on another, you can pay a real latency / throughput penalty.

Kubernetes does not align these resources by default; alignment is handled by the kubelet’s node resource managers and coordinated by Topology Manager.

A concrete example: Intel’s Topology Manager guide shows large throughput gaps for some DPDK packet sizes when resources are NUMA-aligned vs not aligned.

The Solution — Topology Manager (plus CPU + Memory Managers)

Kubernetes’ approach is “node admission + allocation hints”:

Hint providers (CPU Manager, Device Manager, Memory Manager) generate topology hints (NUMA bitmasks + preference).
Topology Manager merges hints, picks the best one, and checks it against a policy (best-effort / restricted / single-numa-node).
If the policy can’t be satisfied, the pod can be rejected and show up as TopologyAffinityError.
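The merge-then-check flow above can be sketched as a toy model. This is a simplification written for illustration, not kubelet's actual code: a hint is a (bitmask, preferred) tuple, and the function names merge_hints and admit are hypothetical.

```python
from itertools import product


def _better(a, b):
    """Preferred beats non-preferred; among equals, fewer NUMA nodes wins."""
    if a[1] != b[1]:
        return a[1]
    return bin(a[0]).count("1") < bin(b[0]).count("1")


def merge_hints(provider_hints, numa_count):
    """Pick the best merged hint from each provider's list of (mask, preferred)."""
    all_nodes = (1 << numa_count) - 1
    best = None
    for combo in product(*provider_hints):
        mask, preferred = all_nodes, True
        for hint_mask, hint_preferred in combo:
            mask &= hint_mask                 # nodes acceptable to every provider
            preferred = preferred and hint_preferred
        if mask == 0:
            continue                          # providers share no NUMA node
        if best is None or _better((mask, preferred), best):
            best = (mask, preferred)
    # Toy fallback: no agreement means "any node, not preferred".
    return best if best is not None else (all_nodes, False)


def admit(policy, hint):
    """Apply a Topology Manager policy to the merged hint."""
    mask, preferred = hint
    if policy == "best-effort":
        return True
    if policy == "restricted":
        return preferred
    if policy == "single-numa-node":
        return preferred and bin(mask).count("1") == 1
    raise ValueError(f"unknown policy: {policy}")


# CPUs fit on node 0 or node 1; the device sits on node 1 only.
cpu_hints = [(0b01, True), (0b10, True), (0b11, False)]
device_hints = [(0b10, True)]
aligned = merge_hints([cpu_hints, device_hints], numa_count=2)
```

Here aligned ends up as node 1 with preferred set, so all three policies admit the pod. If the CPU hints only offered node 0 as preferred, the merged hint would be non-preferred and only best-effort would admit it.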

Key knobs you’ll actually use

CPU Manager:

static → allows Guaranteed pods with integer CPU requests to get exclusive CPUs (cpuset). (For a deep dive on this, check out my CPU Pinning guide).

Topology Manager policy:

best-effort: try to align, never reject.
restricted: reject if not “preferred”.
single-numa-node: reject unless all required resources can fit on one NUMA node.

Topology Manager scope:

container (default) aligns container-by-container
pod groups all containers in a pod to a common NUMA set — especially useful with single-numa-node.

Memory Manager:

Static (Linux) → provides NUMA hints to Topology Manager and enforces memory locality via cpuset.mems for Guaranteed pods.

1 — Configuration (kubelet)

1.1 — Minimal (CPU + Topology)

This gives you CPU/device locality alignment (and NUMA-aware admission decisions), but does not guarantee memory pages are constrained unless you also enable Memory Manager.

The relevant KubeletConfiguration (v1beta1) fields are cpuManagerPolicy, topologyManagerPolicy, and topologyManagerScope:

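apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

# The static policy needs a non-zero CPU reservation (kubeReserved,
# systemReserved, or reservedSystemCPUs); otherwise kubelet refuses to start.
kubeReserved:
  cpu: "500m"

cpuManagerPolicy: static

topologyManagerPolicy: single-numa-node
topologyManagerScope: pod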

Why cpuManagerPolicy: static matters: Kubernetes’ docs note that aligning CPU resources with other resources requires CPU Manager to be enabled with a suitable policy.

1.2 — “Full NUMA affinity” (CPU + Topology + Memory)

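If you want kubelet to constrain memory NUMA nodes for Guaranteed pods, enable Memory Manager Static. The feature is documented as stable in Kubernetes v1.32+ (the feature gate is on by default), but the policy itself still defaults to None, so you have to set Static explicitly. Once enabled, it feeds hints into Topology Manager.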

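apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

cpuManagerPolicy: static

topologyManagerPolicy: single-numa-node
topologyManagerScope: pod

# reservedMemory must add up to kubeReserved + systemReserved + the hard
# eviction threshold (100Mi by default): 1Gi + 924Mi + 100Mi = 2Gi.
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
systemReserved:
  cpu: "500m"
  memory: "924Mi"

memoryManagerPolicy: Static
reservedMemory:
  - numaNode: 0
    limits:
      memory: "1Gi"
  - numaNode: 1
    limits:
      memory: "1Gi"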

Memory Manager Static requires reserved memory configuration, and the docs spell out the constraint: the reservedMemory values summed across NUMA nodes must equal kubeReserved + systemReserved + the hard eviction threshold for memory.

Note: The default hard eviction threshold is 100MiB, so it has to be included in reservedMemory even if you never set evictionHard yourself. If the totals do not match, the kubelet will not start Memory Manager and reports an error.
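Because a mismatch stops Memory Manager from starting, it can be worth checking the arithmetic before rolling out a config. The sketch below assumes the documented rule that the per-NUMA reservedMemory values must add up to kubeReserved + systemReserved + the hard eviction threshold for memory. The function names are hypothetical, and the parser only handles integer quantities with binary suffixes.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity):
    """Convert quantities like '924Mi' or '1Gi' to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


def reserved_memory_gap(reserved_per_numa, kube_reserved, system_reserved,
                        eviction_hard="100Mi"):
    """Return configured minus expected reserved memory, in bytes (0 means OK)."""
    expected = (to_bytes(kube_reserved) + to_bytes(system_reserved)
                + to_bytes(eviction_hard))
    configured = sum(to_bytes(q) for q in reserved_per_numa)
    return configured - expected


gap = reserved_memory_gap(["1Gi", "1Gi"], kube_reserved="1Gi",
                          system_reserved="924Mi")
```

A gap of 0 means the totals line up. A positive or negative gap tells you how many bytes to move between the reservations.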

1.3 — Optional but useful Topology Manager policy options

prefer-closest-numa-nodes=true: for best-effort and restricted policies, when an allocation has to span more than one NUMA node, this makes Topology Manager favor sets of NUMA nodes with the shortest distance between them (it already prefers the hint with the fewest NUMA nodes by default). Reduces cross-NUMA cost without hard rejection. Introduced in Kubernetes v1.26 and documented as GA since v1.32.
max-allowable-numa-nodes: by default Topology Manager does not support nodes with more than 8 NUMA nodes (a safeguard against combinatorial explosion in hint calculation). Set this policy option to raise the limit. Per the docs, enabling it on high-NUMA-count nodes is “at your own risk.”
full-pcpus-only=true (CPU Manager option): allocates whole physical cores rather than individual hyperthreads. Useful when you want to avoid hyperthreading interference between tenants.
align-by-socket=true (CPU Manager option): aligns CPU allocations to socket boundaries rather than just NUMA node boundaries. Relevant on systems where a socket contains multiple NUMA nodes (common on AMD EPYC). The docs list it as incompatible with the single-numa-node Topology Manager policy.

1.4 — Which policy to pick

Situation	Recommended policy
Latency-sensitive, fits on one NUMA node	`single-numa-node` + `pod` scope
Large workload, NUMA alignment preferred but not mandatory	`restricted` + `prefer-closest-numa-nodes=true`
Dev/test or you just want observability	`best-effort`
ML/GPU workload (GPU on a specific NUMA node)	`single-numa-node` + Device Manager

Start with best-effort to observe what Topology Manager would do, then tighten to restricted or single-numa-node once you’ve confirmed nodes have enough NUMA-local resources for your pod sizes.

2 — Hands-On: Configuration examples by environment

2.1 — Local (Minikube)

Kubelet exposes flags for Topology Manager policy/scope and Memory Manager policy.

Example (CPU + Topology):

minikube start \
  --extra-config=kubelet.cpu-manager-policy=static \
  --extra-config=kubelet.topology-manager-policy=single-numa-node \
  --extra-config=kubelet.topology-manager-scope=pod

Note: many laptops/VMs report NUMA node(s): 1, so you won’t see meaningful NUMA placement differences locally. Use lscpu | grep "NUMA node(s)" to confirm.

2.2 — EKS (managed node groups via launch template)

EKS managed nodes run AL2 or AL2023. You inject extra kubelet flags via the bootstrap script in your launch template user data.

AL2 (amazon-eks-node-*):

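#!/bin/bash
# --reserved-memory must add up to kube-reserved + system-reserved + the
# 100Mi default hard eviction threshold: 1Gi + 924Mi + 100Mi = 2Gi.
KUBELET_ARGS="--cpu-manager-policy=static"
KUBELET_ARGS+=" --topology-manager-policy=single-numa-node"
KUBELET_ARGS+=" --topology-manager-scope=pod"
KUBELET_ARGS+=" --memory-manager-policy=Static"
KUBELET_ARGS+=" --reserved-memory=0:memory=1Gi"
KUBELET_ARGS+=" --reserved-memory=1:memory=1Gi"
KUBELET_ARGS+=" --system-reserved=cpu=500m,memory=924Mi"
KUBELET_ARGS+=" --kube-reserved=cpu=500m,memory=1Gi"

# Pass everything as one single-line argument string.
/etc/eks/bootstrap.sh my-cluster --kubelet-extra-args "$KUBELET_ARGS"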

AL2023 (nodeadm config):

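# /etc/eks/nodeadm-config.yaml (injected via user data)
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: my-cluster
    apiServerEndpoint: https://<endpoint>
    certificateAuthority: <ca>
  kubelet:
    config:
      cpuManagerPolicy: static
      topologyManagerPolicy: single-numa-node
      topologyManagerScope: pod
      # Set the reservations explicitly so reservedMemory adds up to
      # kubeReserved + systemReserved + the 100Mi default hard eviction
      # threshold: 1Gi + 924Mi + 100Mi = 2Gi.
      kubeReserved:
        cpu: "500m"
        memory: "1Gi"
      systemReserved:
        cpu: "500m"
        memory: "924Mi"
      memoryManagerPolicy: Static
      reservedMemory:
        - numaNode: 0
          limits:
            memory: "1Gi"
        - numaNode: 1
          limits:
            memory: "1Gi"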

Note: Topology Manager and Memory Manager only make a difference on instance types that expose more than one NUMA node. On EC2 that is generally the largest sizes of multi-socket families (for example c5n, m5n, r5n) and bare-metal instances such as c5.metal or m6i.metal; smaller sizes, including most t3 and small-to-mid m5 sizes, present a single NUMA node. Confirm with lscpu on the instance before relying on it.

2.3 — EKS with Karpenter

If you use Karpenter for node provisioning, inject kubelet configuration via the NodeClass:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: numa-workers
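spec:
  # Other required fields (role, amiSelectorTerms, subnetSelectorTerms,
  # securityGroupSelectorTerms) are omitted here for brevity.
  amiFamily: AL2023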
  userData: |
    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      kubelet:
        config:
          cpuManagerPolicy: static
          topologyManagerPolicy: single-numa-node
          topologyManagerScope: pod

Pair this with a NodePool that restricts provisioning to multi-NUMA instance types (for example, a requirement on karpenter.k8s.aws/instance-family combined with a size constraint; check the NUMA count of each size with lscpu). On a single-NUMA instance, single-numa-node costs nothing but also buys nothing: every allocation trivially fits the only NUMA node, so the policy should never have anything to reject.

2.4 — Bottlerocket nodes

Bottlerocket documents first-class settings for:

settings.kubernetes.topology-manager-policy
settings.kubernetes.topology-manager-scope
settings.kubernetes.memory-manager-policy (+ reserved memory)

Example TOML:

[settings.kubernetes]
cpu-manager-policy = "static"
topology-manager-policy = "single-numa-node"
topology-manager-scope = "pod"
memory-manager-policy = "Static"

[settings.kubernetes.memory-manager-reserved-memory.0]
enabled = true
memory = "1Gi"

[settings.kubernetes.memory-manager-reserved-memory.1]
enabled = true
memory = "1Gi"

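Bottlerocket explicitly warns that misconfigured memory reservations can prevent kubelet from starting. As with plain kubelet config, the per-NUMA reservations must add up to the kube-reserved and system-reserved memory plus the hard eviction threshold.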

3 — Hands-on verification

3.1 — Confirm your node really has NUMA domains

On the node (or a privileged debug pod), check:

lscpu | grep -E "Socket|NUMA node\(s\)"
numactl --hardware || true

Also useful to see how CPUs map to NUMA nodes:

for n in /sys/devices/system/node/node*/cpulist; do echo "$n: $(cat "$n")"; done
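If you want the same mapping in a script, the cpulist files use the kernel's range-list format (for example 0-15,32-47). A minimal sketch, with hypothetical helper names, that expands that format and builds a node-to-CPUs map from sysfs:

```python
import glob
import os


def parse_cpulist(text):
    """Expand a kernel list such as '0-3,8,10-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)


def numa_cpu_map(root="/sys/devices/system/node"):
    """Return {numa_node: [cpu, ...]} by reading nodeN/cpulist under sysfs."""
    mapping = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*", "cpulist")):
        node_dir = os.path.basename(os.path.dirname(path))  # e.g. "node0"
        with open(path) as handle:
            mapping[int(node_dir[len("node"):])] = parse_cpulist(handle.read())
    return mapping
```

Calling numa_cpu_map() on a Linux node returns one entry per NUMA node; on a machine without that sysfs tree it simply returns an empty dict.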

To check memory allocation per NUMA node (whether memory is actually local):

numastat -m          # per-NUMA memory statistics (requires numactl package)
numastat -p <pid>    # per-NUMA memory mapping for a specific process

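numastat -p reports how much of a process’s memory sits on each NUMA node. If a large share sits on a node other than the one its CPUs are pinned to, you have a cross-NUMA memory placement problem: either Memory Manager isn’t configured, or the pod isn’t Guaranteed QoS. The system-wide numa_miss and numa_foreign counters (plain numastat, no arguments) show the same effect at node level.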
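When numastat is not installed, the kernel exposes similar per-process data in /proc/<pid>/numa_maps, where each mapping carries N<node>=<pages> fields. The following is a rough sketch that totals those fields per node; the function name is hypothetical, and it assumes 4 kB pages whenever a line has no kernelpagesize_kB field.

```python
import re
from collections import Counter

NODE_PAGES = re.compile(r"\bN(\d+)=(\d+)")
PAGE_SIZE = re.compile(r"\bkernelpagesize_kB=(\d+)")


def memory_per_node_kb(numa_maps_text):
    """Sum resident memory per NUMA node (in kB) from numa_maps content."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        size = PAGE_SIZE.search(line)
        page_kb = int(size.group(1)) if size else 4  # assumed default page size
        for node, pages in NODE_PAGES.findall(line):
            totals[int(node)] += int(pages) * page_kb
    return dict(totals)


SAMPLE = """\
00400000 default file=/bin/app mapped=10 N0=8 N1=2 kernelpagesize_kB=4
7f0000000000 default anon=512 dirty=512 N1=512 kernelpagesize_kB=4
"""

per_node = memory_per_node_kb(SAMPLE)
```

In this made-up sample almost all memory lives on node 1, which would be a red flag for a process pinned to node 0. Reading another process's numa_maps usually requires root.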

3.2 — Deploy a “regular” pod vs a “NUMA-eligible” pod

Regular pod (shared pool / not eligible for exclusive CPU):

apiVersion: v1
kind: Pod
metadata:
  name: floating-test
spec:
  containers:
  - name: test
    image: busybox
    command: ["sh","-c","sleep 3600"]
    resources:
      requests:
        cpu: "200m"
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"

NUMA-eligible pod (Guaranteed + integer CPU, and (optionally) memory that Memory Manager can act on):

apiVersion: v1
kind: Pod
metadata:
  name: numa-test
spec:
  containers:
  - name: test
    image: busybox
    command: ["sh","-c","sleep 3600"]
    resources:
      requests:
        cpu: "2"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "1Gi"

Why this matters:

CPU Manager static only gives exclusive CPUs to Guaranteed pods with integer CPU requests. (See the CPU Pinning guide for the exact mechanics).
Memory Manager Static only guarantees/enforces memory placement for Guaranteed pods.
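The two eligibility rules are easy to check mechanically before you deploy. Below is a small sketch, with hypothetical function names, that takes the containers list of a pod spec as plain dicts. It simplifies in two ways: memory quantities are compared as strings, and init or ephemeral containers are ignored.

```python
def cpu_millis(quantity):
    """Convert a CPU quantity such as '2', '0.5' or '500m' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def is_guaranteed(containers):
    """Guaranteed QoS: every container has cpu and memory requests == limits."""
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        if "cpu" not in limits or "memory" not in limits:
            return False
        # Requests default to the limits when omitted.
        if cpu_millis(requests.get("cpu", limits["cpu"])) != cpu_millis(limits["cpu"]):
            return False
        # Simplification: memory quantities are compared as strings.
        if requests.get("memory", limits["memory"]) != limits["memory"]:
            return False
    return True


def exclusive_cpu_containers(containers):
    """Names of containers that would qualify for exclusive CPUs (static policy)."""
    if not is_guaranteed(containers):
        return []
    names = []
    for container in containers:
        millis = cpu_millis(container["resources"]["limits"]["cpu"])
        if millis > 0 and millis % 1000 == 0:  # whole CPUs only
            names.append(container["name"])
    return names


FLOATING = [{"name": "test", "resources": {
    "requests": {"cpu": "200m", "memory": "128Mi"},
    "limits": {"cpu": "500m", "memory": "256Mi"}}}]
PINNED = [{"name": "test", "resources": {
    "requests": {"cpu": "2", "memory": "1Gi"},
    "limits": {"cpu": "2", "memory": "1Gi"}}}]
```

FLOATING mirrors the floating-test pod and yields no eligible containers; PINNED mirrors numa-test and yields its single container.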

HugePages and NUMA: For HPC and DPDK workloads, hugepages are almost always paired with NUMA affinity. Hugepages can be pre-allocated per NUMA node (/sys/devices/system/node/node0/hugepages/), and Memory Manager is aware of hugepage NUMA placement when hugepages-<size> resources are requested in the pod spec alongside regular memory. If your workload uses hugepages-1Gi or hugepages-2Mi, add them to both requests and limits (required for Guaranteed QoS) and they’ll be included in Topology Manager’s hint calculation.

3.3 — Verify CPU + memory NUMA constraints from inside the container

kubectl exec numa-test -- sh -c 'grep -E "Cpus_allowed_list|Mems_allowed_list" /proc/self/status'

What success looks like:

On a 2-socket node where NUMA node 0 has CPUs 0–15 and NUMA node 1 has CPUs 16–31:

# floating-test (shared pool, no pinning):
Cpus_allowed_list:   0-31       ← can run on any CPU
Mems_allowed_list:   0-1        ← can allocate from any NUMA node

# numa-test (Guaranteed, CPU Manager + Memory Manager active):
Cpus_allowed_list:   0-1        ← pinned to 2 exclusive CPUs on NUMA node 0
Mems_allowed_list:   0          ← memory constrained to NUMA node 0 only

If Mems_allowed_list still shows 0-1 for the numa-test pod, Memory Manager is either not enabled or the pod doesn’t meet Guaranteed QoS requirements.

Interpretation:

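CPU pinning shows up as a narrower Cpus_allowed_list when CPU Manager allocates exclusive CPUs. The exact CPU IDs will differ on your node, and the shared pool seen by floating-test shrinks by whatever has been handed out exclusively.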
Memory NUMA pinning shows up as a narrower Mems_allowed_list only if Memory Manager (or another mechanism) constrained cpuset.mems. Memory Manager docs explicitly mention it enforces cpuset.mems on Linux.
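The same check can be automated, for example in a startup probe or a CI smoke test. This sketch parses the two fields out of /proc/<pid>/status text and reports which NUMA nodes the allowed CPUs touch. The helper names are made up, and the node-to-CPU layout is passed in by the caller rather than discovered.

```python
def allowed_lists(status_text):
    """Extract Cpus_allowed_list and Mems_allowed_list from /proc/<pid>/status."""
    wanted = {"Cpus_allowed_list", "Mems_allowed_list"}
    found = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            found[key] = value.strip()
    return found


def expand(list_text):
    """Expand a kernel range list such as '0-1,16' into a sorted list of ints."""
    values = set()
    for part in list_text.strip().split(","):
        if part:
            first, _, last = part.partition("-")
            values.update(range(int(first), int(last or first) + 1))
    return sorted(values)


def nodes_spanned(cpus, node_cpus):
    """Return the NUMA nodes whose CPU sets overlap the given CPUs."""
    return sorted(node for node, members in node_cpus.items()
                  if set(cpus) & set(members))


# Layout from the example above: node 0 = CPUs 0-15, node 1 = CPUs 16-31.
LAYOUT = {0: range(0, 16), 1: range(16, 32)}
STATUS = "Name:\tsleep\nCpus_allowed_list:\t0-1\nMems_allowed_list:\t0\n"

fields = allowed_lists(STATUS)
cpu_nodes = nodes_spanned(expand(fields["Cpus_allowed_list"]), LAYOUT)
mem_nodes = expand(fields["Mems_allowed_list"])
```

Alignment holds when cpu_nodes and mem_nodes are the same single node. Inside the container you would read /proc/self/status instead of the STATUS sample.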

4 — Troubleshooting (what failure looks like)

4.1 — TopologyAffinityError

Kubernetes documents this exact failure mode and how it appears:

kubectl get pods
kubectl describe pod numa-test
kubectl get events --sort-by=.lastTimestamp

A pod can show STATUS: TopologyAffinityError, and describe / events will mention topology locality allocation failures.
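To find these rejections across a cluster, you can filter the JSON that kubectl get pods -A -o json prints. The sketch below assumes the rejection is recorded in the pod's status.reason field, which is what the STATUS column displays; the function name and the sample document are invented for illustration.

```python
import json


def topology_rejections(pod_list_json):
    """Return (namespace, name, message) for pods with TopologyAffinityError."""
    rejected = []
    for pod in json.loads(pod_list_json).get("items", []):
        status = pod.get("status", {})
        if status.get("reason") == "TopologyAffinityError":
            meta = pod.get("metadata", {})
            rejected.append((meta.get("namespace"), meta.get("name"),
                             status.get("message", "")))
    return rejected


# Hand-written sample in the shape of `kubectl get pods -o json` output.
SAMPLE = json.dumps({"items": [
    {"metadata": {"namespace": "default", "name": "numa-test"},
     "status": {"phase": "Failed", "reason": "TopologyAffinityError",
                "message": "sample rejection message"}},
    {"metadata": {"namespace": "default", "name": "floating-test"},
     "status": {"phase": "Running"}},
]})

rejected = topology_rejections(SAMPLE)
```

In practice you would pass the real kubectl output (for example read from standard input) instead of SAMPLE.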

4.2 — Known limitation: scheduler is not topology-aware

Topology Manager can reject a pod after it’s scheduled to a node; Kubernetes docs call this out as a limitation. The pod lands on a node, kubelet tries to admit it, and only then discovers the NUMA topology can’t be satisfied — resulting in TopologyAffinityError without the scheduler having any prior knowledge.

What to do about it:

Node labels + nodeSelector/nodeAffinity: label your NUMA-capable nodes with a label under a domain you own (e.g., example.com/numa-capable=true, a made-up name; the kubernetes.io and k8s.io prefixes are reserved for Kubernetes itself) and add a nodeSelector to workloads that require NUMA alignment. This keeps NUMA-sensitive pods off single-NUMA nodes but doesn’t guarantee a specific NUMA node within the machine.
Resource capacity: single-numa-node rejects any pod that needs more than one NUMA node can offer. On a 32-CPU node with 16 CPUs per NUMA node, a Guaranteed pod requesting 24 CPUs will always be rejected, even on an otherwise empty node. Right-size pod CPU/memory requests to fit within a single NUMA node’s resources.
Node Resource Topology (NRT) plugin: the scheduler-plugins project has a NodeResourceTopology plugin that makes the scheduler NUMA-aware by reading per-NUMA-node resource availability. This eliminates the schedule-then-reject cycle but requires deploying the secondary scheduler and populating NRT objects.
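The gap between what the scheduler sees and what the kubelet enforces comes down to node-level versus per-NUMA accounting. A toy comparison makes that visible; the function names and resource keys are hypothetical, and the free-resource numbers are invented.

```python
def fits_node(request, free_by_node):
    """Node-level view: compare the request against the sum over NUMA nodes."""
    return all(
        sum(free.get(resource, 0) for free in free_by_node.values()) >= amount
        for resource, amount in request.items()
    )


def fitting_numa_nodes(request, free_by_node):
    """Per-NUMA view: nodes that can hold the whole request on their own."""
    return sorted(
        node for node, free in free_by_node.items()
        if all(free.get(resource, 0) >= amount
               for resource, amount in request.items())
    )


# Invented free capacity on a 2-NUMA machine.
FREE = {0: {"cpu": 14, "memory_gib": 60}, 1: {"cpu": 16, "memory_gib": 64}}

small = {"cpu": 16, "memory_gib": 32}
large = {"cpu": 24, "memory_gib": 32}
```

The small request fits on NUMA node 1. The large one passes the node-level check (30 CPUs are free in total) but fits on no single NUMA node, which is exactly the case where single-numa-node rejects a pod after scheduling.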

4.3 — Policy options and “distance”

Topology Manager is not NUMA-distance-aware unless you enable prefer-closest-numa-nodes.
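Conceptually, distance awareness means ranking candidate NUMA sets by how far apart their members are. The sketch below illustrates that idea with an average over all node pairs; it is not kubelet's implementation, and the distance matrix is an invented 4-node example.

```python
from itertools import combinations


def average_distance(nodes, distances):
    """Mean distance over all ordered pairs of the given nodes (self included)."""
    pairs = [(a, b) for a in nodes for b in nodes]
    return sum(distances[a][b] for a, b in pairs) / len(pairs)


def closest_node_set(size, distances):
    """Pick the set of `size` NUMA nodes with the lowest average distance."""
    candidates = combinations(range(len(distances)), size)
    return min(candidates, key=lambda nodes: average_distance(nodes, distances))


# Invented matrix: nodes 0/1 share a socket, nodes 2/3 share another.
DISTANCES = [
    [10, 12, 32, 32],
    [12, 10, 32, 32],
    [32, 32, 10, 12],
    [32, 32, 12, 10],
]

best_pair = closest_node_set(2, DISTANCES)
```

Here best_pair is (0, 1): both same-socket pairs tie at an average of 11, and min keeps the first one it sees. Without distance information, a pair such as (0, 2) with an average of 21 would look just as good as any other two-node set.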

5 — Operational footgun: stateful managers

Just like CPU Manager (which requires deleting /var/lib/kubelet/cpu_manager_state when changing policies, as discussed in the CPU Pinning guide), Memory Manager is also stateful.

For Memory Manager, Kubernetes troubleshooting docs point to /var/lib/kubelet/memory_manager_state as the internal state dump you can inspect while debugging topology management. If you change the Memory Manager policy, drain the node, stop kubelet, delete this state file, and then start kubelet again.
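Before restarting kubelet with a new policy, you can check whether a stale state file is lying around. This sketch assumes the state file is a JSON document with a top-level policyName field; verify that against the file on your kubelet version, since the format is an internal detail. The function names and the sample content are hypothetical.

```python
import json


def state_policy(state_text):
    """Return the policy name recorded in a kubelet manager state file."""
    return json.loads(state_text).get("policyName")


def needs_reset(state_text, desired_policy):
    """True when the recorded policy differs from the one you are about to set."""
    return state_policy(state_text) != desired_policy


# Invented content in the assumed shape of memory_manager_state.
SAMPLE_STATE = '{"policyName": "None", "machineState": {}, "entries": {}}'

stale = needs_reset(SAMPLE_STATE, "Static")
```

If stale is true, follow the drain, stop, delete, start sequence rather than just restarting kubelet.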

Conclusion

If you want real NUMA affinity characteristics in Kubernetes, the most reliable stack is:

CPU Manager static (exclusive CPUs for eligible Guaranteed pods)
Topology Manager single-numa-node + pod scope (strict NUMA admission, pod-wide grouping)
Memory Manager Static (enforce memory locality via cpuset.mems for Guaranteed pods)

And then: verify from inside the container (Cpus_allowed_list, Mems_allowed_list) and watch for TopologyAffinityError when policies can’t be satisfied.

References & Further Reading

Control Memory Management Policies on a Node
Control Topology Management Policies on a Node
Control CPU Management Policies on the Node
scheduler-plugins: NodeResourceTopology — NUMA-aware scheduling plugin
Karpenter EC2NodeClass spec — kubelet configuration via NodeClass
EKS: Customizing kubelet configuration (AL2023 nodeadm)
