NUMA effects are one of those problems that don’t show up in dashboards, but will happily show up in your p99 latency and in “why is this box slower than the identical box next to it?”

Kubernetes can help—but only if you enable the right node-level managers and verify the result from inside the workload.

The Problem — The “Cross-NUMA” tax

On multi-socket or multi-NUMA machines, not all CPU cores are equally “close” to all memory and PCIe devices. If a workload ends up with CPUs on one NUMA node and memory (or NIC / GPU) on another, you can pay a real latency / throughput penalty.

Kubernetes doesn’t align these resources by default; alignment is handled by kubelet’s node resource managers and coordinated by Topology Manager.

A concrete example: Intel’s Topology Manager guide shows large throughput gaps for some DPDK packet sizes when resources are NUMA-aligned vs not aligned.

The Solution — Topology Manager (plus CPU + Memory Managers)

Kubernetes’ approach is “node admission + allocation hints”:

  • Hint providers (CPU Manager, Device Manager, Memory Manager) generate topology hints (NUMA bitmasks + preference).
  • Topology Manager merges hints, picks the best one, and checks it against a policy (best-effort / restricted / single-numa-node).
  • If the policy can’t be satisfied, the pod can be rejected and show up as TopologyAffinityError.
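The merge step is essentially an intersection across providers: Topology Manager tries every combination of one hint per provider, ANDs the NUMA bitmasks, and keeps the narrowest viable result. A toy sketch of the idea (not the real kubelet code; all names are illustrative):

```python
from itertools import product

def merge_hints(providers):
    """Toy merge: across one hint from each provider, pick the
    combination whose intersected NUMA mask is narrowest, favoring
    combinations every provider marks as preferred."""
    best = None
    for combo in product(*providers):
        mask = combo[0]["numa"]
        preferred = True
        for hint in combo:
            mask &= hint["numa"]          # intersect NUMA bitmasks
            preferred &= hint["preferred"]
        if mask == 0:
            continue                      # no common NUMA node
        width = bin(mask).count("1")
        cand = (not preferred, width, mask)
        if best is None or cand < best:
            best = cand
    if best is None:
        return None                       # nothing viable -> TopologyAffinityError
    return {"numa": best[2], "preferred": not best[0]}

# CPU Manager can satisfy the request on node 0; the NIC is on node 0 too:
cpu_hints = [{"numa": 0b01, "preferred": True}, {"numa": 0b11, "preferred": False}]
dev_hints = [{"numa": 0b01, "preferred": True}]
print(merge_hints([cpu_hints, dev_hints]))  # -> {'numa': 1, 'preferred': True}
```

The policy then judges the merged result: single-numa-node rejects anything spanning more than one node, restricted rejects non-preferred results, and best-effort admits regardless.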

Key knobs you’ll actually use

CPU Manager:

  • static → allows Guaranteed pods with integer CPU requests to get exclusive CPUs (cpuset). (For a deep dive on this, check out my CPU Pinning guide).

Topology Manager policy:

  • best-effort: try to align, but never reject a pod.
  • restricted: reject the pod if the merged hint is not "preferred".
  • single-numa-node: reject unless all requested resources fit on a single NUMA node.

Topology Manager scope:

  • container (default): aligns resources container by container.
  • pod: groups all containers in a pod onto a common NUMA set, which is especially useful with single-numa-node.

Memory Manager:

  • Static (Linux) → provides NUMA hints to Topology Manager and enforces memory locality via cpuset.mems for Guaranteed pods.

1 — Configuration (kubelet)

1.1 — Minimal (CPU + Topology)

This gives you CPU/device locality alignment (and NUMA-aware admission decisions), but does not guarantee memory pages are constrained unless you also enable Memory Manager.

In the kubelet config (KubeletConfiguration v1beta1), the relevant fields are cpuManagerPolicy, topologyManagerPolicy, and topologyManagerScope.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

cpuManagerPolicy: static

topologyManagerPolicy: single-numa-node
topologyManagerScope: pod

Why cpuManagerPolicy: static matters: Kubernetes’ docs note that aligning CPU resources with other resources requires CPU Manager to be enabled with a suitable policy.

1.2 — “Full NUMA affinity” (CPU + Topology + Memory)

If you want kubelet to constrain memory NUMA nodes for Guaranteed pods, enable Memory Manager Static. It’s documented as stable (enabled by default) in Kubernetes v1.32+, and it feeds hints into Topology Manager.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

cpuManagerPolicy: static

topologyManagerPolicy: single-numa-node
topologyManagerScope: pod

memoryManagerPolicy: Static
reservedMemory:
  - numaNode: 0
    limits:
      memory: "1Gi"
  - numaNode: 1
    limits:
      memory: "1Gi"

Memory Manager Static requires reserved memory configuration, and the docs call out constraints (including eviction thresholds).

Note: the default hard eviction threshold for memory is 100MiB. Increase the amount you reserve via reservedMemory by that threshold; otherwise kubelet will refuse to start Memory Manager and report an error.
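As a sanity check: the documented rule is that reservedMemory summed across all NUMA nodes must equal the node's total memory reservation (kube-reserved + system-reserved + the hard eviction threshold). A quick sketch of that arithmetic, with illustrative reservation values and an arbitrary split across two nodes:

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Illustrative node-level reservations (match these to your kubelet config):
kube_reserved   = 1 * GIB     # kubeReserved memory
system_reserved = 512 * MIB   # systemReserved memory
eviction_hard   = 100 * MIB   # default hard eviction threshold for memory

# reservedMemory summed over NUMA nodes must equal this total,
# or kubelet refuses to start Memory Manager:
total_reservation = kube_reserved + system_reserved + eviction_hard

reserved_memory = {0: 1 * GIB + 100 * MIB,  # numaNode: 0
                   1: 512 * MIB}            # numaNode: 1

assert sum(reserved_memory.values()) == total_reservation
print(f"total reservation: {total_reservation // MIB} MiB")  # -> total reservation: 1636 MiB
```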

1.3 — Optional but useful Topology Manager policy options

  • prefer-closest-numa-nodes=true makes best-effort and restricted prefer NUMA node sets with shorter distances between them; without it, Topology Manager is not distance-aware. (It is GA, though visibility can depend on feature gates in older versions.)
  • max-allowable-numa-nodes=<N> raises the cutoff above which Topology Manager is disabled (by default it won't run on nodes with more than 8 NUMA nodes); the docs flag raising it as explicitly "at your own risk".

2 — Hands-On: Configuration examples by environment

2.1 — Local (Minikube)

Kubelet exposes flags for Topology Manager policy/scope and Memory Manager policy.

Example (CPU + Topology):

minikube start \
  --extra-config=kubelet.cpu-manager-policy=static \
  --extra-config=kubelet.topology-manager-policy=single-numa-node \
  --extra-config=kubelet.topology-manager-scope=pod

Note: many laptops/VMs report NUMA node(s): 1, so you won’t see meaningful NUMA placement differences locally. Use lscpu | grep "NUMA node(s)" to confirm.

2.2 — EKS (high-level patterns)

EKS specifics vary by node type and provisioning method, but the AWS docs are clear that launch templates can pass bootstrap arguments (including extra kubelet args) for managed nodes.

A common pattern is passing kubelet flags like:

  • --cpu-manager-policy=static
  • --topology-manager-policy=single-numa-node
  • --topology-manager-scope=pod

Those flags and allowed values are in kubelet reference.
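For example, on self-managed or AL2-based managed nodes that still use the bootstrap script, the launch template's user data can forward those flags. This is an illustrative fragment only: the cluster name is a placeholder, and newer AL2023 AMIs configure kubelet via nodeadm/NodeConfig instead of bootstrap.sh.

```shell
#!/bin/bash
# Launch-template user data (illustrative) for an AL2-based EKS node.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--cpu-manager-policy=static --topology-manager-policy=single-numa-node --topology-manager-scope=pod'
```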

2.3 — Bottlerocket nodes

Bottlerocket documents first-class settings for:

  • settings.kubernetes.topology-manager-policy
  • settings.kubernetes.topology-manager-scope
  • settings.kubernetes.memory-manager-policy (+ reserved memory)

Example TOML:

[settings.kubernetes]
cpu-manager-policy = "static"
topology-manager-policy = "single-numa-node"
topology-manager-scope = "pod"
memory-manager-policy = "Static"

[settings.kubernetes.memory-manager-reserved-memory.0]
enabled = true
memory = "1Gi"

[settings.kubernetes.memory-manager-reserved-memory.1]
enabled = true
memory = "1Gi"

Bottlerocket explicitly warns misconfiguring memory reservations can prevent kubelet from starting.

3 — Hands-on verification

3.1 — Confirm your node really has NUMA domains

On the node (or a privileged debug pod), check:

lscpu | grep -E "Socket|NUMA node\(s\)"
numactl --hardware || true

Also useful to see how CPUs map to NUMA nodes:

for n in /sys/devices/system/node/node*/cpulist; do echo "$n: $(cat $n)"; done

3.2 — Deploy a “regular” pod vs a “NUMA-eligible” pod

Regular pod (shared pool / not eligible for exclusive CPU):

apiVersion: v1
kind: Pod
metadata:
  name: floating-test
spec:
  containers:
  - name: test
    image: busybox
    command: ["sh","-c","sleep 3600"]
    resources:
      requests:
        cpu: "200m"
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"

NUMA-eligible pod (Guaranteed + integer CPU, and (optionally) memory that Memory Manager can act on):

apiVersion: v1
kind: Pod
metadata:
  name: numa-test
spec:
  containers:
  - name: test
    image: busybox
    command: ["sh","-c","sleep 3600"]
    resources:
      requests:
        cpu: "2"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "1Gi"

Why this matters:

  • CPU Manager static only gives exclusive CPUs to Guaranteed pods with integer CPU requests. (See the CPU Pinning guide for the exact mechanics).
  • Memory Manager Static only guarantees/enforces memory placement for Guaranteed pods.
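The eligibility rules above boil down to two small predicates: a pod is Guaranteed when every container's requests equal its limits for both CPU and memory, and exclusive CPUs additionally require an integer CPU quantity. A simplified sketch (not the real kubelet logic; names are illustrative):

```python
def qos_is_guaranteed(containers):
    """Simplified: Guaranteed iff every container sets cpu+memory
    limits and its requests equal its limits."""
    for c in containers:
        for res in ("cpu", "memory"):
            if res not in c["limits"]:
                return False
            if c["requests"].get(res, c["limits"][res]) != c["limits"][res]:
                return False
    return True

def eligible_for_exclusive_cpus(container):
    """CPU Manager static: whole-number CPU count, >= 1 core."""
    cpu = float(container["limits"]["cpu"])
    return cpu >= 1 and cpu == int(cpu)

floating = {"requests": {"cpu": 0.2, "memory": "128Mi"},
            "limits":   {"cpu": 0.5, "memory": "256Mi"}}
numa     = {"requests": {"cpu": 2, "memory": "1Gi"},
            "limits":   {"cpu": 2, "memory": "1Gi"}}

print(qos_is_guaranteed([floating]))  # -> False (requests != limits)
print(qos_is_guaranteed([numa]), eligible_for_exclusive_cpus(numa))  # -> True True
```

This is why the floating-test pod above stays in the shared pool while numa-test can receive exclusive CPUs and pinned memory.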

3.3 — Verify CPU + memory NUMA constraints from inside the container

kubectl exec numa-test -- sh -c 'grep -E "Cpus_allowed_list|Mems_allowed_list" /proc/self/status'

Interpretation:

  • CPU pinning shows up as a narrower Cpus_allowed_list when CPU Manager allocates exclusive CPUs.
  • Memory NUMA pinning shows up as a narrower Mems_allowed_list only if Memory Manager (or another mechanism) constrained cpuset.mems. Memory Manager docs explicitly mention it enforces cpuset.mems on Linux.
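The same two fields can be inspected on any Linux machine, not just inside a pod; running this on the host (or in an unconstrained container) shows the wide baseline you would see without pinning:

```shell
# On an unconstrained process both lists span the whole machine;
# under CPU/Memory Manager pinning they narrow to the allocated set.
grep -E "Cpus_allowed_list|Mems_allowed_list" /proc/self/status
```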

4 — Troubleshooting (what failure looks like)

4.1 — TopologyAffinityError

Kubernetes documents this exact failure mode and how it appears:

kubectl get pods
kubectl describe pod numa-test
kubectl get events --sort-by=.lastTimestamp

A pod can show STATUS: TopologyAffinityError, and describe / events will mention topology locality allocation failures.

4.2 — Known limitation: scheduler is not topology-aware

Topology Manager can reject a pod after it’s scheduled to a node; Kubernetes docs call this out as a limitation.

4.3 — Policy options and “distance”

Topology Manager is not NUMA-distance-aware unless you enable prefer-closest-numa-nodes.

5 — Operational footgun: stateful managers

Just like CPU Manager (which requires deleting /var/lib/kubelet/cpu_manager_state when changing policies, as discussed in the CPU Pinning guide), Memory Manager is also stateful.

For Memory Manager, Kubernetes troubleshooting docs point to /var/lib/kubelet/memory_manager_state as the internal state dump you can inspect while debugging topology management. If you change the Memory Manager policy, you must drain the node, stop kubelet, remove this state file, and then restart kubelet.
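A sketch of that reset sequence (the node name is a placeholder, and service management assumes systemd; adapt to your environment):

```shell
# 1. Move workloads off the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# 2. On the node: stop kubelet, clear the stale state, restart
systemctl stop kubelet
rm -f /var/lib/kubelet/memory_manager_state   # and cpu_manager_state if that policy changed too
systemctl start kubelet

# 3. Let the node take workloads again
kubectl uncordon <node-name>
```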

Conclusion

If you want real NUMA affinity characteristics in Kubernetes, the most reliable stack is:

  1. CPU Manager static (exclusive CPUs for eligible Guaranteed pods)
  2. Topology Manager single-numa-node + pod scope (strict NUMA admission, pod-wide grouping)
  3. Memory Manager Static (enforce memory locality via cpuset.mems for Guaranteed pods)

And then: verify from inside the container (Cpus_allowed_list, Mems_allowed_list) and watch for TopologyAffinityError when policies can’t be satisfied.

References & Further Reading