UK’s trusted IT infrastructure partner since 2003
Servnet

NUMA explained: why server topology decides your real performance (UK 2026)

Servnet Editorial · Server Infrastructure Practice · 11 min read

Two servers with identical core counts, identical memory and identical clock speeds can post very different real-world numbers, and the usual reason is NUMA. Non-Uniform Memory Access is one of the most common reasons a host looks fast on the spec sheet but disappoints under a database or a busy virtual machine. It is not a setting you switch on; it is the physical reality of how a multi-socket server is wired, and workloads that ignore it pay a quiet latency tax on every remote memory access. This guide explains what NUMA is, how to size virtual machines and applications to stay local, and where sub-NUMA clustering helps or hurts.

Figure: Local vs remote memory access across sockets. CPU 0 (NUMA node 0) and CPU 1 (NUMA node 1) each have their own local DIMMs (fast); access to the other socket's DIMMs is remote.

What NUMA actually is

In a modern server each CPU has its own set of memory channels and the DIMMs attached to them. A processor reaches the memory on its own channels quickly, and the memory attached to another processor more slowly, because that access has to traverse the interconnect between the sockets. That asymmetry is what non-uniform memory access means: not all memory is equally far away. Each socket plus its local memory is a NUMA node, and the gap between a local and a remote access is the cost you are trying to avoid.

This is not exotic. Any dual-socket server is a two-node NUMA machine by definition, and even some single-socket processors present multiple NUMA domains internally. The point is that the topology is fixed by the hardware, so performance depends on whether the software keeps its threads and the memory they touch on the same node. Choosing the CPU and socket count with this in mind is part of a good build, covered in our server processors guidance.
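To see the topology a host actually presents, a minimal sketch along these lines can help. It assumes a Linux host that exposes NUMA nodes under `/sys/devices/system/node`; the function names `parse_cpulist` and `read_numa_nodes` are illustrative, not part of any library.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a Linux cpulist string such as '0-3,8-11' into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_nodes(root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map each NUMA node id to its CPUs; returns {} if the path is absent."""
    nodes: dict[int, list[int]] = {}
    base = Path(root)
    if not base.is_dir():
        return nodes
    for entry in sorted(base.glob("node[0-9]*")):
        cpulist = entry / "cpulist"
        if cpulist.is_file():
            # Directory names look like 'node0', 'node1', ...
            nodes[int(entry.name[4:])] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    for node, cpus in read_numa_nodes().items():
        print(f"node {node}: {len(cpus)} CPUs -> {cpus}")
```

On a dual-socket host this would normally list two nodes. More than one node on a single-socket machine suggests the processor is presenting multiple internal NUMA domains.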

The remote-access tax, and why it is invisible

A remote memory access is not catastrophically slow, which is exactly why NUMA problems are so hard to spot. It has meaningfully higher latency and lower effective bandwidth than a local access, and on a latency-sensitive workload like an OLTP database those small penalties, repeated billions of times, add up to a throughput hit you cannot explain from the spec sheet. Nothing errors and nothing alerts; the host is simply slower than its components suggest.
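The arithmetic behind that tax is a simple weighted average. The sketch below is illustrative only: the 100 ns and 150 ns figures are assumed placeholders, not measurements of any particular platform, and the function names are hypothetical.

```python
def effective_latency_ns(local_ns: float, remote_ns: float, remote_fraction: float) -> float:
    """Average memory latency when a fraction of accesses go to a remote node."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


def slowdown_factor(local_ns: float, remote_ns: float, remote_fraction: float) -> float:
    """Ratio of the blended latency to the all-local latency."""
    return effective_latency_ns(local_ns, remote_ns, remote_fraction) / local_ns


if __name__ == "__main__":
    # Assumed example values: 100 ns local, 150 ns remote, half of accesses remote.
    print(effective_latency_ns(100.0, 150.0, 0.5))
    print(slowdown_factor(100.0, 150.0, 0.5))
```

With those assumed inputs, a workload that sends half of its accesses across the socket link sees every memory access cost a quarter more on average, with no single access looking alarming.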

The fix is locality. The operating system and hypervisor both try to keep a process and its memory on the same node, but they can only do so if the workload fits. A virtual machine sized larger than a single NUMA node, or a database left to sprawl across both sockets, forces remote access no matter how clever the scheduler is. Right-sizing is therefore the lever, and it interacts directly with how you populate memory, which our memory guidance covers.

  • Each socket plus its directly attached DIMMs is a NUMA node; remote access crosses the inter-socket link
  • Remote access is higher-latency and lower-bandwidth, not an error - so it hides on the spec sheet
  • OLTP databases and latency-sensitive VMs suffer most; throughput-bound batch jobs tolerate it better
  • Locality is the goal: keep threads and the memory they touch on the same node
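On Linux, one common way to enforce locality for a single process is the `numactl` utility. The sketch below only builds the command line, using the `--cpunodebind` and `--membind` options; the helper name and the `my-database` command are hypothetical placeholders.

```python
import shlex


def numactl_command(node: int, command: list[str]) -> list[str]:
    """Build an argv that runs `command` with CPUs and memory bound to one node."""
    if node < 0:
        raise ValueError("node must be non-negative")
    if not command:
        raise ValueError("command must not be empty")
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *command]


if __name__ == "__main__":
    # 'my-database' is a placeholder, not a real program.
    argv = numactl_command(0, ["my-database", "--config", "/etc/my-database.conf"])
    print(shlex.join(argv))
```

Note that `--membind` is strict: allocations fail rather than spill to another node when the bound node is full. Where that is too rigid, `--preferred` is the softer alternative, and either choice is worth validating against the real workload.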

Sizing virtual machines to stay NUMA-local

The practical rule for virtualisation is to size a virtual machine so it fits inside one NUMA node wherever you can. A VM whose vCPU count and assigned memory both sit within the cores and capacity of a single socket lets the hypervisor place it cleanly and keep its memory local. A VM that spans two nodes becomes a wide virtual NUMA machine, and unless the guest operating system and application are themselves NUMA-aware, you inherit remote-access latency inside the guest.

When a workload genuinely needs more than one node, the answer is to expose the topology, not hide it. A correctly presented virtual NUMA layout lets a NUMA-aware database inside the guest make its own local-placement decisions. The mistake is a large flat VM that the guest believes is uniform when the hardware underneath is not. This is why we size hosts backwards from the VMs they run, the same discipline described in our guide on how to spec a server in 2026.

Size a VM for NUMA locality
Does the VM fit inside one NUMA node?
  • Fits one node: keep it local - best latency
  • Needs more: expose real virtual NUMA
  • Flat + large: remote access tax - avoid
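The sizing rule in this section can be sketched as a small check. This is a simplified model under stated assumptions: it ignores hypervisor overhead and memory reservations, treats all nodes as identical, and the class names, function name and labels are illustrative rather than any hypervisor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumaNode:
    cores: int
    memory_gib: int


@dataclass(frozen=True)
class VmRequest:
    vcpus: int
    memory_gib: int
    exposes_vnuma: bool = False  # True if the guest sees the real node layout


def placement_advice(vm: VmRequest, node: NumaNode, node_count: int) -> str:
    """Classify a VM against a host of `node_count` identical NUMA nodes."""
    if vm.vcpus <= node.cores and vm.memory_gib <= node.memory_gib:
        return "local"  # fits one node: best latency
    fits_host = (
        vm.vcpus <= node.cores * node_count
        and vm.memory_gib <= node.memory_gib * node_count
    )
    if not fits_host:
        return "too-large"  # does not fit this host at all
    # Spans nodes: acceptable only if the guest can see the topology.
    return "expose-vnuma" if vm.exposes_vnuma else "flat-remote-tax"


if __name__ == "__main__":
    node = NumaNode(cores=32, memory_gib=512)  # assumed example host
    print(placement_advice(VmRequest(16, 128), node, node_count=2))
    print(placement_advice(VmRequest(48, 256), node, node_count=2))
```

Both vCPU count and memory have to fit: a VM with few vCPUs but more memory than one node holds still ends up spanning nodes.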

Sub-NUMA clustering: a sharper knife

High-core-count processors add a further wrinkle. A single physical socket can be configured to present itself as several smaller NUMA domains, a feature Intel calls sub-NUMA clustering (SNC) and AMD exposes through its NUMA nodes per socket (NPS) setting. Splitting a socket this way shortens the path from a core to its nearest memory controllers and slice of last-level cache, which can lift performance for workloads that are tightly pinned and latency-sensitive.

It is not a free win. Sub-NUMA clustering only helps when the software is NUMA-aware enough to exploit the finer-grained domains; applied to a workload that sprawls, it can make placement worse and add remote hops within a single socket. Treat it as a tuning option for known, pinned workloads, validated on a representative node, rather than a default. Getting the platform and these firmware options right from the start is part of our server configuration service.
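A rough way to see why a sprawling workload can suffer is to work out how much smaller each domain becomes. The sketch below assumes cores and memory divide evenly across the domains of a socket, which is a simplification; the function names and example figures are hypothetical.

```python
import math


def per_domain_capacity(socket_cores: int, socket_memory_gib: float,
                        domains_per_socket: int) -> tuple[int, float]:
    """Cores and memory per NUMA domain, assuming an even split of the socket."""
    if domains_per_socket < 1 or socket_cores % domains_per_socket != 0:
        raise ValueError("domains_per_socket must evenly divide socket_cores")
    return socket_cores // domains_per_socket, socket_memory_gib / domains_per_socket


def domains_needed(vcpus: int, memory_gib: float,
                   cores_per_domain: int, memory_per_domain_gib: float) -> int:
    """Smallest number of domains that can hold the workload's CPUs and memory."""
    return max(
        math.ceil(vcpus / cores_per_domain),
        math.ceil(memory_gib / memory_per_domain_gib),
    )


if __name__ == "__main__":
    # Assumed example: a 32-core, 512 GiB socket and a 24-vCPU, 128 GiB workload.
    for split in (1, 2):
        cores, mem = per_domain_capacity(32, 512, split)
        print(split, cores, mem, domains_needed(24, 128, cores, mem))
```

In that assumed example the workload fits the whole socket as one domain, but spans two domains once the socket is split in half, which is the case where the finer-grained layout only pays off if the software is pinned and NUMA-aware.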

Designing for topology from the start

NUMA is a design input, not an afterthought. Pick the socket count deliberately, populate memory so every node is balanced, size virtual machines to fit a node where you can, expose real virtual NUMA when they cannot, and only reach for sub-NUMA clustering on workloads that will actually use it. Done together, those choices close the gap between a server's spec sheet and its real performance. Choose the processors with our processor guidance, plan the memory with our RAM guidance, and build the host in our configuration service.

Key takeaways
  • NUMA is the fixed hardware reality that local memory is faster than memory attached to another socket.
  • Remote access is a silent latency tax - no error, no alert - that hits OLTP and latency-sensitive VMs hardest.
  • Size virtual machines to fit one NUMA node where possible; expose real virtual NUMA when they must span nodes.
  • Sub-NUMA clustering helps tightly pinned, NUMA-aware workloads and can hurt ones that sprawl - validate first.
  • Treat topology as a design input: choose socket count, balance memory and right-size guests deliberately.
FAQs: NUMA explained

NUMA basics

What does NUMA mean in plain terms?

Non-Uniform Memory Access means a CPU reaches its own attached memory faster than memory attached to another socket. Each socket plus its DIMMs is a NUMA node, and crossing between them costs latency. It is fixed by the hardware, so software has to stay local to benefit. See our processor guidance.

Why is my dual-socket server slower than expected?

Often because a workload is accessing remote memory across the inter-socket link. The penalty is small per access but repeated constantly, so it never errors and is easy to miss. Sizing VMs to fit a single NUMA node usually recovers the lost performance. More in our spec guide.

Tuning

Should I enable sub-NUMA clustering?

Only for tightly pinned, NUMA-aware workloads that will use the finer-grained domains, and only after validating on a representative node. Applied to a workload that sprawls across cores it can add remote hops within one socket and hurt performance. We set these options in our configuration service.
