CPU Pinning and NUMA Topology on Proxmox VE: Optimizing Virtual Machine Performance for Low Latency – ITFROMZERO

The Problem: VM Running Slowly Despite Available CPU

I first noticed this while running a VM dedicated to real-time data processing — host CPU usage was sitting around 30%, yet latency kept spiking unpredictably in short bursts before settling back down. Nothing suspicious in the logs, RAM was fine, disk I/O was normal. It took a while to find the culprit: the hypervisor scheduler was constantly migrating the VM’s vCPUs across different physical CPU cores, causing cache misses and NUMA penalties.

I run a homelab with Proxmox VE managing 12 VMs and containers — it’s my playground for testing everything before pushing to production. That environment taught me something important: for latency-sensitive workloads — databases, game servers, real-time audio/video processing — letting the Linux kernel decide which core each vCPU runs on just isn’t good enough.

The solution lies in combining CPU Pinning with NUMA-aware topology. It’s not complicated to configure — but misunderstanding the fundamentals makes it easy to apply in the wrong way.

Core Concepts

What Is CPU Pinning?

When you create a VM with 4 vCPUs, Proxmox (really QEMU + KVM) creates 4 corresponding host threads. By default, the host’s Linux kernel schedules these threads on whichever physical cores happen to be free. Every time the scheduler migrates a thread to a different core, the data it left in the old core’s private L1/L2 caches is no longer close at hand, and the new core has to pull the working set in again. For typical workloads, this overhead is negligible. For latency-sensitive workloads, those extra cache misses add up to a real problem, especially when migrations happen frequently.

CPU Pinning (also called CPU affinity) means hard-binding each vCPU in a VM to a fixed subset of physical CPU cores. That vCPU’s thread then runs only on the designated cores instead of wandering across the host, so the cache stays warm and latency stays stable.

What Is NUMA and Why Does It Matter?

Multi-socket servers, and some multi-chiplet CPUs such as AMD EPYC and Ryzen Threadripper (depending on generation and BIOS settings), don’t share memory uniformly. Each socket has RAM directly attached to it, called local memory, which is the fastest to reach. In contrast, when a core on socket 0 needs to read data from socket 1’s RAM, it has to traverse the inter-socket interconnect (Intel QPI/UPI or AMD Infinity Fabric). This is remote memory, and its latency is noticeably higher than local access: figures of roughly 1.5–2x are commonly cited, and it can be more on some platforms.

This architecture is called NUMA — Non-Uniform Memory Access. The problem arises from misconfiguration: if you pin vCPUs to cores on node 0 but RAM gets allocated from node 1, you still pay the NUMA penalty on every memory access. A half-measure can sometimes be worse than doing nothing — because you think the problem is solved while the bottleneck is still there.

A correct configuration requires synchronizing both: pin vCPUs to cores within a single NUMA node, and ensure the VM’s RAM is allocated from that same NUMA node. This is just one layer of optimizing KVM/Proxmox resources — but it’s the one that matters most for latency-sensitive workloads.

Step-by-Step Walkthrough

Step 1: Inspect the Host CPU Topology

Before pinning anything, you need to know how many NUMA nodes the host has and which cores belong to which node.

# View overall NUMA topology
numactl --hardware

# Or use lscpu for more detailed information
lscpu | grep -E 'NUMA|CPU\(s\)|Thread|Core|Socket'

# List CPUs belonging to each NUMA node
cat /sys/devices/system/node/node0/cpulist
cat /sys/devices/system/node/node1/cpulist

Example output from numactl --hardware on a 2-socket server with 8 cores per socket (16 threads per socket with HT, 32 logical CPUs in total):

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 64432 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 64432 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

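The number 21 is the NUMA distance reported by the firmware, relative to 10 for local access: reaching RAM on the other node is rated at roughly 2.1x the cost of local RAM. Treat it as a relative hint rather than a measured latency. Keep in mind: if a VM uses cores 0–7, its RAM must come from node 0 to avoid the penalty.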
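To avoid eyeballing those lists, here is a small Bash sketch that expands a cpulist such as 0-7,16-23 into individual core IDs and checks whether a planned affinity stays inside one node. The function names expand_cpulist and cpus_within are my own, not part of any tool, and the sketch assumes the plain comma-and-range format that /sys/devices/system/node/nodeN/cpulist prints.

```shell
# expand_cpulist "0-3,8-11" prints "0 1 2 3 8 9 10 11"
expand_cpulist() {
    local list=$1 part start end cpu out=()
    local IFS=,
    for part in $list; do
        if [[ $part == *-* ]]; then
            start=${part%-*}   # text before the dash
            end=${part#*-}     # text after the dash
        else
            start=$part
            end=$part
        fi
        for ((cpu = start; cpu <= end; cpu++)); do
            out+=("$cpu")
        done
    done
    IFS=' '                    # join the result with spaces
    echo "${out[*]}"
}

# cpus_within WANTED ALLOWED succeeds only if every CPU in WANTED
# also appears in ALLOWED (both given as cpulists).
cpus_within() {
    local wanted allowed cpu
    wanted=$(expand_cpulist "$1")
    allowed=" $(expand_cpulist "$2") "
    for cpu in $wanted; do
        [[ $allowed == *" $cpu "* ]] || return 1
    done
    return 0
}
```

For example, cpus_within 0-3 "$(cat /sys/devices/system/node/node0/cpulist)" && echo "all on node 0" confirms that the cores you plan to pin to all sit on node 0.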

Step 2: Configure CPU Pinning in Proxmox

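Since Proxmox VE 7.3, a VM has an affinity option that restricts its QEMU process to a set of host cores. Recent versions expose it in the web UI (the CPU Affinity field in the advanced section of the VM’s Processors dialog), and you can also set it with qm set 100 --affinity 0-3 or by editing the VM configuration file directly. If you’re not yet familiar with creating and managing virtual machines with Proxmox VE, it’s worth reviewing that first:

# Configuration file for VM with ID 100
nano /etc/pve/qemu-server/100.conf

Set the following options (example: VM with 4 vCPUs, restricted to cores 0–3 on NUMA node 0). The # lines are annotations for this article; Proxmox stores # lines in this file as the VM’s notes, so leave them out when editing:

# Number of vCPUs
cores: 4
sockets: 1

# CPU type: host passes through all physical CPU features
cpu: host

# Restrict all of the VM's threads to host cores 0-3
affinity: 0-3

The value is a list of host core IDs and ranges, for example 0,5,8-11. Note that affinity applies to the QEMU process as a whole: the vCPU threads stay within cores 0–3 but can still move among them. Proxmox has no built-in per-vCPU keys such as affinity0 or affinity1; strict one-to-one pinning means setting the affinity of each vCPU thread yourself, for example with taskset after the VM starts.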

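After saving, stop and start the VM to apply the changes (a reboot from inside the guest does not restart the QEMU process). Verify with:

# Get the PID of the QEMU process for VM 100
cat /var/run/qemu-server/100.pid

# Check the current affinity of every QEMU thread (-a = all threads)
taskset -acp <PID>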

Step 3: Configure NUMA Topology for the VM

CPU pinning alone gets you halfway there. The next step is declaring the NUMA topology — so the VM is aware of its NUMA environment and QEMU/KVM is forced to allocate RAM from the correct node.

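Edit the VM configuration file again. The memory values of all numaN entries must add up to the VM’s total memory setting, so this example assumes memory: 8192: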

# Enable NUMA emulation
numa: 1

# Declare NUMA node 0 for the VM:
# cpus = which vCPUs belong to this node (0-3)
# memory = amount of RAM (MB) allocated to this node
# hostnodes = which physical NUMA node supplies the RAM
# policy = memory allocation policy
numa0: cpus=0-3,memory=8192,hostnodes=0,policy=bind

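policy=bind is the critical parameter: it forces the host kernel to allocate the VM’s RAM exclusively from hostnodes=0, even under memory pressure. Borrowing from another node is not permitted, so make sure node 0 actually has that much free memory; otherwise allocations can fail instead of spilling over to node 1.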
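One easy mistake with numaN entries is a memory total that does not match the VM’s memory setting, which Proxmox rejects when the VM starts. As a rough pre-flight check, the sketch below sums the memory= values and compares them with the memory: line. The function name check_numa_memory is hypothetical, and it assumes the plain memory: <MB> form and reads only the main section of the file (it stops at the first [snapshot] header).

```shell
# Reads a VM config on stdin, prints both totals, and returns 0
# only when the numaN memory values add up to the memory: setting.
check_numa_memory() {
    awk '
        /^\[/ { exit }                       # stop at snapshot sections
        /^memory:/ { total = $2 + 0 }
        /^numa[0-9]+:/ {
            n = split($2, opts, ",")
            for (i = 1; i <= n; i++) {
                if (opts[i] ~ /^memory=/) {
                    sub(/^memory=/, "", opts[i])
                    sum += opts[i]
                }
            }
        }
        END {
            printf "memory=%d numa_sum=%d\n", total, sum
            exit (total == sum ? 0 : 1)
        }
    '
}
```

On the host this would be run as check_numa_memory < /etc/pve/qemu-server/100.conf; a non-zero exit status means the numbers disagree.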

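If the VM has more vCPUs and needs to span NUMA node 1 (for example, an 8-vCPU VM on a 2-socket server), note that cores is counted per socket, so 8 vCPUs across two virtual sockets means cores: 4:

cores: 4
sockets: 2
memory: 16384
numa: 1
affinity: 0-3,8-11
numa0: cpus=0-3,memory=8192,hostnodes=0,policy=bind
numa1: cpus=4-7,memory=8192,hostnodes=1,policy=bind

Note: affinity: 0-3,8-11 allows the VM’s threads to run on physical cores 0–3 (node 0) and 8–11 (node 1), while numa0 and numa1 give guest vCPUs 0–3 memory from host node 0 and guest vCPUs 4–7 memory from host node 1. The affinity option does not by itself tie vCPUs 0–3 to cores 0–3 and vCPUs 4–7 to cores 8–11; the host scheduler can still place any vCPU thread on any of the eight allowed cores. Keeping each vCPU on its own node requires pinning the vCPU threads individually. Cross-reference with the numactl --hardware output from step 1 to map the core IDs correctly.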
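Because affinity covers the whole QEMU process, giving each vCPU its own core means pinning the vCPU threads one by one after the VM has started. The following Bash sketch shows one way to do that. It assumes the vCPU threads are named CPU 0/KVM, CPU 1/KVM, and so on (QEMU names them this way when thread naming is enabled, which I believe Proxmox does, but check /proc/<PID>/task/*/comm on your host first). The function names are hypothetical, and pin_vcpus has to run as root on the host.

```shell
# vcpu_index_from_comm "CPU 3/KVM" prints 3.
# Returns 1 for any thread name that is not a vCPU thread.
vcpu_index_from_comm() {
    local comm=$1
    local re='^CPU ([0-9]+)/KVM$'
    if [[ $comm =~ $re ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

# pin_vcpus VMID CORE...  pins vCPU n to the n-th core given.
# vCPUs without a matching core argument are left untouched.
pin_vcpus() {
    local vmid=$1
    shift
    local cores=("$@") pid task idx
    pid=$(cat "/var/run/qemu-server/${vmid}.pid") || return 1
    for task in /proc/"$pid"/task/*; do
        idx=$(vcpu_index_from_comm "$(cat "$task/comm")") || continue
        [[ -n ${cores[idx]:-} ]] || continue
        # ${task##*/} is the thread ID (last path component)
        taskset -cp "${cores[idx]}" "${task##*/}"
    done
}
```

Calling pin_vcpus 100 0 1 2 3 8 9 10 11 would put vCPU 0 on core 0 through vCPU 7 on core 11, matching the two-node example. The pinning is lost whenever the VM is stopped and started, so it needs to be reapplied, for instance from a hookscript’s post-start phase.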

Step 4: Isolate CPU Cores from the Host Scheduler (Advanced)

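Even with pinning in place, if the host OS is still running other processes on those same cores, interruptions can still occur. For extreme low-latency requirements, you can fully isolate the cores reserved for the VM. Two caveats: the kernel does not load-balance tasks across isolated cores, so a VM whose affinity spans several isolated cores can end up with its threads stacked on one of them unless each vCPU thread is pinned to its own core; and many guides recommend leaving core 0 to the host, so on a real system you may prefer to shift the example range.

# Add to GRUB_CMDLINE_LINUX in /etc/default/grub
# Isolate cores 0-3 from the host Linux scheduler
isolcpus=0-3 nohz_full=0-3 rcu_nocbs=0-3

# Apply changes
update-grub
reboot

On hosts that boot via systemd-boot (for example ZFS-on-root with UEFI), the kernel command line lives in /etc/kernel/cmdline and is applied with proxmox-boot-tool refresh instead of update-grub.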

After rebooting, verify:

cat /sys/devices/system/cpu/isolated

It should return 0-3. Those cores are now almost exclusively dedicated to the pinned VM.
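The isolated file only reflects isolcpus. To confirm that nohz_full and rcu_nocbs also made it onto the running kernel’s command line, a small helper like the one below can pull a single parameter out of /proc/cmdline. The name kernel_param is my own; it simply returns the value of the first name=value word it finds.

```shell
# kernel_param NAME CMDLINE prints the value of NAME=value
# and returns 1 if the parameter is not present.
kernel_param() {
    local name=$1 cmdline=$2 word words
    read -ra words <<< "$cmdline"      # split on whitespace, no globbing
    for word in "${words[@]}"; do
        if [[ $word == "$name="* ]]; then
            echo "${word#*=}"
            return 0
        fi
    done
    return 1
}
```

For example, running kernel_param nohz_full "$(cat /proc/cmdline)" on the host should print 0-3 after the reboot.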

Step 5: Verify the Results

Inside the VM, run a simple latency benchmark:

# Install cyclictest (part of the rt-tests package)
apt install rt-tests

# Run latency test for 60 seconds
cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -D 60

Compare results before and after configuring pinning, ideally while the host is under load. With a correct setup, the Max latency value typically drops significantly, for example from several milliseconds down to a few hundred microseconds; the exact numbers depend on your hardware and workload.
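With --smp, cyclictest prints one line per thread, so the number to compare between runs is the largest Max value across all of them. The sketch below extracts that value from saved output. The function name worst_latency_us is hypothetical, and it assumes the usual per-thread summary lines that contain a Max: field in microseconds.

```shell
# Reads cyclictest output on stdin and prints the largest Max: value.
# Returns 1 if no line contains a Max: field.
worst_latency_us() {
    awk '
        match($0, /Max:[ ]*[0-9]+/) {
            # skip the 4 characters of "Max:" and convert the rest to a number
            value = substr($0, RSTART + 4, RLENGTH - 4) + 0
            if (value > worst) worst = value
            seen = 1
        }
        END { if (seen) print worst; else exit 1 }
    '
}
```

One way to use it is to add -q to the cyclictest command so only the final summary is printed, save the output of the "before" and "after" runs to separate files, and run worst_latency_us < before.txt and worst_latency_us < after.txt.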

Things to Keep in Mind Before Applying This

Don’t pin every VM: Pinning reduces scheduler flexibility. Only apply it to VMs that genuinely need low latency — let the scheduler handle the rest freely.
Budget your cores: On a 16-core host, if you pin 8 cores to 2 VMs, only 8 cores remain for the host OS and other VMs. Don’t over-commit.
Hyperthreading: For compute-heavy workloads, consider disabling HT or using only physical cores (not sibling threads) to avoid resource contention within the same physical core.
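Live migration is affected: affinity and hostnodes refer to core and node IDs on the current host, and cpu: host exposes that host’s exact CPU model, so a pinned VM may fail to migrate, or end up on the wrong cores, when the target host has a different topology or CPU. Adjust or remove the pinning before migrating.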

Conclusion

CPU pinning and NUMA topology aren’t something you need to configure for every VM. But when a workload genuinely demands consistent latency, ignoring these two techniques means leaving hardware performance on the table. I overlooked this for a long time, assuming it was only relevant for large data centers — until my own homelab proved otherwise.

Start with numactl --hardware, pick one VM to test, and pin it incrementally. Watch how latency changes — numbers from your own server, with your real workload, will be far more convincing than any synthetic benchmark.