After six months managing a VMware cluster with 8 ESXi hosts, I’ve put together a checklist that most online documentation overlooks. VM running slow, abnormally high CPU usage, disk I/O lag? This post goes straight to debugging and fixing each issue step by step — no lengthy theory.
Quick Start: 3 Things to Do Right Now in 5 Minutes
Before diving deep into configuration, there are 3 things to check immediately when a VM has performance issues.
1. Install VMware Tools (if not already installed)
This is something I run into every time I take over infrastructure from another team. A VM without VMware Tools is like a car without oil: it runs, but it wears faster and performs noticeably worse. Tools provide the memory balloon driver, paravirtual device drivers, guest visibility for the host, and accurate time synchronization.
On Ubuntu/Debian:
# open-vm-tools-desktop is only needed on VMs running a GUI
sudo apt update && sudo apt install open-vm-tools open-vm-tools-desktop -y
sudo systemctl enable open-vm-tools && sudo systemctl start open-vm-tools
On CentOS/RHEL/Rocky Linux:
sudo yum install open-vm-tools -y
sudo systemctl enable vmtoolsd && sudo systemctl start vmtoolsd
# Confirm Tools is installed and check the version
vmware-toolbox-cmd -v
# Confirm the service is actually running
systemctl is-active vmtoolsd
2. Check Whether the Balloon Driver Is Inflating
VMware uses the balloon driver to reclaim RAM from VMs when the host is running low on memory. If the balloon is inflating, the VM is losing RAM without any visibility from inside the guest OS — this is the number one cause of slowdowns that are hard to diagnose from within the guest.
Check in vSphere Client: VM → Monitor → Memory → Balloon. A Balloon value > 0 MB means the host is short on RAM and is reclaiming memory from your VM.
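If you have shell access to the guest, open-vm-tools can report the balloon size directly without opening vSphere Client; a quick sketch, assuming Tools is installed and vmtoolsd is running:

```shell
# Ask the Tools daemon how much memory the balloon driver has claimed
vmware-toolbox-cmd stat balloon
# "0 MB" means the host is not reclaiming memory from this VM;
# anything above 0 confirms ballooning is in progress
```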
3. Switch to PVSCSI and VMXNET3
In VM Settings, change:
- Hard Disk adapter: from LSI Logic SCSI → VMware Paravirtual (PVSCSI)
- Network Adapter: from E1000 → VMXNET3
Concrete result: CPU usage on one of our database VMs dropped from 15% to 8% just by switching the disk adapter. PVSCSI and VMXNET3 are paravirtual drivers — VMware handles I/O directly without emulating physical hardware, so overhead is dramatically lower compared to LSI or E1000.
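To confirm the guest actually picked up the paravirtual devices after the change, you can check from inside a Linux guest; a sketch (the interface name eth0 is an assumption; substitute yours):

```shell
# PVSCSI and VMXNET3 show up as distinct VMware PCI devices
lspci | grep -i -E 'pvscsi|vmxnet'
# The NIC driver should report vmxnet3 (eth0 is an assumed interface name)
ethtool -i eth0 | grep '^driver'
```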
Deep Dive: CPU and RAM
Adding vCPUs Can Sometimes Make a VM Slower
This sounds counter-intuitive, but it’s something I’ve had to explain to the team more than a few times. VMware co-schedules vCPUs: modern ESXi uses relaxed co-scheduling, so it no longer needs every vCPU to land on an idle physical core at the exact same instant, but vCPUs that drift too far apart are still stopped and made to wait so the guest stays consistent. Assign 8 vCPUs to an app that only uses 2 threads and the hypervisor still has to keep all 8 roughly in sync, so CPU scheduling delay spikes.
The metric to watch is CPU Ready Time — how long the VM has to wait before it gets scheduled to run:
# SSH into the ESXi host and run esxtop
esxtop
# Press 'c' to enter CPU view
# %READY column:
# < 5% = good
# 5-10% = acceptable, needs monitoring
# > 10% = problem — reduce vCPUs or migrate the VM
# Export batch data for analysis:
esxtop -b -d 5 -n 12 > cpu_report.csv
The configuration I apply across the 8-host cluster:
- Web/App servers: 2–4 vCPUs
- Database servers: 4–8 vCPUs, but CPU Ready must be monitored closely
- Build servers/CI: can go higher — batch workloads, low latency not required
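Before sizing vCPUs, it’s worth checking how much parallelism the workload actually has; a minimal sketch from inside any Linux guest (the current shell is used as a stand-in process):

```shell
# vCPUs visible to the guest
nproc
# Thread count of a process; here the current shell as a stand-in.
# Point -p at your application's PID to see its real thread count.
ps -o nlwp= -p $$
```

If the thread count is consistently far below the vCPU count, the VM is a candidate for right-sizing downward.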
RAM — Letting a VM Swap Destroys Performance
A rule I remind the team during every infrastructure review: VMware swap is hundreds of times slower than physical RAM. When the host runs out of RAM, VMware swaps VM pages to disk — performance collapses immediately, and no amount of configuration tuning can fix it.
# Check whether the guest OS itself is swapping (from inside the Linux guest)
vmstat 1 5
# 'si' (swap in) and 'so' (swap out) columns > 0 are bad signs
# Or get a quick overview
free -h
grep -E 'SwapTotal|SwapFree|SwapCached' /proc/meminfo
# Note: hypervisor-level swap is invisible from inside the guest;
# it only shows up in esxtop or the vSphere "Swapped" metric on the host
When the host is short on RAM, address it in order: reduce RAM reservations on less critical VMs first → vMotion VMs to hosts with available memory → purchase additional physical RAM. Follow that order — don’t jump straight to step 3.
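The host-side counterpart to the guest checks above is esxtop’s memory view; annotated in the same style as the CPU view earlier (the “should be” values are rules of thumb from our cluster, not official limits):

```shell
# SSH into the ESXi host and run esxtop
esxtop
# Press 'm' to enter memory view
# MCTLSZ (MB): current balloon size per VM; > 0 means ballooning
# SWCUR (MB): memory the hypervisor has swapped to disk; should be 0
# SWR/s, SWW/s: active swap reads/writes; any sustained value is an emergency
```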
Advanced: Disk I/O and Snapshots
Noisy Neighbor — The Real Culprit Behind Slow Disks
The number one bottleneck I encounter in production isn’t CPU or RAM — it’s disk I/O contention between VMs sharing the same datastore. One VM running a backup or rebuilding a database index at 2 AM can drag down a dozen other VMs on the same LUN. Users notice the web app is slow in the early morning, but nobody can figure out why.
# Check disk latency on ESXi (esxtop)
esxtop
# Press 'd' to enter disk/storage view
# DAVG column (device average latency in ms):
# < 5ms = good
# 5-20ms = acceptable
# > 20ms = serious problem, needs investigation
# KAVG: time spent in the VMkernel storage stack; high values usually
#       mean queuing in the hypervisor (check queue depths), not slow disks
# GAVG = DAVG + KAVG: total latency from the VM's perspective
For database or I/O-intensive VMs, there are three things I always apply:
- Place on a dedicated datastore, not shared with other VMs
- Use Thick Provision Eager Zeroed instead of Thin Provision — avoids zero-fill overhead on first writes
- Enable Storage I/O Control (SIOC) at the datastore level so VMware can automatically balance I/O across VMs
Long Snapshot Chains — The Slowest Performance Killer
I’ve inherited more than a few systems with snapshot chains 15–20 deep. Disk I/O is terrible because VMware has to walk through multiple layers of delta files to service I/O. The worst case I’ve seen: DAVG hitting 80ms just because of an 18-snapshot chain that had been accumulating for 8 months.
# Check snapshots via PowerCLI
Connect-VIServer -Server vcenter.company.local
Get-VM "VMName" | Get-Snapshot | Select Name, Created, SizeGB | Format-Table
# List all VMs with snapshots in the cluster
Get-VM | Get-Snapshot | Select VM, Name, Created, SizeGB | Sort-Object SizeGB -Descending
Snapshots should only be created before patching or upgrades, and deleted immediately after verifying everything is stable. Using snapshots as long-term backups is the wrong use case — that’s what Veeam or Nakivo are for.
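If you’re on the ESXi host shell rather than PowerCLI, vim-cmd can do the same audit and cleanup; a sketch (VM IDs come from the first command, and 42 below is a placeholder):

```shell
# List all registered VMs with their IDs
vim-cmd vmsvc/getallvms
# Show the snapshot tree for one VM (vmid 42 is a placeholder)
vim-cmd vmsvc/snapshot.get 42
# Consolidate and delete the entire chain, after verifying stability
vim-cmd vmsvc/snapshot.removeall 42
```

Note that removeall consolidates every delta file into the base disk, which generates heavy I/O itself; schedule it outside business hours.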
Practical Tips: Monitoring and CPU/RAM Reservations
When to Use Reservations
VMware allows you to set a Reservation to guarantee minimum resources for a VM. I only set reservations for production database VMs and real-time workloads — where resource contention is immediately visible to end users.
Don’t set reservations indiscriminately. They lock up resources, make it harder for DRS (Distributed Resource Scheduler) to balance load across hosts, and visibly reduce the cluster’s total usable capacity.
Configuration by Workload Type
After six months of operations, I categorize VMs into three groups:
- Web/App servers: VMXNET3 + PVSCSI + Thin Provision + 2–4 vCPUs, no Reservation needed
- Database servers: PVSCSI + Thick Eager Zeroed + RAM Reservation + continuous balloon and swap monitoring
- Build servers/CI: high vCPU count is fine, Thin Provision is fine, no Reservation needed — can tolerate higher latency
Applying these consistently across the 8-host ESXi cluster brought average CPU ready time down from 8% to under 2%, with disk latency stable below 5ms. No magic involved — just picking the right drivers, keeping snapshot chains short, and looking at the right metrics. If you’re dealing with a specific VMware issue, drop it in the comments — I’m happy to share what I’ve actually seen in the field.

