Diagnosing VMware ESXi with esxtop: A Practical Troubleshooting Field Guide

VMware tutorial - IT technology blog

Quick Start in 5 Minutes: Accessing and Navigating esxtop

When vCenter is sluggish or a VM appears frozen, sitting around waiting for the web interface charts to load only makes things worse. The fastest approach is to SSH directly into the ESXi host and run:

esxtop

The default view shows CPU metrics. To inspect other resources, use the following keyboard shortcuts:

  • c: View CPU performance.
  • m: Inspect Memory (RAM) details.
  • u: View Disk Device metrics (used to check storage latency).
  • n: Monitor Network traffic.
  • f: Add or hide data columns.
  • V: Show only VM rows, filtering out system background processes.

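Pro tip: Once you’ve customized the columns to your liking, press W to save the configuration. On ESXi 5.0 and later the default file is .esxtop50rc in the user’s home directory (.esxtop4rc was the name used on ESX/ESXi 4.x). The next time you open esxtop, your layout will be restored immediately.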

Analyzing CPU: Why High %USED Isn’t Always a Problem

Many administrators panic when they see the %USED column spike. In virtualized environments, however, %RDY (Ready Time) is the metric that truly deserves your attention.

The %RDY Metric: Measuring Wait Time

This value indicates how long a VM was ready to process but had to wait in queue for the ESXi host to allocate physical CPU resources.

  • Below 5%: The system is running smoothly.
  • 5% – 10%: Contention is beginning; users may notice slight application lag.
  • Above 10%: Critical threshold. This is typically a result of over-provisioning vCPUs on a VM.
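The thresholds above can be sketched as a small helper for triaging readings you copy out of esxtop. This is a minimal sketch, not part of esxtop itself: the function name and the `num_vcpus` parameter are hypothetical. It assumes that when you read the collapsed VM row, esxtop has summed ready time across the VM's worlds, so dividing by the vCPU count gives a rough per-vCPU figure. Leave `num_vcpus` at 1 if the value is already per vCPU.

```python
def classify_ready(rdy_pct: float, num_vcpus: int = 1) -> str:
    """Classify a %RDY reading using the 5% / 10% thresholds from the text.

    Assumption: if rdy_pct comes from the collapsed VM (group) row, it is a
    sum across the VM's worlds, so we divide by num_vcpus to approximate a
    per-vCPU value. Pass num_vcpus=1 for a value that is already per vCPU.
    """
    if num_vcpus < 1:
        raise ValueError("num_vcpus must be at least 1")

    per_vcpu = rdy_pct / num_vcpus

    if per_vcpu < 5.0:
        return "ok"            # below 5%: running smoothly
    if per_vcpu <= 10.0:
        return "contention"    # 5% to 10%: contention is beginning
    return "critical"          # above 10%: critical threshold


if __name__ == "__main__":
    print(classify_ready(7.5))                # per-vCPU reading
    print(classify_ready(20.0, num_vcpus=4))  # group-row reading, 4 vCPUs
```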

I once dealt with a Web Server VM assigned 32 vCPUs that was running extremely slowly. Checking with esxtop revealed %RDY had shot up to 20%. The root cause: the more vCPUs a VM has, the harder it is for the CPU scheduler to find enough free physical cores to keep them all progressing together. Modern ESXi uses relaxed co-scheduling, so it does not need all 32 cores free at the same instant, but a wide VM on a busy host still spends far more time waiting than a narrow one. After reducing the count to 4 vCPUs, performance was immediately smooth. Remember: fewer vCPUs can sometimes mean better performance. For a broader view of resource contention across your environment, the guide to managing CPU, RAM, and Disk I/O in VMware covers strategies to prevent noisy neighbors from impacting other VMs.

The %CSTP (Co-stop) Metric

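%CSTP is the share of time a vCPU was deliberately held back so that slower sibling vCPUs in the same VM could catch up. If it exceeds 3%, the vCPUs within the VM are falling out of sync. This typically happens when too many multi-vCPU VMs are packed onto a host with limited resources.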

Decoding RAM: When Do You Actually Need More Hardware?

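Press m to switch to the Memory view. Look at the MEM overcommit avg line at the top. If that number is greater than 0, the memory configured for powered-on VMs exceeds what the host physically has. That alone is not a fault, but it means the host will have to reclaim memory once VMs actually use what they were given. To see whether that is happening now, check the memory state at the end of the VMKMEM/MB line: high means no pressure, while soft, hard, and low mean the host is actively reclaiming memory.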

Here are the 3 critical metrics to watch closely:

  • MCTLSZ (MB): RAM reclaimed via the Balloon driver. If this is greater than 0, the host is borrowing memory from one VM to keep another alive.
  • SWCUR (MB): RAM currently swapped out to disk. Storage is orders of magnitude slower than RAM, so a non-zero value means the VM has already been squeezed hard. If SWR/s is also above 0, the VM is reading those pages back right now and will experience freezes or severe lag.
  • ZIP/s (MB/s): RAM compression rate. ESXi compresses pages to save space before resorting to swap. If this number stays above 0 for long stretches, it’s time to order more physical RAM.
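The three checks above can be combined into one pass over a VM's readings. The following is a rough sketch under stated assumptions: the `VmMemorySample` class and `memory_findings` function are hypothetical names, and the values are ones you would type in or parse from esxtop yourself. The findings are ordered from mildest to most severe reclamation technique.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmMemorySample:
    """One VM row from the esxtop memory view (hypothetical container)."""
    mctlsz_mb: float      # MCTLSZ: memory reclaimed by the balloon driver
    swcur_mb: float       # SWCUR: memory currently swapped to disk
    zip_mb_per_s: float   # ZIP/s: memory compression rate


def memory_findings(sample: VmMemorySample) -> List[str]:
    """Return the reclamation techniques currently visible for this VM."""
    findings = []
    if sample.mctlsz_mb > 0:
        findings.append("ballooning")
    if sample.zip_mb_per_s > 0:
        findings.append("compression")
    if sample.swcur_mb > 0:
        findings.append("swapped")
    return findings


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(memory_findings(VmMemorySample(512.0, 0.0, 0.2)))
```

An empty list means none of the three counters show reclamation for that VM; a list that ends in "swapped" is the case that most urgently needs attention.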

Storage and Network: Tracking Down I/O Bottlenecks

Measuring Disk Latency

Press u to check Disk Device metrics. This is where you’ll find the root cause of slow database queries. Focus on these 3 columns:

  1. DAVG/cmd: Latency from the hardware side (Storage/SAN). If this exceeds 20ms, inspect the disks, fiber cables, or SAN switch configuration.
  2. KAVG/cmd: Latency introduced by the ESXi kernel. This should normally be close to 0. If it’s elevated (> 1ms), commands are usually queuing inside the host: check QAVG/cmd and the device queue depth first, then look at the RAID/HBA card driver or firmware.
  3. GAVG/cmd: Total end-to-end latency experienced by the VM (DAVG + KAVG).
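Because GAVG is simply DAVG plus KAVG, you can work out which side of the stack is responsible from the two components. Here is a minimal sketch of that decomposition, using the 20ms and 1ms limits from the list above as defaults. The function name and parameters are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def diagnose_latency(
    davg_ms: float,
    kavg_ms: float,
    davg_limit_ms: float = 20.0,
    kavg_limit_ms: float = 1.0,
) -> Tuple[float, List[str]]:
    """Return (GAVG in ms, list of layers exceeding their limit).

    GAVG is computed as DAVG + KAVG, as described in the text.
    """
    gavg_ms = davg_ms + kavg_ms

    suspects = []
    if davg_ms > davg_limit_ms:
        suspects.append("device")   # storage array, cabling, SAN switch
    if kavg_ms > kavg_limit_ms:
        suspects.append("kernel")   # queuing or driver issues on the host
    return gavg_ms, suspects


if __name__ == "__main__":
    # Example readings are made up for illustration.
    print(diagnose_latency(25.0, 0.5))
    print(diagnose_latency(2.0, 4.0))
```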

From experience, if you’re running SSD/NVMe storage and GAVG is still above 10ms, there is almost certainly a serious I/O configuration problem. In some cases the underlying cause traces back to thin-provisioned disks, which have to allocate and zero blocks on first write. Converting VMware thin to thick provisioning can reduce storage latency for write-heavy workloads.

Checking for Network Congestion

Press n to view Network metrics. Instead of focusing on MB/s bandwidth, pay attention to the %DRPRX and %DRPTX columns (Dropped Packets). Any non-zero values here mean packets are being lost in transit. Possible causes include an overloaded physical switch or an outdated VMXNET3 driver inside the VM. If you’re managing multiple hosts and VLANs, a vSphere Distributed Switch (vDS) can simplify network policy enforcement and reduce misconfiguration errors.

Advanced: Running esxtop in Batch Mode

Many system issues only occur in brief bursts at odd hours. To capture these moments, use Batch mode to log data to a CSV file:

esxtop -b -d 5 -n 200 > performance_log.csv

This command records metrics every 5 seconds for 200 iterations, which works out to 1,000 seconds (just under 17 minutes) of data. You can then download the file, open it in Excel or VisualEsxtop, and plot graphs to analyze performance trends over time.
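If you would rather scan the log with a script than open it in a spreadsheet, the sketch below finds the peak value of one counter per object. It rests on an assumption about the file layout: that the first row holds perfmon-style headers such as `\\host\Group Cpu(1001:web01)\% Ready` and that the first column is a timestamp. Check the header row of your own file before relying on it. The function name `peak_by_counter` is hypothetical.

```python
import csv
import io
from typing import Dict


def peak_by_counter(csv_text: str, counter_suffix: str = "\\% Ready") -> Dict[str, float]:
    """Return {column header: peak value} for columns ending in counter_suffix.

    Assumption: headers look like '\\\\host\\Group Cpu(id:name)\\% Ready',
    so matching on the trailing '\\% Ready' selects the ready-time columns.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)

    # Indexes of the columns whose header ends with the requested counter.
    wanted = [i for i, name in enumerate(header) if name.endswith(counter_suffix)]
    peaks = {header[i]: 0.0 for i in wanted}

    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (IndexError, ValueError):
                continue  # skip short rows and non-numeric cells
            peaks[header[i]] = max(peaks[header[i]], value)
    return peaks


if __name__ == "__main__":
    # Tiny synthetic sample; real files have thousands of columns.
    sample = (
        '"(PDH-CSV 4.0) (UTC)(0)","\\\\esx01\\Group Cpu(1001:web01)\\% Ready"\n'
        '"01/01/2024 00:00:00","1.50"\n'
        '"01/01/2024 00:00:05","12.25"\n'
    )
    for column, peak in peak_by_counter(sample).items():
        print(column, peak)
```

To run it against a real capture, read the file into a string first, for example `peak_by_counter(open("performance_log.csv").read())`. Batch files can be large, so for long captures a streaming variant that passes the open file straight to `csv.reader` would use less memory.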

Practical Advice for System Administrators

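  • Don’t trust vCenter blindly: vCenter’s real-time charts sample every 20 seconds, and historical charts are rolled up into averages over 5 minutes or more, so short spikes get smoothed out. esxtop reads counters directly on the host at intervals as short as 2 seconds.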
  • The golden latency rule: Always keep GAVG below 15ms for general workloads and below 5ms for database systems (SQL, Oracle).
  • Adjust the refresh interval: Press s and enter 2 to refresh the display faster (the default is every 5 seconds).

Mastering esxtop means you’ll never have to guess why a system is slow again. It clearly distinguishes between hardware faults and software misconfiguration, giving you the data you need to make precise decisions about upgrades and optimization. For a complete picture of VM health over time, pairing esxtop with a solid VMware VM performance optimization checklist ensures you’re addressing both real-time spikes and long-term resource trends.
