Configuring VMware vSphere Fault Tolerance (FT): Achieving Zero Downtime for Mission-Critical Systems

VMware tutorial - IT technology blog

When HA Isn’t Enough: A 2 AM Story

At 2 AM, my phone wouldn’t stop ringing. A client’s ERP system was down. Checking the logs, I saw that a Host in the Cluster had hit a RAM error and rebooted unexpectedly. vSphere High Availability (HA) worked as designed: it automatically restarted the affected virtual machines (VMs) on another healthy Host.

However, the problem was the 180-second wait. Three minutes for Windows to boot, SQL Server to start its services, and applications to reconnect is far too long. In those three minutes, hundreds of transactions were stuck, and the CEO had already called to “check in.” For mission-critical systems, HA still leaves a gap. That’s why you need Fault Tolerance (FT): failover without a reboot, and effectively zero downtime when a Host fails.

Distinguishing Virtual Machine Protection Levels

People often confuse these solutions. Let’s review them so you don’t pick the wrong “remedy” for your system.

1. Traditional Backup

This is the most basic form of protection. If a VM fails, you restore it from a backup. Recovery time (RTO) is usually measured in hours, and data loss (RPO) depends on the backup schedule (typically up to one day with nightly backups). This method only suits servers that aren’t time-sensitive.

2. vSphere High Availability (HA)

HA protects at the physical Host level. When a Host dies, its VMs are powered on again on another Host. Expect an interruption of about 2-5 minutes, because each VM has to go through a full OS boot. This is the standard choice for most systems today because it is cost-effective.

3. vSphere Fault Tolerance (FT)

FT creates a copy (Secondary VM) that runs in parallel with the main instance (Primary VM) on a different Host. The two VMs are kept in sync continuously: older releases used a “Lockstep” record-and-replay mechanism, while vSphere 6.0 and later use Fast Checkpointing to copy the Primary’s state to the Secondary over the FT Logging network. If the Primary fails, the Secondary takes over immediately, with no reboot. End users typically won’t notice any disruption.
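To put the three levels side by side, here is a rough Python sketch that estimates how many transactions stall during a single Host failure. The recovery times are the ballpark figures from this article, the 4-hour restore is an assumed stand-in for “hours”, and the load of 2 transactions per second is a made-up number, so substitute your own.

```python
# Rough recovery times per protection level, in seconds.
# These follow the ballpark numbers in this article; they are not measurements.
RECOVERY_SECONDS = {
    "backup": 4 * 3600,  # assumption: a 4-hour restore stands in for "hours"
    "ha": 180,           # the 3-minute restart from the 2 AM incident
    "ft": 0,             # the Secondary VM takes over with no reboot
}


def stalled_transactions(level: str, tx_per_second: float) -> int:
    """Estimate transactions that queue up or fail while the VM is unavailable."""
    return round(RECOVERY_SECONDS[level] * tx_per_second)


# Hypothetical load of 2 transactions per second.
for level in RECOVERY_SECONDS:
    print(f"{level:>6}: {stalled_transactions(level, 2.0)} stalled transactions")
```

Even at this modest rate, the HA restart alone accounts for a few hundred stuck transactions, which matches the incident described at the start.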

Using FT: It’s Not All Sunshine and Rainbows

As great as it is, FT isn’t a “silver bullet” for every problem. There are 3 major hurdles you need to consider:

  • Double the resource consumption: A VM with FT enabled will occupy double the CPU and RAM in the Cluster. In reality, you are running two VMs for the same job.
  • Bandwidth pressure: FT Logging (synchronization) traffic can be very heavy. A write-intensive workload can push it to several Gbps. If the network can’t keep up, VM performance will suffer significantly.
  • vCPU Limits: vSphere 8.0 currently supports a maximum of 8 vCPUs for a VM using FT. This is a fairly modest number for massive Database Servers.

Advice: Only use FT for “cannot-fail” services such as: Domain Controllers, Payment Gateways, or core Databases.
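For capacity planning, the first and third hurdles are easy to check up front. The sketch below is a minimal illustration, not a VMware tool: `VmSize`, `ft_footprint`, and the VM name `erp-db01` are hypothetical, and the 8-vCPU cap is simply the figure quoted above.

```python
from dataclasses import dataclass

MAX_FT_VCPUS = 8  # per-VM limit quoted above for vSphere 8.0


@dataclass
class VmSize:
    name: str
    vcpus: int
    ram_gb: int


def ft_footprint(vm: VmSize) -> dict:
    """Return the cluster-wide CPU and RAM consumed once FT is turned on."""
    if vm.vcpus > MAX_FT_VCPUS:
        raise ValueError(
            f"{vm.name}: {vm.vcpus} vCPUs exceeds the FT limit of {MAX_FT_VCPUS}"
        )
    # Primary + Secondary: every resource is consumed twice.
    return {"vcpus": vm.vcpus * 2, "ram_gb": vm.ram_gb * 2}


erp_db = VmSize(name="erp-db01", vcpus=8, ram_gb=32)  # hypothetical VM
print(ft_footprint(erp_db))
```

Running this kind of check over every FT candidate before you enable anything makes it obvious how quickly the doubled footprint eats into Cluster headroom.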

Prerequisites for Deploying FT

To configure FT smoothly without encountering “Incompatible” errors, check this list:

  1. CPU: Must be from the same family and support Hardware Virtualization. It’s best to use Hosts with identical CPUs to avoid instruction set errors.
  2. Networking: A dedicated network path for FT Logging is mandatory. I recommend using 10Gbps cards. With 1Gbps cards, you should only run a maximum of 1-2 FT VMs to avoid bottlenecks.
  3. Storage: Shared Storage (SAN/NAS) is mandatory. Both Hosts must see the VM files.
  4. License: vSphere Enterprise Plus is needed to take full advantage of FT (up to 8 vCPUs per VM). Standard and Enterprise editions limit FT VMs to 2 vCPUs.
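The first three items of the checklist can be expressed as a simple pre-flight check. This is only a sketch under stated assumptions: the `HostInfo` record, its fields, and the host and datastore names are invented for illustration, and the “at most 2 FT VMs below 10Gbps” rule is the rule of thumb from item 2, not an official limit.

```python
from dataclasses import dataclass


@dataclass
class HostInfo:
    name: str
    cpu_family: str   # free-form label, e.g. "Intel Cascade Lake"
    ft_nic_gbps: int  # speed of the NIC that carries FT Logging
    datastores: set   # names of the datastores this Host can see


def ft_prerequisite_issues(primary, secondary, vm_datastore, planned_ft_vms):
    """Return a list of human-readable problems; an empty list means no issue found."""
    issues = []
    if primary.cpu_family != secondary.cpu_family:
        issues.append("CPU families differ between the two Hosts")
    slowest = min(primary.ft_nic_gbps, secondary.ft_nic_gbps)
    if slowest < 10 and planned_ft_vms > 2:
        issues.append(
            f"{planned_ft_vms} FT VMs planned, but the slowest FT Logging link "
            f"is {slowest}Gbps (keep it to 2 at most)"
        )
    for host in (primary, secondary):
        if vm_datastore not in host.datastores:
            issues.append(f"{host.name} cannot see datastore {vm_datastore}")
    return issues


# Hypothetical pair of Hosts: the second one only has a 1Gbps FT Logging NIC.
esx01 = HostInfo("esx01", "Intel Cascade Lake", 10, {"san-ds01", "san-ds02"})
esx02 = HostInfo("esx02", "Intel Cascade Lake", 1, {"san-ds01"})
for issue in ft_prerequisite_issues(esx01, esx02, "san-ds01", planned_ft_vms=3):
    print("WARNING:", issue)
```

Licensing is left out on purpose, since it is easier to confirm in vCenter than to model in a script.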

Detailed Configuration Steps

Step 1: Setting up FT Logging

This is the lifeblood of FT. Without it, this feature cannot be activated.

  1. In vCenter, select Host -> Configure -> Networking -> VMkernel adapters.
  2. Select Add Networking -> VMkernel Network Adapter.
  3. In the Port properties section, check the Fault Tolerance Logging box.
Pro tip: If possible, keep FT Logging and vMotion on separate adapters. Sharing a single 1Gbps port between them can easily saturate the link and stall the VM during large data syncs.

Step 2: Cleaning up the VM before enabling

Right-click the VM -> Compatibility -> Run Compatibility Check. FT will refuse to work if:

  • The VM still has Snapshots (delete them all before starting).
  • The VM’s CD-ROM is still connected to an ISO or drive on your own workstation (a client device) instead of an ISO on a datastore.
  • The CPU/RAM Hot-plug feature is set to Enabled.
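If you keep an inventory export of your VMs, the three blockers above are easy to screen for before you touch vCenter. The sketch below assumes a hypothetical record format: the keys `snapshot_count`, `cdrom_backing`, `cpu_hot_add`, and `memory_hot_add` are invented for this example and do not correspond to a specific vSphere API.

```python
from typing import List


def ft_blockers(vm: dict) -> List[str]:
    """Return the clean-up tasks needed before FT; an empty list means none found."""
    blockers = []
    if vm.get("snapshot_count", 0) > 0:
        blockers.append("delete all snapshots")
    if vm.get("cdrom_backing") == "client-device":
        blockers.append("disconnect the ISO mounted from a workstation")
    if vm.get("cpu_hot_add") or vm.get("memory_hot_add"):
        blockers.append("disable CPU/RAM hot-plug")
    return blockers


# Hypothetical inventory record for one VM.
erp_app = {
    "snapshot_count": 2,
    "cdrom_backing": "client-device",
    "cpu_hot_add": False,
    "memory_hot_add": True,
}
for item in ft_blockers(erp_app):
    print("TODO:", item)
```

This only covers the three conditions listed here; vCenter performs its own, more complete validation when you turn FT on.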

Step 3: Activating Fault Tolerance

The final steps are quite simple:

  1. Right-click the VM -> Fault Tolerance -> Turn On Fault Tolerance.
  2. Select the Datastore to store files for the Secondary VM.
  3. Select the secondary Host.
  4. Click Finish and monitor the synchronization progress bar.

Once completed, the VM icon will turn dark blue. The Summary tab will now show the status as “Protected”.

The “Heart Attack” Test: Pulling the Server Power Cord

To verify the value of FT, try a real-world test. Open a Command Prompt on your workstation and ping the VM’s IP continuously:

ping 192.168.1.100 -t

Now abruptly power off the Host running the Primary VM (or pull its network cable). If everything is configured correctly, the ping should keep answering; at worst you may see a single lost packet or a brief latency spike during the switchover, but the service stays up. That is the difference between “restarting” and “continuous running.”
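Watching ping output scroll by makes it easy to miss a short gap. The following Python sketch logs one probe per second and reports the longest run of missed replies. It shells out to the operating system’s `ping` command and assumes Windows or Linux flag syntax; the function names are my own, and exit codes from `ping` are not perfectly reliable on every platform, so treat the result as an indication rather than proof.

```python
import platform
import subprocess
import time


def ping_once(host: str) -> bool:
    """Send one ICMP echo via the OS ping command; True if it exited successfully."""
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", "1000", host]  # 1 probe, 1000 ms timeout
    else:
        cmd = ["ping", "-c", "1", "-W", "1", host]     # 1 probe, 1 s timeout (Linux)
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def longest_outage(results) -> int:
    """Return the longest run of consecutive failed probes."""
    longest = current = 0
    for ok in results:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def monitor(host: str, probes: int, interval: float = 1.0) -> list:
    """Probe the host repeatedly and return one True/False result per probe."""
    results = []
    for _ in range(probes):
        results.append(ping_once(host))
        time.sleep(interval)
    return results


# Example: start this, then fail the Host within the next two minutes.
# results = monitor("192.168.1.100", probes=120)
# print("Longest gap (probes):", longest_outage(results))
```

With a one-second interval, the reported number is roughly the outage in seconds, which makes it easy to compare an FT failover against an HA restart.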

Field Experience

After many deployment projects, I’ve gathered some vital notes:

  • Memory Reservation: FT requires a full memory reservation. If the VM has 32GB of RAM, the Host sets aside the full 32GB for it, and that memory cannot be overcommitted. The Secondary VM needs the same amount on its own Host, so calculate the Cluster’s total RAM carefully.
  • Network Latency: If latency between Hosts is high (>1ms), application performance inside the VM will noticeably decrease. Prioritize using high-speed core switches for FT Logging.
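  • Maintenance: When you need to patch a Host, use “Suspend Fault Tolerance” rather than “Turn Off Fault Tolerance”. Turning FT off removes the Secondary VM, so you would have to set it up again afterwards; suspending keeps the configuration and lets you resume later.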
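The memory reservation note is worth turning into a quick placement check. The sketch below lists which pairs of Hosts could hold the Primary and Secondary of one FT VM, given how much unreserved RAM each Host has left. The host names and free-memory figures are hypothetical, and the check ignores per-VM memory overhead, so leave yourself some margin.

```python
from itertools import permutations


def placement_options(hosts_free_gb: dict, vm_ram_gb: int) -> list:
    """List (primary_host, secondary_host) pairs where both Hosts can reserve the VM's RAM."""
    fits = [name for name, free in hosts_free_gb.items() if free >= vm_ram_gb]
    # Primary and Secondary must sit on different Hosts, hence ordered pairs.
    return list(permutations(fits, 2))


# Hypothetical unreserved RAM per Host, in GB.
free_ram = {"esx01": 48, "esx02": 40, "esx03": 16}
for primary, secondary in placement_options(free_ram, vm_ram_gb=32):
    print(f"Primary on {primary}, Secondary on {secondary}")
```

If the list comes back empty, the Cluster cannot honor the full reservation on two separate Hosts, and FT will not be able to start the Secondary VM until you free up memory.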

Hopefully, these insights help you feel more confident when deploying vSphere FT. If you encounter any errors during configuration, don’t hesitate to leave a comment, and I’ll help you out!
