Configuring vCenter Server High Availability (vCenter HA): Eliminating the Single Point of Failure in Your vSphere Environment

VMware tutorial - IT technology blog

After vCenter went down in the middle of a customer demo — leaving the entire datacenter without centralized management for 45 minutes — I made a decision: never again. vCenter HA is something I should have configured from day one, not after an incident.

This article is based on hands-on experience deploying vCenter HA in production with VCSA 8.0.2, after six months of real-world operation — including one unplanned failover at 2 AM.

Quick Start — Enable vCenter HA in 5 Minutes

Before diving into the theory, here’s the fastest way to check whether your environment meets the prerequisites:

Check Prerequisites

  • vCenter Server Appliance (VCSA) 6.5 or later — Windows vCenter is not supported
  • A vCenter Server Standard license (a single vCenter license covers all three nodes)
  • At least 3 ESXi hosts to distribute the 3 HA nodes across separate physical hosts
  • A dedicated IP range for the HA network, isolated from the management network
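The checklist above is easy to turn into a small pre-flight check. The following is only a sketch: the `Environment` fields are hypothetical values you would fill in by hand or from your own inventory, and licensing is left out because it cannot be judged from these inputs.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical description of an environment (not a VMware data type)."""
    vcsa_version: str              # e.g. "8.0.2"
    is_appliance: bool             # False for a Windows-based vCenter
    esxi_host_count: int
    has_dedicated_ha_network: bool


def prerequisite_problems(env: Environment) -> list[str]:
    """Return unmet prerequisites; an empty list means every check passed."""
    problems = []
    if not env.is_appliance:
        problems.append("vCenter HA requires the appliance (VCSA), not Windows vCenter")
    # Compare major.minor only, so "6.5", "6.5.0" and "8.0.2" all parse.
    parts = env.vcsa_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    if (major, minor) < (6, 5):
        problems.append(f"VCSA {env.vcsa_version} is older than 6.5")
    if env.esxi_host_count < 3:
        problems.append("fewer than 3 ESXi hosts: the nodes cannot sit on separate hosts")
    if not env.has_dedicated_ha_network:
        problems.append("no dedicated HA network")
    return problems


# Example: a two-host lab fails only the host-count check.
for problem in prerequisite_problems(Environment("8.0.2", True, 2, True)):
    print("MISSING:", problem)
```

An empty result does not prove the environment is ready; it only means none of the four listed items is obviously missing.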

Quick Configuration Steps

  1. Log in to vSphere Client → click your vCenter Server in the inventory
  2. Go to Configure → vCenter HA
  3. Click Set Up vCenter HA
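  4. Select Basic so the wizard creates the Passive and Witness nodes automatically (recent vSphere Client releases drop the Basic/Advanced choice and instead offer an option to create the clones automatically)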
  5. Enter IPs for the HA network, Passive node, and Witness node
  6. Click Finish — allow approximately 30–45 minutes for cloning and sync

That covers the basic setup. But if you want to truly understand what’s happening under the hood — and why you should test failover before going to production — keep reading.

How vCenter HA Architecture Works

vCenter HA creates a 3-node cluster from your VCSA:

  • Active node: Your current VCSA, handling all vSphere management workloads
  • Passive node: A clone of the Active node that receives a real-time data stream over the HA network, always ready to take over
  • Witness node: A lightweight “tiebreaker” node (2 vCPU, 8 GB RAM) that provides quorum, so Active and Passive can never both hold the Active role (split-brain)

When the Active node fails, the Passive node automatically promotes itself to the new Active, typically within 4–8 minutes, with no manual intervention required. The Passive node takes over the Active node’s management IP address and FQDN, so clients reconnect to the same address once services are back up.
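To make the three-node behavior concrete, here is a deliberately simplified model. It assumes a plain majority rule (at least two of three nodes must be up) and ignores partial network isolation, so treat it as an illustration of the idea rather than VMware’s actual election logic.

```python
from itertools import product


def cluster_outcome(active_up: bool, passive_up: bool, witness_up: bool) -> str:
    """Simplified majority model of a 3-node vCenter HA cluster."""
    nodes_up = sum([active_up, passive_up, witness_up])
    if nodes_up < 2:
        # A lone node cannot tell a peer failure from its own isolation.
        return "no quorum: vCenter unavailable"
    if active_up:
        return "Active keeps serving"
    # Active is down, so the two surviving nodes are Passive and Witness.
    return "Passive promotes itself to Active"


# Walk through all eight up/down combinations.
for active, passive, witness in product([True, False], repeat=3):
    print(f"active={active!s:5} passive={passive!s:5} witness={witness!s:5} ->",
          cluster_outcome(active, passive, witness))
```

The model shows why losing the Witness alone is harmless, while losing any two nodes takes vCenter offline.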

The HA Network — The Most Overlooked Part

The area where I see the most misconfiguration is networking. vCenter HA requires two separate networks:

  • Management network: The standard network you already have
  • HA network: A dedicated network for replication and heartbeat traffic; it needs high bandwidth and low latency (VMware specifies under 10 ms between nodes)

If HA traffic shares the same physical network as management traffic, VCHA will still function, but fault detection is slower and replication can suffer when the management network is busy. Keeping them separate makes everything significantly more stable.
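Before running the wizard, I find it useful to sanity-check the address plan on paper. This sketch uses Python’s standard `ipaddress` module; the subnets and node addresses are made-up examples, not values from my environment.

```python
import ipaddress


def check_ha_addressing(mgmt_cidr: str, ha_cidr: str, node_ips: dict[str, str]) -> list[str]:
    """Return problems found in an HA address plan; an empty list means it looks sane."""
    mgmt = ipaddress.ip_network(mgmt_cidr)
    ha = ipaddress.ip_network(ha_cidr)
    problems = []
    if mgmt.overlaps(ha):
        problems.append(f"HA network {ha} overlaps management network {mgmt}")
    for node, ip in node_ips.items():
        if ipaddress.ip_address(ip) not in ha:
            problems.append(f"{node} address {ip} is outside HA network {ha}")
    if len(set(node_ips.values())) != len(node_ips):
        problems.append("two nodes share the same HA address")
    return problems


# Hypothetical plan: management on 192.168.10.0/24, HA on 10.10.50.0/24
plan = {"active": "10.10.50.11", "passive": "10.10.50.12", "witness": "10.10.50.13"}
print(check_ha_addressing("192.168.10.0/24", "10.10.50.0/24", plan) or "address plan looks consistent")
```

It only validates the plan itself; it says nothing about VLANs, routing, or latency on the real network.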

Resource Allocation for the 3 Nodes

Active and Passive share the same specs since Passive is a clone. The Witness is much lighter:

# Default Witness node specs
vCPU: 2
RAM: 8 GB
Disk: ~10 GB (stores metadata only, not actual data)

My standard recommendation: place the 3 nodes on 3 different physical ESXi hosts — ideally across 3 separate racks or independent power circuits. If the entire cluster sits on one rack that loses power, HA won’t save you.
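The placement rule can be checked mechanically once you know where each node runs. A minimal sketch, assuming you maintain a mapping of node role to ESXi host and rack yourself (the host and rack names here are invented):

```python
from collections import Counter


def placement_warnings(placement: dict[str, tuple[str, str]]) -> list[str]:
    """placement maps node role -> (esxi_host, rack); returns shared-failure-domain warnings."""
    warnings = []
    hosts = Counter(host for host, _rack in placement.values())
    racks = Counter(rack for _host, rack in placement.values())
    for host, count in hosts.items():
        if count > 1:
            warnings.append(f"{count} nodes share host {host}")
    for rack, count in racks.items():
        if count > 1:
            warnings.append(f"{count} nodes share rack {rack}")
    return warnings


# Hypothetical layout: Active and Passive ended up on the same host.
layout = {
    "active": ("esx01", "rack-a"),
    "passive": ("esx01", "rack-a"),
    "witness": ("esx03", "rack-b"),
}
for warning in placement_warnings(layout):
    print("WARNING:", warning)
```

Rerun it after any DRS or maintenance activity, since node placement can drift over time.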

Advanced Configuration

Advanced Mode: Manual Control Over Node Placement

Instead of Basic, use Advanced when you need to:

  • Manually pin Passive and Witness to specific hosts (to prevent DRS from migrating them)
  • Use separate datastores for each node
  • Manually configure IPs to match your organization’s existing subnets

Check VCHA Status via API

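# Get session token
# (administrator@vsphere.local is a placeholder; use your own SSO account and password)
SESSION=$(curl -sk -X POST \
  -u 'administrator@vsphere.local:YourPassword' \
  https://vcenter.lab.local/api/session \
  -H "Content-Type: application/json" | tr -d '"')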

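# Check the health state of the VCHA cluster
# (the Automation API documents this read as a POST with ?action=get;
#  confirm against the API reference for your vCenter version)
curl -sk -X POST \
  "https://vcenter.lab.local/api/vcenter/vcha/cluster?action=get" \
  -H "vmware-api-session-id: $SESSION" \
  -H "Content-Type: application/json" -d '{}' | python3 -m json.tool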

The output returns the current mode (ENABLED/DISABLED/MAINTENANCE), the active node IP, and the health state. I’ve integrated this into Nagios to alert when the cluster is not in a HEALTHY state.
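For the Nagios side, the JSON from the status call has to become a plugin exit code. The sketch below shows one way to do that mapping. It assumes the response is a JSON object containing the `mode` and `health_state` fields mentioned above, and the sample payload is hypothetical, not captured output.

```python
import json

# Standard Nagios plugin exit codes
OK, CRITICAL, UNKNOWN = 0, 2, 3


def nagios_result(raw_json: str) -> tuple[int, str]:
    """Translate a VCHA cluster response into (exit_code, status_line)."""
    try:
        info = json.loads(raw_json)
    except json.JSONDecodeError:
        return UNKNOWN, "UNKNOWN - response is not valid JSON"
    if not isinstance(info, dict) or "health_state" not in info:
        return UNKNOWN, "UNKNOWN - no health_state in response"
    summary = f"mode={info.get('mode', '?')} health_state={info['health_state']}"
    if info["health_state"] == "HEALTHY":
        return OK, f"OK - {summary}"
    # Anything other than HEALTHY is treated as critical, matching the alert rule above.
    return CRITICAL, f"CRITICAL - {summary}"


# Hypothetical sample payload, trimmed to the two fields used here
sample = '{"mode": "ENABLED", "health_state": "HEALTHY"}'
code, line = nagios_result(sample)
print(code, line)
```

In a real plugin you would read the curl output from standard input, print the status line, and exit with the returned code. Returning UNKNOWN for an unreadable response keeps an API outage from looking like a cluster failure.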

Proactive Failover Testing — Mandatory Before Production

This is a step I require immediately after setup, before any handoff:

# Trigger a manual failover via vSphere Client:
# Configure → vCenter HA → Initiate Failover

# Or via VCHA CLI on the Active node (SSH into VCSA); confirm the utility exists on your build before relying on it
/usr/lib/vmware-vcha/bin/vcha-vmafd-util --mode FAILOVER

The first time I tested, I discovered a firewall was blocking port 8182 between the Passive and Witness nodes — something that only surfaces when you actually trigger a failover, not during setup. If you skip the test, you’ll find out at the worst possible moment.
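A blocked port like this can be caught before the failover test with a plain TCP connection check. This is a minimal sketch using only the standard library. It assumes you run it from a machine whose traffic takes the same path as the node you are testing (ideally from the node itself), because a firewall rule between two specific nodes is only exercised from those nodes. It covers TCP only.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def blocked_nodes(nodes: dict[str, str], port: int = 8182) -> list[str]:
    """Return the names of nodes whose port could not be reached."""
    return [name for name, ip in nodes.items() if not tcp_port_open(ip, port)]


# Example call (addresses are hypothetical HA-network IPs):
#   blocked_nodes({"passive": "10.10.50.12", "witness": "10.10.50.13"})
```

A successful connection only shows the port is reachable from where the script runs, so the manual failover test is still required.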

Practical Tips from 6 Months in Production

1. Automated Health Monitoring Script

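#!/bin/bash
# Run every 5 minutes via cron to monitor VCHA health
VCENTER="vcenter.lab.local"
ALERT_EMAIL="ops@example.com"   # placeholder recipient; replace with your own

# Log in; -f makes curl fail on HTTP errors, so SESSION stays empty on a bad login
SESSION=$(curl -skf -X POST -u admin:pass \
  "https://$VCENTER/api/session" \
  -H "Content-Type: application/json" | tr -d '"')

# Read health_state (documented as POST ...?action=get; confirm for your version)
STATUS=$(curl -skf -X POST \
  "https://$VCENTER/api/vcenter/vcha/cluster?action=get" \
  -H "vmware-api-session-id: $SESSION" \
  -H "Content-Type: application/json" -d '{}' | \
  python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('health_state','UNKNOWN'))" 2>/dev/null)
# Any curl or JSON failure leaves STATUS empty; report that as UNKNOWN
STATUS=${STATUS:-UNKNOWN}

if [ "$STATUS" != "HEALTHY" ]; then
  echo "ALERT: vCenter HA status is $STATUS" | \
    mail -s "[CRITICAL] vCHA Alert" "$ALERT_EMAIL"
fi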

2. The Correct Patching Process

When patching vCenter, order matters — get it wrong and you’re looking at hours of restore work:

  1. Go to Configure → vCenter HA → Enter Maintenance Mode
  2. Update the Active node via VAMI (https://vcenter:5480)
  3. Verify normal operation after the update
  4. Exit Maintenance Mode — VCHA will automatically sync the Passive node

If you update while in ENABLED mode and VCSA restarts mid-process, you risk a split-brain scenario. I’ve experienced this once and spent 2 hours restoring the cluster to a healthy state.

3. Real-World HA Network Bandwidth

In an environment with around 200 VMs, replication traffic typically sits at 50–100 Mbps at idle and can spike to 300–400 Mbps after significant changes like bulk VM deployments or concurrent vMotion operations. Ensure the HA network has at minimum 1 Gbps — ideally 10 Gbps for larger environments.
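To put those numbers in perspective, here is the link utilization worked out for both link speeds. The traffic figures are the ones quoted above; the function itself is just the arithmetic.

```python
def link_utilization(traffic_mbps: float, link_gbps: float) -> float:
    """Percentage of a link consumed by the given traffic (1 Gbps = 1000 Mbps)."""
    return traffic_mbps * 100 / (link_gbps * 1000)


observed = {"idle (low)": 50, "idle (high)": 100, "spike (low)": 300, "spike (high)": 400}
for link_gbps in (1, 10):
    for label, mbps in observed.items():
        pct = link_utilization(mbps, link_gbps)
        print(f"{link_gbps:>2} Gbps link, {label:<12}: {pct:5.1f}% utilized")
```

A 400 Mbps spike uses 40% of a 1 Gbps link but only 4% of a 10 Gbps link, which is why 1 Gbps is workable at this scale but leaves little room for growth.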

4. VCSA Backups Are Still Required

vCenter HA protects against host failure but does not replace backups. If the VCSA database becomes corrupted, both Active and Passive will be corrupted since they sync in real time. Scheduled backups remain essential:

# Configure backups via VCSA VAMI
https://vcenter.lab.local:5480
# → Backup → Schedule Backup
# Supported protocols: FTP, FTPS, HTTP, HTTPS, SFTP, NFS, SMB

5. Comparing with Proxmox When I Migrated My Home Lab

When I migrated from VMware to Proxmox for my personal lab to cut licensing costs, I noticed something interesting: Proxmox uses cluster quorum (Corosync) at the hypervisor layer and has no central management server at all, because every node serves the same web interface and API. vSphere takes the opposite approach and concentrates management in vCenter, which makes vCenter a single point of failure unless you build HA for it. That comparison helped me understand exactly why VMware bakes HA into the management layer: it’s what enterprise environments genuinely need when SLAs don’t allow vCenter to be down for even 30 minutes.

When vCenter HA Isn’t Necessary

Not every environment needs it:

  • Small labs, dev/test environments with no committed SLA
  • Only 1–2 ESXi hosts — not enough to safely distribute 3 nodes
  • Budget constraints that rule out the required vCenter Server Standard license or the resources for two extra nodes

In those cases, a thorough VCSA backup strategy with an acceptable RTO — restoring from backup takes 1–2 hours — is a more practical solution. But if you’re running a datacenter with a 99.9%+ SLA, vCenter HA is worth the investment from day one, not after an outage.
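The SLA argument comes down to simple arithmetic: a 99.9% target allows about 8.76 hours of downtime per year. The sketch below computes that budget and compares it with the recovery times quoted in this article (1–2 hours for a restore, 4–8 minutes for a failover). It assumes, for illustration only, that vCenter downtime counts fully against the SLA.

```python
def downtime_budget_minutes(sla_percent: float, period_days: float = 365) -> float:
    """Minutes of downtime allowed by an availability target over the period."""
    return period_days * 24 * 60 * (1 - sla_percent / 100)


yearly = downtime_budget_minutes(99.9)
monthly = downtime_budget_minutes(99.9, period_days=30)
print(f"99.9% allows {yearly:.1f} min/year, {monthly:.1f} min per 30 days")

# Recovery times quoted in this article, in minutes
for label, minutes in (("restore from backup (worst case)", 120),
                       ("vCenter HA failover (worst case)", 8)):
    share = minutes / yearly * 100
    print(f"{label}: {minutes} min = {share:.1f}% of the yearly budget")
```

Measured per 30 days, the budget is about 43 minutes, so a single 45-minute outage like the one in my demo already exceeds it.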
