It's 2 AM, and Telegram notifications are exploding…
The monitoring system is screaming: Proxmox Node 1, the host of the production database for all company orders, just "died." It could be a datacenter power surge, or simply an ECC RAM module failing with uncorrectable errors. If you're running Proxmox in standalone mode, your only option is to jump out of bed, log into iDRAC, or rush to the server room to save the machine. Meanwhile, your boss's group chat starts blowing up as customers complain that the site is down.
This scenario is all too familiar for SysAdmins. I’ve experienced those sleepless nights myself when I first built a homelab with 12 VMs. From those “painful” lessons, I realized: if you’re running services for production, never put all your eggs in one basket.
Why isn’t your VM “auto-resurrecting”?
Many people mistakenly think that just connecting 2 or 3 Proxmox nodes into a cluster is enough. In reality, a cluster by itself only gives you centralized management. If the physical node hosting the VM loses power, that VM's disk stays stuck on that node's local storage.
Basically, there are three barriers preventing your system from reaching High Availability (HA):
- Lack of Quorum (Majority): In a cluster, nodes vote to determine who is still "alive." In a 2-node cluster, if one node dies, the survivor holds only 50% of the votes, which is not the strict majority (more than half) needed to make decisions. The cluster loses quorum and freezes to protect data from a "split-brain," where both sides keep writing independently.
- Local Storage: If the .qcow2 file is on Node 1’s internal SSD, what will Node 2 use to run the VM when Node 1 is completely powered off?
- HA Manager not assigned: Proxmox has a dedicated `ha-manager` service. If you don't add the VM to its resource list, it will do nothing when an incident occurs.
From manual evacuation to full automation
1. Live Migration (Maintenance use only)
If you know a node needs maintenance, move the VM to another node without service interruption. But this is useless when a node crashes unexpectedly because the disk becomes inaccessible.
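For reference, a live migration can also be triggered from the CLI; a minimal sketch, assuming a hypothetical VM ID 101 and a target node named node2:

```shell
# Move VM 101 to node2 while it keeps running.
# Requires shared storage, or add --with-local-disks for local disks.
qm migrate 101 node2 --online
```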
2. Recovery from Backup (Slow and exhausting)
Use Proxmox Backup Server (PBS) to restore the VM to a new server. This is safe but time-consuming. For a 500GB database, waiting for a restore could take over 40 minutes—too long for a system that needs to be back online immediately.
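If you do end up restoring from a backup, the CLI equivalent is `qmrestore`; the archive path, VM ID, and storage name below are illustrative:

```shell
# Restore a vzdump archive as VM 101 onto the local-lvm storage
qmrestore /mnt/backups/vzdump-qemu-101.vma.zst 101 --storage local-lvm
```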
3. Cluster combined with Shared Storage
This is a big step forward. You use a NAS (TrueNAS) or run Ceph for shared storage. When Node 1 dies, Node 2 still sees the VM’s disk files over the network. However, you still have to log into the web interface to manually click “Start” on the VM.
The ultimate solution: Implementing true Proxmox HA
To sleep soundly every night, you need the trifecta: Cluster (at least 3 nodes), Shared Storage, and HA Groups. Here is the practical workflow I usually apply.
Step 1: Set up the Cluster and solve the Quorum problem
Never run a 2-node cluster for production. If budget is tight, repurpose an old Raspberry Pi or a mini PC as a “QDevice” to provide that crucial third vote.
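Registering a QDevice is a short procedure; a sketch assuming the external device runs Debian at 192.168.1.50 (illustrative IP) and is reachable as root over SSH:

```shell
# On the external device (the third "vote"):
apt install corosync-qnetd

# On every cluster node:
apt install corosync-qdevice

# On one cluster node, register the QDevice:
pvecm qdevice setup 192.168.1.50
```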
Create the cluster from the command line on the main node:

```shell
pvecm create MY-CLUSTER
```

Join the remaining nodes:

```shell
pvecm add [MAIN-NODE-IP]
```

Verify with `pvecm status` and make sure the cluster reports `Quorate: Yes`.
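On a healthy 3-node cluster, the quorum section of the output looks roughly like this (abridged; vote counts are illustrative for three nodes):

```shell
pvecm status
# ...
# Votequorum information
# ----------------------
# Expected votes:   3
# Total votes:      3
# Quorum:           2
# Flags:            Quorate
```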
Step 2: Configure Shared Storage
Whether using NFS, iSCSI, or Ceph, make sure all nodes are mounted to the same Storage ID. I often use NFS from a dedicated server running RAID 10 for optimized speed.
Go to Datacenter > Storage > Add > NFS. Remember to check all Nodes in the list.
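The same storage can be registered from the CLI with `pvesm`; the storage ID, server address, and export path below are placeholders:

```shell
# Register an NFS share; it is visible to all cluster nodes by default
pvesm add nfs shared-nfs \
    --server 192.168.1.20 \
    --export /mnt/tank/proxmox \
    --content images,rootdir
```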
Step 3: Set up the HA Group
Many people skip this step and add the VM directly to HA. Big mistake! HA Groups let you control which nodes a VM prefers to fail over to. For example: Nodes 1 and 2 use powerful Xeon Gold CPUs, while Node 3 is an old server kept only as a last resort.
- Go to Datacenter > HA > Groups > Create.
- Name the group `Critical-Services`.
- Select Nodes 1 and 2, setting a higher `Priority` than Node 3.
- Tick `Restricted` if you want the VM to only run on these specific nodes.
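The same group can be created with `ha-manager`; node names here are illustrative, and a higher priority number means a more preferred node:

```shell
# node1/node2 preferred (priority 2), node3 as last resort (priority 1);
# --restricted 1 forbids running the VM anywhere outside this list
ha-manager groupadd Critical-Services \
    --nodes "node1:2,node2:2,node3:1" \
    --restricted 1
```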
Step 4: Activate HA for the Virtual Machine
Now, hand the VM over to the “manager”:
- Go to Datacenter > HA > Resources > Add.
- Select the `VM ID` of the critical database.
- Select the `Group` you just created.
- Set `Max Restart` to 1 (attempt one local restart before failing over).
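The CLI equivalent, assuming a hypothetical VM ID 101:

```shell
# Hand VM 101 over to the HA manager: keep it started, restrict it
# to the group, and allow one local restart before relocating
ha-manager add vm:101 \
    --state started \
    --group Critical-Services \
    --max_restart 1 \
    --max_relocate 1
```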
Real-world testing: The “unplugging” scenario
A good technician doesn't just trust theory. Perform a test during off-peak hours: type `poweroff` directly on Node 1.
After about 60-120 seconds (depending on the timeout configuration), Proxmox marks Node 1 as dead. Immediately, `ha-manager` instructs Node 2 to take control of the disk and boot the VM. Check the status with:

```shell
ha-manager status
```

The service state transitions from `started` to `fence`, then back to `started` on the new node. That is the moment your system saves itself.
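After a successful failover, the output looks roughly like this (node names, VM ID, and timestamps are illustrative):

```shell
ha-manager status
# quorum OK
# master node2 (active, Mon Jan  1 02:05:11 2024)
# lrm node1 (old timestamp - dead?)
# lrm node2 (active, Mon Jan  1 02:05:09 2024)
# service vm:101 (node2, started)
```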
Crucial Lessons:
- Corosync Network: Should have its own NIC or VLAN. If this network lags, the cluster may misinterpret a healthy node as dead and fence (reboot) it repeatedly.
- Fencing: Proxmox uses a Watchdog mechanism to ensure a dead node has truly “stopped breathing” before another node starts the VM, preventing two nodes from writing to the same file and corrupting data.
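A dedicated Corosync link is declared per node in `/etc/pve/corosync.conf`; a sketch where the 10.10.10.0/24 addresses are illustrative (remember to bump `config_version` when editing, so pmxcfs propagates the change):

```
# /etc/pve/corosync.conf (excerpt) -- ring0 on its own NIC/VLAN
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
}
```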
Configuring HA is actually quite straightforward once you master the two pillars: Quorum and Shared Storage. Hopefully, this guide gives you the confidence to build a highly fault-tolerant system for your business.

