When Backup Alone is Not Enough
If you rely only on traditional backups, you will be in serious trouble when an entire data center goes down. I once watched a power station failure at an industrial park in Binh Duong paralyze a data center for over 12 hours. Even with backups on hand, manually restoring 200 virtual machines (VMs) to a secondary site was an impossible mission. The actual recovery time stretched to 24 hours, far beyond the 4-hour RTO committed in the SLA. That costly lesson led me to deploy VMware Site Recovery Manager (SRM) immediately afterward.
I am currently managing a cluster of 8 ESXi hosts running vSphere 7.0 U3. SRM doesn’t copy data directly; it acts like a “commander-in-chief” coordinating the recovery. Instead of staying up all night manually powering on VMs, changing IPs, or checking boot orders, you only need a single click for SRM to automate the entire process.
Core Concepts to Master
To avoid confusion during configuration, remember these 4 components:
- Protected & Recovery Site: Simply put, the Main Site (active) and the Backup Site (on standby).
- vSphere Replication (VR): The “right hand” that pushes VM data to the other site. If you use high-end SANs, you can leverage Array-Based Replication.
- Protection Groups: Sets of related VMs that are protected and failed over together. For example, the Database and App VMs of an accounting system must go together.
- Recovery Plans: A detailed “script.” It determines which VM starts first, which one follows, and how IPs are modified.
Real-World Deployment Workflow
1. Infrastructure Preparation
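You need vCenter at both ends. My advice is to install matching SRM versions at both sites to avoid needless API compatibility errors. Don’t forget to open the required ports between the sites. On current appliance-based releases these typically include 443 (SRM and vCenter communication), 8043 (vSphere Replication management), and 31031 (replication traffic, or 32032 when traffic encryption is enabled). The exact list depends on your SRM and VR versions, so check VMware’s ports reference for your release. If the firewall blocks these ports, site pairing will fail immediately.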
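Before pairing, a quick reachability check from one site to the appliances at the other can save a failed attempt. The following is a minimal Python sketch, not an SRM tool: the `check_ports` and `blocked_ports` helpers are hypothetical, and the host and port list you pass in are assumptions to be replaced with the values for your own SRM and vSphere Replication versions.

```python
import socket


def check_ports(host, ports, timeout=3.0):
    """Return {port: True/False} for TCP reachability of each port on host."""
    results = {}
    for port in ports:
        try:
            # A completed TCP handshake means the firewall path is open
            # and something is listening on the port.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            results[port] = False
    return results


def blocked_ports(results):
    """List the ports that could not be reached, in ascending order."""
    return sorted(port for port, ok in results.items() if not ok)
```

For example, `blocked_ports(check_ports("srm-dr.example.local", [443]))` returns an empty list when port 443 is reachable; the hostname here is a placeholder. A successful TCP connection only proves the path is open, not that the service behind it is healthy.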
2. Installation and Pairing
SRM now runs as an Appliance (Photon OS), so installation is very fast, taking about 15 minutes. Once installed, proceed with “Site Pairing.” This is the handshake step to establish trust between the two vCenters.
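# Quick check of the SRM connection via PowerCLI
Connect-VIServer -Server <Secondary_vCenter_IP>
# Connects to the SRM server registered with that vCenter.
# Add -RemoteCredential if the paired site uses a different account.
$srm = Connect-SrmServer
$srmApi = $srm.ExtensionData
# Returns the paired site's details; an error here usually means pairing is broken
$srmApi.GetPairedSite()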
3. Configuring Inventory Mappings
This is the most error-prone step. Mappings help SRM understand: “If a VM on Site A uses VLAN 10, it must jump to VLAN 110 on Site B.” Carefully review 4 items: Network, Folder, Resource, and Storage Policy. A single mapping mismatch can leave a VM “orphaned” from the network or resource-starved when powered on at the recovery site.
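One way to catch a mapping mismatch before a failover is to compare an exported VM inventory against the mappings you intend to configure. The sketch below assumes you have both as plain Python dictionaries; the `find_unmapped` helper, the VM names, and the folder and cluster names are all hypothetical and do not come from SRM itself.

```python
def find_unmapped(vm_inventory, mappings):
    """Return {vm_name: [(kind, source_object), ...]} for objects with no mapping.

    vm_inventory: {vm_name: {kind: source_object}}
    mappings:     {kind: {source_object: target_object}}
    """
    gaps = {}
    for vm_name, objects in vm_inventory.items():
        for kind, source in objects.items():
            # A missing kind is treated the same as a missing entry.
            if source not in mappings.get(kind, {}):
                gaps.setdefault(vm_name, []).append((kind, source))
    return gaps


# Hypothetical inventory exported from the protected site.
vms = {
    "acct-db01": {"network": "VLAN10", "folder": "Accounting", "resource": "Prod-Cluster"},
    "acct-app01": {"network": "VLAN20", "folder": "Accounting", "resource": "Prod-Cluster"},
}

# Hypothetical site A -> site B mappings; VLAN20 is deliberately missing.
site_mappings = {
    "network": {"VLAN10": "VLAN110"},
    "folder": {"Accounting": "DR-Accounting"},
    "resource": {"Prod-Cluster": "DR-Cluster"},
}

print(find_unmapped(vms, site_mappings))
```

Here only `acct-app01` is reported, because its network `VLAN20` has no counterpart at the recovery site. An empty result means every object in the inventory has a target.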
4. Creating Protection Groups and Recovery Plans
Once the initial data synchronization is complete, add the VMs to Protection Groups, then set their start-up order in the Recovery Plan. I usually sort them by priority group. The Database must be Priority 1 so it starts first and is ready to serve. Next comes the Application (Priority 3), and finally the Web Server (Priority 5). This prevents applications from throwing “Connection Timeout” errors because everything booted at the same time.
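To make the ordering concrete, here is a small Python sketch that turns a priority assignment into start-up waves. It only illustrates the idea of SRM's five priority groups; `boot_waves` and the VM names are hypothetical, and SRM applies this ordering itself once priorities are set in the Recovery Plan.

```python
from itertools import groupby


def boot_waves(vm_priorities):
    """Group VMs into start-up waves, lowest priority number first.

    vm_priorities: {vm_name: priority}, where priority is an int from 1 to 5.
    """
    for name, priority in vm_priorities.items():
        if priority not in range(1, 6):
            raise ValueError(f"{name}: priority must be 1-5, got {priority}")
    # Sort by priority, then by name so each wave has a stable order.
    ordered = sorted(vm_priorities.items(), key=lambda item: (item[1], item[0]))
    return [
        (priority, [name for name, _ in group])
        for priority, group in groupby(ordered, key=lambda item: item[1])
    ]


# Hypothetical three-tier application.
plan = {"web01": 5, "app01": 3, "db01": 1, "web02": 5}
for priority, names in boot_waves(plan):
    print(f"Priority {priority}: {', '.join(names)}")
```

Leaving gaps between the numbers (1, 3, 5) keeps room to slot in a new tier later, such as a cache or message queue, without renumbering everything.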
Testing – Don’t Wait for the Fire to Look for the Hose
A DR system that isn’t tested periodically is just a system “on paper.” The highlight of SRM is the Test Recovery feature. It creates an isolated network environment (Bubble Network) at the recovery site. SRM will clone VMs from the latest replica to test-run without affecting the live system at the protected site.
At my company, the entire team runs a test every quarter. These tests have exposed problems we would never have found on paper: in one run, software licenses were locked because the virtual hardware IDs had changed, and in another, internal DNS had not updated fast enough. Because we caught these early, we added handling scripts to the Recovery Plan so that everything runs smoothly during a real incident.
Hard-Earned Lessons to Avoid Sleepless Nights
After several projects, I’ve drawn 3 important notes:
- DNS is key: When a VM changes its IP, DNS must follow. Use Dynamic DNS update scripts in the post-power on steps.
- Replication Bandwidth: Don’t underestimate this. A 10 Mbps link can move roughly 108 GB per day at best, so if the data change rate (churn rate) is 100 GB/day there is almost no headroom. Any burst of writes or protocol overhead will push your RPO hours behind.
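- SRM Licensing: Set up alarms for when SRM licenses are about to expire. Once a license lapses, you can no longer create protection groups or add VMs to existing ones until a valid key is assigned, which is the last thing you want to discover during an incident.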
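The bandwidth point is easy to check with back-of-the-envelope arithmetic. The Python sketch below uses the 100 GB/day and 10 Mbps figures from the list above; the function names are hypothetical, decimal units are used (1 GB = 8,000 megabits), and the 70% link efficiency is purely an assumed value to illustrate overhead and contention.

```python
def required_mbps(churn_gb_per_day, window_hours=24.0):
    """Average link speed (Mbps) needed to ship the daily churn within the window.

    Uses decimal units: 1 GB = 8,000 megabits.
    """
    megabits = churn_gb_per_day * 8_000
    return megabits / (window_hours * 3600)


def hours_to_ship(data_gb, link_mbps, efficiency=1.0):
    """Hours needed to transfer data_gb over the link at the given efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = data_gb * 8_000 / (link_mbps * efficiency)
    return seconds / 3600


print(f"{required_mbps(100):.2f} Mbps")        # 9.26 Mbps
print(f"{hours_to_ship(100, 10):.1f} h")       # 22.2 h
print(f"{hours_to_ship(100, 10, 0.7):.1f} h")  # 31.7 h (assumed 70% efficiency)
```

On a perfect link, one day of churn takes about 22 hours to ship. At the assumed 70% efficiency it takes almost 32 hours, so replication falls further behind every day and the RPO can never be met.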
Conclusion
VMware SRM might seem a bit confusing during the initial Mapping phase. However, once mastered, it gives you immense power. It transforms a risky manual recovery process into a precise automated script. Investing time in SRM today is the best way to protect your IT team’s reputation when critical failures occur.

