2 AM. My phone buzzes. A vCenter alert: a wave of VMs showing Not Responding. I open the console — datastores have turned yellow, vmkernel logs are lit up red. This is a scenario I’ve been through three times over the years managing a VMware cluster with 8 ESXi hosts at my company — and every single time, the first question is always: is this APD or PDL?
The two conditions look identical on the surface — VMs frozen, I/O timeouts, storage inaccessible — but the root causes and remediation steps are completely different. Confusing one for the other can cost you an extra hour of downtime, or worse, corrupt your data.
What’s the Difference Between APD and PDL?
All Paths Down (APD)
APD occurs when all paths from an ESXi host to a storage device are lost, but ESXi cannot yet determine whether the loss is temporary or permanent. The host keeps the I/O queue open and continues retrying. If no path comes back within the APD timeout (140 seconds by default, controlled by the Misc.APDTimeout advanced setting), it places the device into an “APD timeout” state.
Common causes:
- Broken FC cable, failed SFP, unexpected SAN switch restart
- iSCSI network disruption (VLAN misconfiguration, NIC flap)
- Storage array temporarily offline due to unannounced maintenance
- Zoning changes on an FC switch
Permanent Device Loss (PDL)
PDL is a step more severe than APD. The storage array returns a SCSI sense code confirming the device no longer exists, most commonly sense key 0x5 with ASC 0x25 (logical unit not supported); hardware-failure codes such as 0x4/0x3e (logical unit failure) are treated the same way. ESXi receives this unambiguous signal and immediately knows: the loss is permanent, no need to wait for a timeout.
Situations that lead to PDL:
- A LUN is deleted or unmapped directly on the storage array
- Incomplete storage array failover
- A vSAN disk group fails entirely
- HBA driver crash causing the device to detach at the kernel level
Quick Comparison
- APD: Unclear whether temporary or permanent — ESXi waits 140s then enters timeout state, VMs hang, may self-recover when connectivity is restored
- PDL: Confirmed permanent immediately via SCSI sense code — instant response, VMs receive I/O errors, manual intervention required for recovery
Determining Whether You Have APD or PDL
Don’t guess. SSH into the affected ESXi host and run these commands immediately:
# Check path status to storage devices
esxcli storage nmp path list | grep -A5 "State"
# Find APD/PDL events in vmkernel log
grep -i "APD\|PDL\|permanent device\|all paths down" /var/log/vmkernel.log | tail -50
# Check specific device status
esxcli storage nmp device list
# View SCSI sense codes (critical for distinguishing PDL)
grep -i "H:0x0 D:0x2 P:0x0 Valid sense\|sense data" /var/log/vmkernel.log | tail -20
At the same time, open the vCenter Event Log and filter by the affected host. APD shows the event esx.problem.storage.apd.start (followed by esx.problem.storage.apd.timeout if the 140 seconds expire), while PDL shows esx.problem.scsi.device.state.permanentloss. Whichever one appears tells you what you're dealing with.
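The log check above can be turned into a quick verdict. The following is a minimal sketch, not a definitive detector: classify_vmkernel is a hypothetical helper name, it assumes awk is available in your shell, and it only knows the 0x5 0x25 sense code plus the keyword patterns from the grep commands above. PDL wins over APD because it is the explicit signal.

```shell
# classify_vmkernel: read vmkernel log text on stdin, print PDL, APD, or NONE.
# Hypothetical helper; the patterns mirror the grep commands used earlier.
classify_vmkernel() {
    awk '
        # Sense key 0x5 with ASC 0x25 means the array reported the LUN as gone.
        /Valid sense data: 0x5 0x25/ { pdl++ }
        /[Pp]ermanent [Dd]evice [Ll]oss|permanently inaccessible/ { pdl++ }
        /APD|[Aa]ll [Pp]aths [Dd]own/ { apd++ }
        END {
            if (pdl > 0)      print "PDL"
            else if (apd > 0) print "APD"
            else              print "NONE"
        }
    '
}
```

On the host you would feed it the live log: classify_vmkernel < /var/log/vmkernel.log. Treat the answer as a hint and still read the matching lines yourself.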
Resolving APD
Step 1: Check Physical Connectivity
Before touching software, check the hardware first. I once spent 20 minutes troubleshooting on the CLI before discovering that a datacenter operator had accidentally unplugged an FC cable. An expensive lesson.
# Check HBA port status
esxcli storage san fc list
# For iSCSI, check network connectivity
esxcli iscsi adapter list
esxcli iscsi session list
# Ping storage IP from VMkernel interface
vmkping -I vmk1 192.168.10.50
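With more than one iSCSI target, pinging them one by one gets tedious at 2 AM. Here is a small sketch of a loop around vmkping. The function name check_targets and the PING_CMD override are assumptions of mine (the override exists only so the loop can be dry-run away from a host), and any second target IP you pass is your own.

```shell
# check_targets VMK IP...
# Pings each storage target from the given VMkernel interface and prints
# "OK <ip>" or "FAIL <ip>" per target. Returns non-zero if any target fails.
# PING_CMD is a hypothetical override; by default vmkping is used.
check_targets() {
    vmk=$1
    shift
    failed=0
    for ip in "$@"; do
        if ${PING_CMD:-vmkping} -I "$vmk" -c 3 "$ip" >/dev/null 2>&1; then
            echo "OK $ip"
        else
            echo "FAIL $ip"
            failed=1
        fi
    done
    return "$failed"
}
```

Example call on a host: check_targets vmk1 192.168.10.50. A FAIL line tells you the problem is in the network path, before you spend time on the storage stack.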
Step 2: Rescan Storage Adapters
Physical connectivity confirmed? Rescan so ESXi can rediscover the paths:
# Rescan all storage adapters
esxcli storage core adapter rescan --all
# Or rescan a specific adapter (e.g. vmhba1)
esxcli storage core adapter rescan --adapter vmhba1
After the rescan, VMs typically resume on their own if connectivity has been restored. No VM restart needed.
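To confirm the rescan actually brought the paths back, it helps to list whatever is still dead. This sketch assumes the usual layout of the esxcli storage core path list output, where each path block has a "Runtime Name:" line before its "State:" line; dead_paths is a hypothetical helper name.

```shell
# dead_paths: read `esxcli storage core path list` output on stdin and print
# the runtime name of every path whose state is "dead".
dead_paths() {
    awk '
        /^ *Runtime Name:/ { name = $NF }   # remember the current path
        /^ *State: dead/    { print name }   # report it if it is dead
    '
}
```

Run it as esxcli storage core path list | dead_paths. Empty output after a rescan means every path the host knows about is alive again.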
Customizing APD Timeout and Response
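By default, ESXi waits 140 seconds (Misc.APDTimeout) before declaring an APD timeout. After that, if VM Component Protection is enabled in vSphere HA, HA can terminate hung VMs and restart them on a healthy host instead of leaving them suspended indefinitely. First make sure host-level APD handling is enabled on every host, via PowerCLI: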
# Connect to vCenter
Connect-VIServer -Server vcenter.company.com
# Enable host-level APD handling (Misc.APDHandlingEnable) on all hosts in the cluster
# HA's response to hung VMs is configured separately in the cluster's vSphere HA settings
$cluster = Get-Cluster "Production-Cluster"
Get-VMHost -Location $cluster | ForEach-Object {
    $esxHost = $_
    Get-AdvancedSetting -Entity $esxHost -Name "Misc.APDHandlingEnable" |
        Set-AdvancedSetting -Value 1 -Confirm:$false
    Write-Host "APD handling enabled on $($esxHost.Name)"
}
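When you are already on a host over SSH, the same settings can be read without PowerCLI. This is a sketch under two assumptions: that the settings live at /Misc/APDHandlingEnable and /Misc/APDTimeout in the advanced options tree, and that the esxcli output contains an "Int Value:" line. The helper name advanced_int_value is mine.

```shell
# advanced_int_value: read `esxcli system settings advanced list -o <path>`
# output on stdin and print the current integer value.
# The "Default Int Value:" line is skipped because it does not start with "Int".
advanced_int_value() {
    awk -F': *' '/^ *Int Value:/ { print $2; exit }'
}
```

Usage on the host: esxcli system settings advanced list -o /Misc/APDTimeout | advanced_int_value, and the same with /Misc/APDHandlingEnable. A quick loop over all 8 hosts catches the one that someone changed by hand.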
Resolving PDL
PDL is far more painful because VMs will never self-recover. A colleague of mine once accidentally deleted a LUN on a NetApp during maintenance — 12 VMs dropped into PDL instantly, not one of them bootable. It took nearly 3 hours to recover everything.
Step 1: Identify Affected VMs
# Find VMs affected by PDL
Get-VM | Get-View | Where-Object {
    $_.Runtime.ConnectionState -eq "inaccessible" -or
    ($_.Runtime.PowerState -eq "poweredOn" -and
     $_.Runtime.ConnectionState -eq "disconnected")
} | Select-Object Name, @{N="ConnectionState";E={$_.Runtime.ConnectionState}}
Step 2: Force Power Off Hung VMs
VMs in a PDL state cannot be shut down normally. You need to kill them:
# SSH into the ESXi host running the VM
# Find the VM's world ID
esxcli vm process list
# Kill the VM (replace 12345 with the actual world ID); start with a hard kill
esxcli vm process kill --type=hard --world-id=12345
# If hard doesn't work, use force as the last resort
esxcli vm process kill --type=force --world-id=12345
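With a dozen hung VMs, typing kill commands one at a time is slow. Below is a rough sketch of an escalation loop that goes from the gentlest kill type to the harshest (soft, then hard, then force) and stops as soon as the world ID disappears from the process list. The function name kill_vm_escalating and the WAIT_SECS knob are hypothetical, and the 10-second default pause is an arbitrary choice.

```shell
# kill_vm_escalating WORLD_ID
# Tries soft, then hard, then force, stopping once the world is gone.
# WAIT_SECS (hypothetical knob) is the pause after each attempt.
kill_vm_escalating() {
    world_id=$1
    for type in soft hard force; do
        esxcli vm process kill --type="$type" --world-id="$world_id" || true
        sleep "${WAIT_SECS:-10}"
        # The VM is gone once its world ID no longer appears in the list.
        if ! esxcli vm process list |
             awk -v id="$world_id" \
                 '$1 == "World" && $2 == "ID:" && $3 == id { found = 1 }
                  END { exit !found }'; then
            echo "stopped with $type"
            return 0
        fi
    done
    echo "still running after force" >&2
    return 1
}
```

Call it with the world ID from esxcli vm process list, for example kill_vm_escalating 12345. If even force fails, the VM process is usually stuck on the dead storage itself and the storage side has to be dealt with first.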
Step 3: Restore Storage and Re-register VMs
After fixing the underlying storage issue (recreating the LUN, remapping, or restoring from a storage array backup snapshot):
# Rescan so ESXi rediscovers the datastore
esxcli storage core adapter rescan --all
# Verify the datastore is accessible
esxcli storage vmfs extent list
# Re-register VMs if they were removed from inventory (PowerCLI, not the ESXi shell)
# "datacenter" in the vmstore: path stands for your datacenter's name
$datastore = Get-Datastore "DS-Production-01"
$resourcePool = Get-ResourcePool "Resources" -Location (Get-Cluster "Production-Cluster")
$vmxFiles = Get-ChildItem -Path "vmstore:\datacenter\$($datastore.Name)" -Recurse -Filter "*.vmx"
foreach ($vmx in $vmxFiles) {
    New-VM -VMFilePath $vmx.DatastoreFullPath -ResourcePool $resourcePool
    Write-Host "Registered: $($vmx.Name)"
}
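If vCenter itself is unavailable during the incident, registration can also be done from the ESXi shell. This is only a sketch: list_vmx, register_all, and the REGISTER_CMD override are hypothetical names, the search depth assumes the usual datastore/vm-folder/vm.vmx layout, and registering directly on a host places the VM in that host's inventory rather than in a chosen resource pool.

```shell
# list_vmx DATASTORE_PATH
# Prints every .vmx file found up to two levels below the datastore root.
list_vmx() {
    find "$1" -maxdepth 2 -name '*.vmx' -type f | sort
}

# register_all DATASTORE_PATH
# Registers each .vmx with the local host. REGISTER_CMD is a hypothetical
# override for dry runs; by default it calls vim-cmd solo/registervm.
register_all() {
    list_vmx "$1" | while IFS= read -r vmx; do
        ${REGISTER_CMD:-vim-cmd solo/registervm} "$vmx"
    done
}
```

A dry run such as REGISTER_CMD=echo followed by register_all /vmfs/volumes/DS-Production-01 prints the files that would be registered, which is worth checking before doing it for real.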
Monitoring Script for Early Detection
Getting woken up at 2 AM three times was enough. I wrote this script to run every 5 minutes via Task Scheduler — it sends alerts before an incident has a chance to escalate:
# apd-pdl-monitor.ps1 — run every 5 minutes via Task Scheduler or cron
Connect-VIServer -Server vcenter.company.com -User admin -Password $env:VC_PASS
$alerts = @()
# Check all datastores
Get-Datastore | ForEach-Object {
    $ds = $_
    if ($ds.State -ne "Available") {
        $alerts += "[CRITICAL] Datastore '$($ds.Name)' state: $($ds.State)"
    }
    # Check mount status on every host that has this datastore attached
    $ds.ExtensionData.Host | ForEach-Object {
        $mountInfo = $_.MountInfo
        if (-not $mountInfo.Accessible) {
            # $_.Key is the host's managed object reference, not its display name
            $alerts += "[CRITICAL] Datastore '$($ds.Name)' inaccessible on host $($_.Key)"
        }
    }
}
# Check for inaccessible VMs
Get-VM | Where-Object {$_.ExtensionData.Runtime.ConnectionState -eq "inaccessible"} | ForEach-Object {
    $alerts += "[WARNING] VM '$($_.Name)' is inaccessible (possible APD/PDL)"
}
if ($alerts.Count -gt 0) {
    $body = $alerts -join "`n"
    # Send email or Slack/Teams webhook
    Send-MailMessage -To "[email protected]" -Subject "vSphere Storage Alert" -Body $body -SmtpServer "smtp.company.com"
    Write-Host $body
}
Disconnect-VIServer -Confirm:$false
Lessons from Late-Night Incidents
After going through those situations, I’ve distilled four things to do immediately when storage starts acting up:
- Don’t restart the ESXi host right away — it’s a natural reflex, but the wrong one. Restarting a host with running VMs risks additional data loss.
- Read vmkernel.log first — 5 minutes reading logs saves 2 hours of blind troubleshooting.
- Distinguish APD from PDL before taking action — APD can resolve itself, PDL cannot. Waiting in the wrong place is wasted time.
- Configure HA/APD response in vSphere HA: go to Cluster Settings → vSphere HA → Advanced Options and set das.config.fdm.apd.timeout to your needs so HA handles incidents automatically while you’re asleep.
And the most fundamental thing: multipath must be configured correctly from the start. A single-path LUN is APD waiting to happen. You need a minimum of 2 independent paths — ideally 4 (2 HBAs × 2 fabrics).
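The path-count rule is easy to audit from the ESXi shell. Here is a minimal sketch that counts paths per device in the esxcli storage core path list output and flags anything below a threshold. The helper name single_path_devices and the MIN_PATHS variable are hypothetical, and the count includes every configured path regardless of state, so combine it with a dead-path check for the full picture.

```shell
# single_path_devices: read `esxcli storage core path list` output on stdin
# and print "<device> <path count>" for each device with fewer than
# MIN_PATHS paths (default 2).
single_path_devices() {
    awk -v min="${MIN_PATHS:-2}" '
        # "Device Display Name:" lines do not match: the colon must follow "Device".
        /^ *Device:/ { count[$NF]++ }
        END {
            for (dev in count)
                if (count[dev] < min) print dev, count[dev]
        }
    ' | sort
}
```

Run it as esxcli storage core path list | single_path_devices, or set MIN_PATHS=4 first to enforce the 2 HBAs × 2 fabrics target. Local boot disks will legitimately show a single path, so expect to filter those out.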
