Handling APD and PDL Errors in VMware vSphere: Understanding and Resolving All Paths Down and Permanent Device Loss

VMware tutorial - IT technology blog

2 AM. My phone buzzes. A vCenter alert: a wave of VMs showing Not Responding. I open the console — datastores have turned yellow, vmkernel logs are lit up red. This is a scenario I’ve been through three times over the years managing a VMware cluster with 8 ESXi hosts at my company — and every single time, the first question is always: is this APD or PDL?

The two conditions look identical on the surface — VMs frozen, I/O timeouts, storage inaccessible — but the root causes and remediation steps are completely different. Confusing one for the other can cost you an extra hour of downtime, or worse, corrupt your data.

What’s the Difference Between APD and PDL?

All Paths Down (APD)

APD occurs when all paths from an ESXi host to a storage device are lost, but ESXi cannot yet determine whether the loss is temporary or permanent. The host keeps retrying I/O to the device. If no path returns within the APD timeout (140 seconds by default, controlled by the Misc.APDTimeout advanced setting), ESXi marks the device as APD timed out and starts fast-failing non-VM I/O to it.

Common causes:

  • Broken FC cable, failed SFP, unexpected SAN switch restart
  • iSCSI network disruption (VLAN misconfiguration, NIC flap)
  • Storage array temporarily offline due to unannounced maintenance
  • Zoning changes on an FC switch

Permanent Device Loss (PDL)

PDL is a step more severe than APD. The storage array returns a SCSI sense code confirming the device no longer exists. The most common is 0x5 0x25 0x0 (logical unit not supported); VMware also treats 0x4 0x4c 0x0, 0x4 0x3e 0x3, and 0x4 0x3e 0x1 (logical unit failure variants) as PDL. ESXi receives this unambiguous signal and immediately knows the loss is permanent, with no need to wait for a timeout.

Situations that lead to PDL:

  • A LUN is deleted or unmapped directly on the storage array
  • Incomplete storage array failover
  • A vSAN disk group fails entirely
  • HBA driver crash causing the device to detach at the kernel level

Quick Comparison

  • APD: Unclear whether temporary or permanent — ESXi waits 140s then enters timeout state, VMs hang, may self-recover when connectivity is restored
  • PDL: Confirmed permanent immediately via SCSI sense code — instant response, VMs receive I/O errors, manual intervention required for recovery

Determining Whether You Have APD or PDL

Don’t guess. SSH into the affected ESXi host and run these commands immediately:

# Check path status to storage devices
esxcli storage nmp path list | grep -A5 "State"

# Find APD/PDL events in vmkernel log
grep -i "APD\|PDL\|permanent device\|all paths down" /var/log/vmkernel.log | tail -50

# Check specific device status
esxcli storage nmp device list

# View SCSI sense codes (critical for distinguishing PDL)
grep -i "H:0x0 D:0x2 P:0x0 Valid sense\|sense data" /var/log/vmkernel.log | tail -20

At the same time, open the vCenter Event Log and filter by the affected host. APD will show the event esx.problem.storage.apd.start, while PDL will show esx.problem.scsi.device.state.permanentloss. Seeing either one tells you exactly what you’re dealing with.
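To make the sense-code check less error-prone at 2 AM, here is a minimal sketch of a filter that labels each sense-data line as PDL or not. It assumes the PDL triplets VMware documents (0x5 0x25 0x0, 0x4 0x4c 0x0, 0x4 0x3e 0x3, 0x4 0x3e 0x1) and the usual "Valid sense data:" wording in vmkernel.log. The function name classify_sense and the two sample lines are made up for illustration; the samples are shortened, not copied from a real host.

```shell
#!/bin/sh
# classify_sense (hypothetical helper): read vmkernel log lines on stdin and
# label every line carrying valid SCSI sense data as PDL (a triplet VMware
# documents as permanent device loss) or OTHER (any other check condition).
classify_sense() {
    awk '
    /Valid sense data:/ {
        # Take the three hex values that follow "Valid sense data: "
        split($0, parts, "Valid sense data: ")
        split(parts[2], s, " ")
        key = tolower(s[1]); asc = tolower(s[2]); ascq = tolower(s[3])
        sub(/\.$/, "", ascq)        # drop the trailing period, if present
        triplet = key " " asc " " ascq
        if (triplet == "0x5 0x25 0x0" || triplet == "0x4 0x4c 0x0" ||
            triplet == "0x4 0x3e 0x3" || triplet == "0x4 0x3e 0x1")
            print "PDL   " triplet
        else
            print "OTHER " triplet
    }'
}

# Illustrative, shortened sample lines (device names are placeholders)
sample='ScsiDeviceIO: Cmd 0x28 to dev "naa.example1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.
ScsiDeviceIO: Cmd 0x2a to dev "naa.example2" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.'

printf '%s\n' "$sample" | classify_sense
```

On a host, something like `classify_sense < /var/log/vmkernel.log | sort | uniq -c` gives a quick count per sense triplet. Any PDL line means you should stop waiting for the paths to come back.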

Resolving APD

Step 1: Check Physical Connectivity

Before touching software, check the hardware first. I once spent 20 minutes troubleshooting on the CLI before discovering that a datacenter operator had accidentally unplugged an FC cable. An expensive lesson.

# Check HBA port status
esxcli storage san fc list

# For iSCSI, check network connectivity
esxcli iscsi adapter list
esxcli iscsi session list

# Ping storage IP from VMkernel interface
vmkping -I vmk1 192.168.10.50

Step 2: Rescan Storage Adapters

Physical connectivity confirmed? Rescan so ESXi can rediscover the paths:

# Rescan all storage adapters
esxcli storage core adapter rescan --all

# Or rescan a specific adapter (e.g. vmhba1)
esxcli storage core adapter rescan --adapter vmhba1

After the rescan, VMs typically resume on their own if connectivity has been restored. No VM restart needed.

Customizing APD Timeout and Response

By default, ESXi waits 140 seconds (Misc.APDTimeout) before declaring an APD timeout. Host-side APD handling is controlled by Misc.APDHandlingEnable, which is on by default. With VM Component Protection enabled in vSphere HA, HA can then power off hung VMs and restart them on a healthy host instead of leaving them suspended indefinitely. To make sure host-side APD handling is enabled across a cluster, use PowerCLI:

# Connect to vCenter
Connect-VIServer -Server vcenter.company.com

# Enable APD handling on all hosts in the cluster
# (the HA response itself is configured in the cluster's vSphere HA settings)
$cluster = Get-Cluster "Production-Cluster"
Get-VMHost -Location $cluster | ForEach-Object {
    $esxHost = $_
    Get-AdvancedSetting -Entity $esxHost -Name "Misc.APDHandlingEnable" |
        Set-AdvancedSetting -Value 1 -Confirm:$false
    Write-Host "APD handling enabled on $($esxHost.Name)"
}
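If you want to double-check the timeout from the host itself, the 140-second value lives in the Misc.APDTimeout advanced setting. The sketch below only parses the output of `esxcli system settings advanced list`; the helper name apd_setting_value is made up, and the sample text is an abbreviated illustration of what that command prints, not a capture from a real host.

```shell
#!/bin/sh
# apd_setting_value (hypothetical helper): read the output of
#   esxcli system settings advanced list -o /Misc/APDTimeout
# on stdin and print the current integer value.
apd_setting_value() {
    # Match only the "Int Value:" line, not "Default Int Value:"
    awk -F': *' '/^ *Int Value:/ { print $2; exit }'
}

# Abbreviated, illustrative sample of the command output
sample='   Path: /Misc/APDTimeout
   Type: integer
   Int Value: 140
   Default Int Value: 140'

printf '%s\n' "$sample" | apd_setting_value
```

On an ESXi host the real pipeline would be `esxcli system settings advanced list -o /Misc/APDTimeout | apd_setting_value`. I would leave the value at its default unless your storage vendor recommends otherwise.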

Resolving PDL

PDL is far more painful because VMs will never self-recover. A colleague of mine once accidentally deleted a LUN on a NetApp during maintenance — 12 VMs dropped into PDL instantly, not one of them bootable. It took nearly 3 hours to recover everything.

Step 1: Identify Affected VMs

# Find VMs affected by PDL
Get-VM | Get-View | Where-Object {
    $_.Runtime.ConnectionState -eq "inaccessible" -or
    ($_.Runtime.PowerState -eq "poweredOn" -and
    $_.Runtime.ConnectionState -eq "disconnected")
} | Select-Object Name, @{N="ConnectionState";E={$_.Runtime.ConnectionState}}

Step 2: Force Power Off Hung VMs

VMs in a PDL state cannot be shut down normally. You need to kill them:

# SSH into the ESXi host running the VM
# Find the VM's world ID
esxcli vm process list

# Kill the VM immediately (replace 12345 with the actual world ID)
esxcli vm process kill --type=hard --world-id=12345

# If hard doesn't work, use force as the last resort
esxcli vm process kill --type=force --world-id=12345
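Picking the right world ID out of a long process list by eye is easy to get wrong under pressure. Here is a small sketch that looks it up by display name and prints the kill commands in escalating order (soft, then hard, then force) without running them. The helper name world_id_for and the VM names in the sample are invented for illustration, and the sample output is abbreviated.

```shell
#!/bin/sh
# world_id_for (hypothetical helper): read `esxcli vm process list` output on
# stdin and print the World ID of the VM whose Display Name equals $1.
world_id_for() {
    awk -v name="$1" '
    /^ *World ID:/     { wid = $NF }          # remember the latest World ID
    /^ *Display Name:/ {
        dn = $0
        sub(/^ *Display Name: */, "", dn)
        if (dn == name) { print wid; exit }
    }'
}

# Abbreviated, illustrative sample of `esxcli vm process list`
sample='web01
   World ID: 12345
   Process ID: 0
   Display Name: web01
   Config File: /vmfs/volumes/example/web01/web01.vmx

db01
   World ID: 23456
   Process ID: 0
   Display Name: db01
   Config File: /vmfs/volumes/example/db01/db01.vmx'

wid=$(printf '%s\n' "$sample" | world_id_for db01)

# Dry run: print the escalation sequence instead of executing it
for type in soft hard force; do
    echo "esxcli vm process kill --type=$type --world-id=$wid"
done
```

On a host you would feed it the live list, for example `esxcli vm process list | world_id_for db01`, and run the printed commands one at a time, checking after each whether the VM is gone.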

Step 3: Restore Storage and Re-register VMs

After fixing the underlying storage issue (recreating the LUN, remapping, or restoring from a storage array backup snapshot):

# Rescan so ESXi rediscovers the datastore
esxcli storage core adapter rescan --all

# Verify the datastore is accessible
esxcli storage vmfs extent list

# PowerCLI: re-register VMs if they were removed from inventory
# (replace "datacenter" in the vmstore path with your datacenter's name)
$datastore = Get-Datastore "DS-Production-01"
$resourcePool = Get-ResourcePool "Resources" -Location (Get-Cluster "Production-Cluster")
$vmxFiles = Get-ChildItem -Path "vmstore:\datacenter\$($datastore.Name)" -Recurse -Filter "*.vmx"

foreach ($vmx in $vmxFiles) {
    New-VM -VMFilePath $vmx.DatastoreFullPath -ResourcePool $resourcePool
    Write-Host "Registered: $($vmx.Name)"
}

Monitoring Script for Early Detection

Getting woken up at 2 AM three times was enough. I wrote this script to run every 5 minutes via Task Scheduler — it sends alerts before an incident has a chance to escalate:

# apd-pdl-monitor.ps1 — run every 5 minutes via Task Scheduler or cron
Connect-VIServer -Server vcenter.company.com -User admin -Password $env:VC_PASS

$alerts = @()

# Check all datastores
Get-Datastore | ForEach-Object {
    $ds = $_
    if ($ds.State -ne "Available") {
        $alerts += "[CRITICAL] Datastore '$($ds.Name)' state: $($ds.State)"
    }

    # Check host mount status
    $ds.ExtensionData.Host | ForEach-Object {
        $mountInfo = $_.MountInfo
        if (-not $mountInfo.Accessible) {
            $alerts += "[CRITICAL] Datastore '$($ds.Name)' inaccessible on host $($_.Key)"
        }
    }
}

# Check for inaccessible VMs
Get-VM | Where-Object {$_.ExtensionData.Runtime.ConnectionState -eq "inaccessible"} | ForEach-Object {
    $alerts += "[WARNING] VM '$($_.Name)' is inaccessible — possible APD/PDL"
}

if ($alerts.Count -gt 0) {
    $body = $alerts -join "`n"
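    # Send email or Slack/Teams webhook
    # (-From is mandatory for Send-MailMessage; the sender address below is a placeholder)
    Send-MailMessage -From "vsphere-alerts@company.com" -To "[email protected]" -Subject "vSphere Storage Alert" -Body $body -SmtpServer "smtp.company.com"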
    Write-Host $body
}

Disconnect-VIServer -Confirm:$false

Lessons from Late-Night Incidents

After going through those situations, I’ve distilled four things to do immediately when storage starts acting up:

  1. Don’t restart the ESXi host right away — it’s a natural reflex, but the wrong one. Restarting a host with running VMs risks additional data loss.
  2. Read vmkernel.log first — 5 minutes reading logs saves 2 hours of blind troubleshooting.
  3. Distinguish APD from PDL before taking action — APD can resolve itself, PDL cannot. Waiting in the wrong place is wasted time.
  4. Configure the APD/PDL response in vSphere HA: in the cluster's vSphere HA settings, under Failures and Responses, set the "Datastore with PDL" and "Datastore with APD" responses (VM Component Protection) so HA handles incidents automatically while you’re asleep.

And the most fundamental thing: multipath must be configured correctly from the start. A single-path LUN is APD waiting to happen. You need a minimum of 2 independent paths — ideally 4 (2 HBAs × 2 fabrics).
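A quick way to audit the two-path minimum is to count active paths per device. The sketch below parses `esxcli storage core path list` output and reports every device with fewer active paths than a threshold (2 by default). The helper name low_path_devices, the device names, and the trimmed sample are all made up for illustration.

```shell
#!/bin/sh
# low_path_devices (hypothetical helper): read `esxcli storage core path list`
# output on stdin and print "<device> <active path count>" for every device
# with fewer than $1 active paths (default: 2).
low_path_devices() {
    awk -v min="${1:-2}" '
    /^ *Device:/ { dev = $NF; if (!(dev in active)) active[dev] = 0 }
    /^ *State:/  { if ($NF == "active") active[dev]++ }
    END {
        for (d in active)
            if (active[d] < min)
                print d, active[d]
    }' | sort
}

# Trimmed, illustrative sample: naa.aaa has two active paths,
# naa.bbb has one active and one dead path
sample='path-1
   Runtime Name: vmhba1:C0:T0:L0
   Device: naa.aaa
   State: active
path-2
   Runtime Name: vmhba2:C0:T0:L0
   Device: naa.aaa
   State: active
path-3
   Runtime Name: vmhba1:C0:T1:L1
   Device: naa.bbb
   State: active
path-4
   Runtime Name: vmhba2:C0:T1:L1
   Device: naa.bbb
   State: dead'

printf '%s\n' "$sample" | low_path_devices
```

On a host, `esxcli storage core path list | low_path_devices 2` should print nothing when multipathing is healthy. Local boot disks normally have a single path, so expect to filter those out.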
