ESX/PSOD

PSOD
http://i.snag.gy/3nZHO.jpg

A PSOD (Purple Screen of Death) is the VMware ESX version of a Windows BSOD (Blue Screen of Death). It occurs when the kernel panics and can no longer function. The most common causes for a PSOD are:


 * Hardware failure
 * Out of memory
 * Hung CPU conditions
 * Misbehaving drivers (null pointers, invalid memory access, etc.)
 * NMI (non-maskable interrupt)

When a PSOD occurs, one should collect the following:


 * Screenshot of PSOD kernel stack trace screen (if possible)
 * Support logs from the vm-support command
 * Kernel log (should be included in vm-support, but better safe than sorry)
 * Kernel core dump (only needed if a developer asks for it)

If the cause of the PSOD isn't obvious from the PSOD kernel stack trace screen, then the kernel log is the second best place to look for the cause of a kernel panic.

Core Dump Extract
To manually collect the kernel log:

Quick Log and Dump collection:

    esxcfg-dumppart -L /vmfs/devices/disks/$( esxcfg-dumppart --get-active | awk '{print $1}' )
    esxcfg-dumppart -C -D /vmfs/devices/disks/$( esxcfg-dumppart --get-active | awk '{print $1}' ) -z `pwd`/vmkernel-zdump

 * Output: vmkernel-log.1 and vmkernel-zdump.1
 * ESXi 5.x will put the kernel dump here: /scratch/core/vmkernel-zdump.*
 * Variant without -z: esxcfg-dumppart -C -D /vmfs/devices/disks/$( esxcfg-dumppart --get-active | awk '{print $1}' )

To extract the kernel log (vmkernel-log.1) from an existing kernel dump (vm-support:/var/core/vmkernel-zdump.1):

    esxcfg-dumppart -L vmkernel-zdump.1

The core dumps are also collected as part of the vm-support tool collection: vm-support

vmkernel dump version mismatch
If the esxcfg-dumppart version doesn't match the vmkernel dump:

    Error running command. Unable to extract log. Error: vmkernel dump version mismatch! Expected version: 196648, this dump file: 196647

vmkernel dump versions:

 * 131106 - VMware ESXi 5.0.0 GA
 * 131106 - VMware ESXi 5.0.0 Update 1
 * 131106 - VMware ESXi 5.0.0 Update 2
 * 131106 - VMware ESXi 5.0.0 Update 3
 * 196647 - VMware ESXi 5.1.0 GA
 * 196647 - VMware ESXi 5.1.0 Update 1
 * 196648 - VMware ESXi 5.1.0 Update 2
 * 262193 - VMware ESXi 5.5.0 GA
 * 262194 - VMware ESXi 5.5.0 Update 1
 * 262194 - VMware ESXi 5.5.0 Update 2
 * 52 - VMware ESXi 6.0.0 GA (rc, may actually change with actual release)
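
The version list above can be turned into a small lookup for decoding the two numbers in a mismatch error. This is only a sketch: the helper names dump_version_release and mismatch_versions are hypothetical (they are not VMware tools), and the mapping covers only the 5.x versions listed above. The 6.0 entry is left out because the list marks it as provisional.

```shell
#!/bin/sh
# Map a vmkernel dump version number to the ESXi release(s) listed above.
# Hypothetical helper; unknown numbers are reported as "unknown".
dump_version_release() {
    case "$1" in
        131106) echo "ESXi 5.0.0 GA through Update 3" ;;
        196647) echo "ESXi 5.1.0 GA or Update 1" ;;
        196648) echo "ESXi 5.1.0 Update 2" ;;
        262193) echo "ESXi 5.5.0 GA" ;;
        262194) echo "ESXi 5.5.0 Update 1 or Update 2" ;;
        *)      echo "unknown" ;;
    esac
}

# Pull "expected" and "actual" version numbers out of the error text (stdin).
mismatch_versions() {
    sed -n 's/.*Expected version: \([0-9]*\), this dump file: \([0-9]*\).*/\1 \2/p'
}

err='Error: vmkernel dump version mismatch! Expected version: 196648, this dump file: 196647'
printf '%s\n' "$err" | mismatch_versions | while read -r expected actual; do
    echo "tool expects: $(dump_version_release "$expected")"
    echo "dump is from: $(dump_version_release "$actual")"
done
```

For the error shown above, this reports that the tool expects a 5.1.0 Update 2 dump while the dump file comes from 5.1.0 GA or Update 1, so the log should be extracted with esxcfg-dumppart from a matching release.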

Commands
List core dump partitions:

    esxcfg-dumppart --list
    esxcfg-dumppart --get-config

List active core dump partitions:

    esxcfg-dumppart --get-active

Quick Log and Dump extract:

    esxcfg-dumppart -L $( esxcfg-dumppart --get-active | awk '{print $2}' )
    esxcfg-dumppart -C -D /vmfs/devices/disks/$( esxcfg-dumppart --get-active | awk '{print $1}' )

 * Output: vmkernel-log.1 and vmkernel-zdump.1

ESXi sometimes dumps to /scratch/core/vmkernel-zdump.1. Either move that file to the current directory, or name the output explicitly with -z:

    mv /scratch/core/vmkernel-zdump.1 .
    esxcfg-dumppart -C -D /vmfs/devices/disks/$( esxcfg-dumppart --get-active | awk '{print $1}' ) -z /scratch/dump_out

Extract Log File from PSOD
Get core dump partition:

    esxcfg-dumppart --list
    esxcfg-dumppart --get-active   # second column

Extract kernel log:

    esxcfg-dumppart -L [CORE_DUMP_PARTITION]
    esxcfg-dumppart -L /dev/sda2   # esx
    esxcfg-dumppart -L /vmfs/devices/disks/naa.60024e8073ba3100138d088b03c89bbf:7   # esxi

Tricky automatic kernel log extract:

    esxcfg-dumppart -L $( esxcfg-dumppart --get-active | awk '{print $2}' )
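
The one-liner above fails confusingly when no dump partition is active, because the command substitution expands to nothing. A slightly more defensive sketch follows. It assumes, as the commands above do, that --get-active prints the partition name in the first column and the full device path in the second; active_dump_path is a hypothetical helper name.

```shell
#!/bin/sh
# Print the device path (second column) of the active dump partition.
# Reads `esxcfg-dumppart --get-active` output on stdin and exits non-zero
# when no line with at least two columns is present.
active_dump_path() {
    awk 'NF >= 2 { print $2; found = 1; exit } END { exit !found }'
}

# Only attempt the real extraction on a host that has the tool.
if command -v esxcfg-dumppart >/dev/null 2>&1; then
    if dev=$(esxcfg-dumppart --get-active | active_dump_path); then
        esxcfg-dumppart -L "$dev"    # creates vmkernel-log.N
    else
        echo "no active dump partition found" >&2
    fi
fi
```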

VMware KB: Extracting the log file after an ESX or ESXi host fails with a purple screen error - http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1006796
 * This article provides steps to extract a log from a vmkernel-zdump file after a purple diagnostic screen error. This log contains similar information to that seen on the purple diagnostic screen and can be used in further troubleshooting.

Extract the log file from a vmkernel-zdump file using a command line utility on the ESX or ESXi host. This utility differs for different versions of ESX or ESXi.

For ESX 3.0 and 3.5, use the vmkdump utility:

    # vmkdump -l

For ESXi 3.5, ESX and ESXi 4.x, use the esxcfg-dumppart utility:

    # esxcfg-dumppart -L

To extract the log file from a vmkernel-zdump file:

Find the vmkernel-zdump file in the /root/ or /var/core/ directory:

    # ls /root/vmkernel* /var/core/vmkernel*
    /var/core/vmkernel-zdump-073108.09.16.1

Use the vmkdump or esxcfg-dumppart utility to extract the log. For example:

    # vmkdump -l /var/core/vmkernel-zdump-073108.09.16.1
    created file vmkernel-log.1

    # esxcfg-dumppart -L /var/core/vmkernel-zdump-073108.09.16.1
    created file vmkernel-log.1

The vmkernel-log.1 file is plain text, though it may start with null characters. Focus on the end of the log, which looks similar to:

    VMware ESX Server [Releasebuild-98103]
    PCPU 1 locked up. Failed to ack TLB invalidate.
    frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c
    es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff
    eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff
    ...

Note: The file name created for the log in this example is vmkernel-log.1. If another file with the same name already exists, the new file is created with the number suffix incremented.
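
Since the interesting part is at the end of the log and the file may contain null characters, a small helper can strip them and print only the tail. This is a sketch using standard tr and tail; vmk_log_tail is a hypothetical name and the default of 40 lines is an arbitrary choice.

```shell
#!/bin/sh
# Print the last N lines (default 40) of an extracted kernel log,
# with any null bytes removed first.
vmk_log_tail() {
    tr -d '\000' < "$1" | tail -n "${2:-40}"
}

# Usage, assuming the log was extracted as vmkernel-log.1 in this directory:
if [ -f vmkernel-log.1 ]; then
    vmk_log_tail vmkernel-log.1 40
fi
```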

Copy Core Dump
Use '-C' or '--copy':

    esxcfg-dumppart -C

To copy a new zdump only:

    esxcfg-dumppart --copy --devname /vmfs/devices/disks/naa.xxxxx:x --newonly --zdumpname esxdump

Tricky automatic core dump:

    esxcfg-dumppart -C -D /vmfs/devices/disks/$( esxcfg-dumppart --get-active | awk '{print $1}' )

Deactivate Core Dump
Deactivate the active partition:

    esxcfg-dumppart --deactivate

Activate the configured partition:

    esxcfg-dumppart --activate

Wipe dump partition (it must be deactivated first):

    esxcfg-dumppart --deactivate
    dd if=/dev/zero of=$( esxcfg-dumppart --get-config | awk '{print $2}' ) count=512 conv=notrunc
    esxcfg-dumppart --activate

 * Without count=512 the whole partition is zeroed, which takes longer needlessly: dd if=/dev/zero of=$( esxcfg-dumppart --get-config | awk '{print $2}' ) conv=notrunc

Set and activate a partition:

    esxcfg-dumppart --set $( esxcfg-dumppart --get-config | awk '{print $1}' )

 * General form: esxcfg-dumppart --set [DEVICE]:[PARTITION]
 * Or: esxcfg-dumppart --set /vmfs/devices/disks/[DEVICE]:[PARTITION]

Dump Partition Example
In this example, partition 7 is the dump partition (VMware ESXi 5.0.0 build-623860).

Get Partitions (interesting that the type for 7 doesn't indicate it is a dump partition here). For kicks, the columns are "partNum startSector endSector type attr":

    # partedUtil get $( esxcfg-dumppart --get-config | awk '{print $2}' | awk -F: '{print $1}' )
    36468 255 63 585871964
    1 64 8191 0 128
    5 8224 520191 0 0
    6 520224 1032191 0 0
    7 1032224 1257471 0 0
    8 1257504 1843199 0 0
    2 1843200 10229759 0 0
    3 10229760 585871930 0 0

Get Partitions with named types and ugly-type-id:

    # partedUtil getptbl $( esxcfg-dumppart --get-config | awk '{print $2}' | awk -F: '{print $1}' )
    gpt
    36468 255 63 585871964
    1 64 8191 C12A7328F81F11D2BA4B00A0C93EC93B systemPartition 128        # root (4MB)
    5 8224 520191 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0          # /bootbank (260MB)
    6 520224 1032191 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0       # /altbootbank (260MB)
    7 1032224 1257471 9D27538040AD11DBBF97000C2911D1B8 vmkDiagnostic 0    # core dump partition (115MB)
    8 1257504 1843199 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0      # /store (300MB)
    2 1843200 10229759 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0     # /scratch (4.2GB)
    3 10229760 585871930 AA31E02A400F11DB9590000C2911D1B8 vmfs 0          # datastore1 (remaining space)

Get usable first and last sectors:

    # partedUtil getUsableSectors $( esxcfg-dumppart --get-config | awk '{print $2}' | awk -F: '{print $1}' )
    34 585871930

Also interesting:

    # df
    Filesystem         Bytes        Used    Available Use% Mounted on
    VMFS-5      294473695232  1326448640 293147246592   0% /vmfs/volumes/datastore1
    vfat          4293591040    98828288   4194762752   2% /vmfs/volumes/4fac472e-d1ddf2c4-a597-6431504f5534 (/scratch)
    vfat           261853184   134205440    127647744  51% /vmfs/volumes/95ec6872-9e0d6b2c-3537-b1c307ab1cf4 (/bootbank)
    vfat           261853184   147767296    114085888  56% /vmfs/volumes/b9d1ea75-466c705d-94db-eecb9f72749b (/altbootbank)
    vfat           299712512   188481536    111230976  63% /vmfs/volumes/4fac4727-706fb5d8-d1cf-6431504f5534 (/store)


 * /vmfs/volumes/datastore1 matches partition 3 (off by ~250MB)
 * /vmfs/volumes/4fac472e-d1ddf2c4-a597-6431504f5534 (/scratch) matches partition 2 (off by ~300KB)
 * /vmfs/volumes/95ec6872-9e0d6b2c-3537-b1c307ab1cf4 (/bootbank) matches partition 5 or 6 (off by ~300KB)
 * /vmfs/volumes/b9d1ea75-466c705d-94db-eecb9f72749b (/altbootbank) matches partition 5 or 6 (off by ~300KB)
 * /vmfs/volumes/4fac4727-706fb5d8-d1cf-6431504f5534 (/store) matches partition 8 (off by ~160KB)
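
The "off by" figures above come from comparing the raw partition size with the filesystem size that df reports. A minimal sketch of that arithmetic, assuming 512-byte sectors and an inclusive end sector (part_bytes is a hypothetical helper name):

```shell
#!/bin/sh
# Size in bytes of a partition given its start and end sectors,
# assuming 512-byte sectors. The end sector is inclusive.
part_bytes() {
    echo $(( ($2 - $1 + 1) * 512 ))
}

# Partition 5 from the partedUtil listing vs. the df "Bytes" value for /bootbank.
raw=$(part_bytes 8224 520191)
fs=261853184
echo "partition: $raw bytes, filesystem: $fs bytes, difference: $(( raw - fs )) bytes"
# prints: partition: 262127616 bytes, filesystem: 261853184 bytes, difference: 274432 bytes
```

The difference of 274432 bytes is 268 KiB, in line with the "~300KB" noted for the boot banks. Partitions 5 and 6 have the same size, which is why df alone cannot tell them apart.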

ESXi 5.0 Dump Collector
UDP port 6500

ESXi Network Dump Collector in VMware vSphere 5.0 - http://kb.vmware.com/kb/1032051

"The netdump protocol is used for sending coredumps from a failed ESXi host to the Dump Collector service. This service only supports IPv4. By default, this service listens on UDP port 6500. The network traffic is not encrypted, and there is no authentication or authorization mechanism to ensure the integrity or validity of any data received by the Dump Collector service. It is recommended that the VMkernel network used for network coredump collection be physically or logically segmented (such as a separate LAN/VLAN) to ensure that the traffic is not intercepted." 

---

Enable ESXi 5.x Dump Collector (example for esxlogger):

    # NOTE: Have to specify IP address!
    esxcli system coredump network set --interface-name vmk0 --server-ipv4 10.50.47.97 --server-port 6500
    # esxcli system coredump network set --interface-name vmk0 --server-ipv4 10.50.47.100 --server-port 6500
    esxcli system coredump network set --enable true
    auto-backup.sh
    # (Optional) Check that ESXi Dump Collector is configured correctly:
    esxcli system coredump network get
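
Because the server has to be given as an IP address rather than a hostname, a script that wraps these commands can check the value first. The following is a sketch only: is_ipv4 is a hypothetical helper, not part of esxcli, and the address and interface are the example values used above.

```shell
#!/bin/sh
# Succeed only if $1 is a dotted-quad IPv4 address (four numeric fields, each 0-255).
is_ipv4() {
    echo "$1" | awk -F. '
        NF != 4 { exit 1 }
        { for (i = 1; i <= 4; i++) if ($i !~ /^[0-9]+$/ || $i > 255) exit 1 }
    '
}

server=10.50.47.97    # example address from the text above
if is_ipv4 "$server"; then
    # Only touch the configuration on a host that actually has esxcli.
    if command -v esxcli >/dev/null 2>&1; then
        esxcli system coredump network set --interface-name vmk0 --server-ipv4 "$server" --server-port 6500
        esxcli system coredump network set --enable true
    fi
else
    echo "not an IPv4 address: $server" >&2
fi
```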

Test by triggering a core dump (this crashes the host):

    vsish -e set /reliability/crashMe/Panic

References:
 * Configure ESXi Dump Collector with ESXCLI

---

Disable coredump: esxcli system coredump network set --enable false

---

Unconfigured:

    # esxcli system coredump network get
    Enabled: false
    Host VNic:
    Network Server IP:
    Network Server Port: 0

Configured:

    # esxcli system coredump network get
    Enabled: true
    Host VNic: vmk0
    Network Server IP: 10.50.47.97
    Network Server Port: 6500
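
To tell the two states apart from a script, the output can be parsed for the Enabled flag and the server address. A minimal sketch, assuming the "Key: value" layout shown above (netdump_target is a hypothetical helper name):

```shell
#!/bin/sh
# Read `esxcli system coredump network get` output on stdin.
# Print "IP:port" when the collector is enabled and a server IP is set;
# exit non-zero otherwise.
netdump_target() {
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1); sub(/[ \t\r]+$/, "", $2) }
        $1 == "Enabled"             { enabled = ($2 == "true") }
        $1 == "Network Server IP"   { ip = $2 }
        $1 == "Network Server Port" { port = $2 }
        END {
            if (!enabled || ip == "") exit 1
            print ip ":" port
        }
    '
}

if command -v esxcli >/dev/null 2>&1; then
    esxcli system coredump network get | netdump_target \
        || echo "network coredump is not configured" >&2
fi
```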

---

Check Network Dump Server Service:

Check services under VCSA management web interface: https://VCSA:5480/

To check if the NetDumper service is running in VCSA:

    /etc/init.d/vmware-netdumper status

Check Network Dump Client and its ability to talk to the server (ESXi 5.1+):

    esxcli system coredump network check

---

VMware KB: Troubleshooting the ESXi Dump Collector service in vSphere 5.0 - http://kb.vmware.com/kb/2003042

Send test traffic from the ESXi host to the Dump Collector service at the IP address and port from step 5 of the KB using the command:

Example from KB:

    # from an ESXi server...
    nc -z -w 1 -s VMkernelIPAddress -u DumpCollectorIPAddress DumpCollectorPortNumber
    # example:
    nc -z -s 10.55.66.77 -u 10.11.12.13 6500

FIO Example:

    virt-01# nc -z -s 10.50.48.38 -u 10.50.47.97 6500
    Connection to 10.50.47.97 6500 port [udp/*] succeeded!

    virt-01:/ # nc -z -u 10.50.47.97 6500
    Connection to 10.50.47.97 6500 port [udp/*] succeeded!

Note: The nc command reports a successful connection regardless of whether the remote Netdump Server receives the traffic.

Review the logs from the receiving Dump Collector service for messages indicating that the connection was established.

For example, the vCenter 5.0 Dump Collector logs report the unknown client connection with a message similar to:

    yyyy-mm-ddTHH:MM:SS.nnnZ| netdumper| Bad magic:0xa656761. Expected:0xadeca1bf
    yyyy-mm-ddTHH:MM:SS.nnnZ| netdumper| Skipping bad packet.
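
Since nc reports success either way, the collector log is the only confirmation that the test packets arrived. A sketch that counts the "Bad magic" entries follows. count_bad_magic is a hypothetical helper, and the log path is the VCSA path used further down this page; other installs may keep the log elsewhere.

```shell
#!/bin/sh
# Count "Bad magic" entries in netdumper log text read from stdin.
# A count that grows after an nc test suggests the packets reached the collector.
count_bad_magic() {
    grep -c 'netdumper| Bad magic' || :
}

log=/var/log/vmware/netdumper/netdumper.log
if [ -r "$log" ]; then
    echo "test packets seen: $(count_bad_magic < "$log")"
fi
```

Running it before and after the nc test and comparing the two counts avoids being misled by entries from earlier tests.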

---

Troubleshooting "Couldn't attach to dump server"

Error:

    Starting network coredump from HostIP to DumpCollectorIP.
    Netdump: FAILED: Couldn't attach to dump server at IP DumpCollectorIP.
    Stopping Netdump.
    Dump: nnn: APR timed out for IP DumpCollectorIP.

VMware KB: Troubleshooting the ESXi Dump Collector service in VMware vSphere 5.x - http://kb.vmware.com/kb/2003042

---

References:
 * VMware: VMware ESXi Chronicles: Setting up the ESXi 5.0 Dump Collector - http://blogs.vmware.com/esxi/2011/07/setting-up-the-esxi-50-dump-collector.html

---

If you move the storage:

    mkdir /var/log/vmware/netdumper
    chown netdumper:netdumper /var/log/vmware/netdumper

    cat /etc/fstab

    vi /etc/sysconfig/netdumper
      NETDUMPER_DIR="/storage/nfs.core.QzJe1rjk/vc/core/netdumps"

    chown netdumper:netdumper /storage/nfs.core.QzJe1rjk/vc/core/netdumps

    /etc/init.d/vmware-netdumper status
    /etc/init.d/vmware-netdumper start

    cat /var/log/vmware/netdumper/netdumper.log

Trigger PSOD
Crash system:

    vsish -e set /reliability/crashMe/Panic

Interactively:

    vsish
    cd /reliability/crashMe/
    set /reliability/crashMe/Panic

PSOD timeout:

    /sbin/vsish -e set /config/Misc/intOpts/BlueScreenTimeout 10

How IOVP calls it:

    sleep 5;/sbin/vsish -e set /reliability/crashMe/Panic 1

KB Articles
VMware KB: Manually regenerating core dump files in VMware ESX and ESXi - http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1002769
 * This article provides instructions to extract a core dump file from the VMKCore partition following a purple screen error.

VMware KB: Collecting diagnostic information from an ESX or ESXi host that experiences a purple diagnostic screen - http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004128
 * This article provides instruction for collecting support diagnostic information when troubleshooting a purple screen fault in VMware ESX or ESXi.

VMware KB: Configuring an ESX/ESXi host to capture a VMkernel coredump from a purple diagnostic screen - http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1000328
 * This article provides an overview of configuring VMware ESX/ESXi with a location for storing diagnostic information during a purple diagnostic screen and host failure.

VMware KB: Configuring an ESXi 5.0 host to capture a VMkernel coredump from a purple diagnostic screen to a diagnostic partition - http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2004299
 * This article provides steps for adding a VMKcore diagnostic partition on a local or shared disk post-installation using the esxcli command line utility. A diagnostic partition can also be created using the vSphere Client.


 * several commands for ESXi 5

VMware KB: Configuring an ESX/ESXi 3.0-4.1 host to capture a VMkernel coredump from a purple diagnostic screen to a diagnostic partition - http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2004297
 * This article provides steps for adding a VMKcore diagnostic partition on a local or shared disk post-installation.

    esxcfg-dumppart --list
    esxcfg-dumppart --set ""
    esxcfg-dumppart --set "mpx.vmhba2:C0:T0:L0:7"
    esxcfg-dumppart --smart-activate
    esxcfg-dumppart --get-active

VMware KB: Interpreting an ESX host purple diagnostic screen - http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1004250
 * This article provides information to decode ESX host purple screen errors.

VMware KB: Extracting the log file after an ESX or ESXi host fails with a purple screen error - http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1006796
 * This article provides steps to extract a log from a vmkernel-zdump file after a purple diagnostic screen error. This log contains similar information to that seen on the purple diagnostic screen and can be used in further troubleshooting.

References:
 * "purple diagnostic (PSOD) screen"

Out of space
Error: DiskDump: Partial Dump: Out of space o=0x63ff200 I

Cause:
 * "This issue occurs because the default slot size for the core dump partition cannot accommodate a complete core dump of a host that is using large amounts of memory."

Solution:
 * Select another partition for core dumps
 * Use ESXi 5.0 Dump Collector (generally preferred)

References:
 * VMware KB: ESXi hosts with more than 128 GB of physical memory fail to generate valid core dumps - http://kb.vmware.com/kb/2012362

keywords
vmware psod core dump kernel dump coredump capture-vmkernel-coredump-psod