ESX/PSOD

PSOD

A PSOD (Purple Screen of Death) is the VMware ESX equivalent of a Windows BSOD (Blue Screen of Death). It occurs when the kernel panics and can no longer function. The most common causes of a PSOD are:

  • Hardware failure
  • Out of memory
  • Hung CPU conditions
  • Misbehaving drivers (null pointers, invalid memory access, etc.)
  • NMIs (Non-Maskable Interrupts)

When a PSOD occurs, one should collect the following:

  • Screenshot of PSOD kernel stack trace screen (if possible)
  • Support logs from the vm-support command
  • Kernel log (should be included in vm-support, but better safe than sorry)
  • Kernel core dump (only needed if a developer asks for it)

If the cause of the PSOD isn't obvious from the PSOD kernel stack trace screen, then the kernel log is the second best place to look for the cause of a kernel panic.

Core Dump Extract

To manually collect the kernel log:

Quick Log and Dump collection:

# Outputs vmkernel-log.1 and vmkernel-zdump.1
# ESXi 5.x puts the kernel dump in /scratch/core/vmkernel-zdump.*
# esxcfg-dumppart -C -D /vmfs/devices/disks/$( esxcfg-dumppart --get-active | awk '{print $1}' )
esxcfg-dumppart -L /vmfs/devices/disks/$( esxcfg-dumppart --get-active | awk '{print $1}' )
esxcfg-dumppart -C -D /vmfs/devices/disks/$( esxcfg-dumppart --get-active | awk '{print $1}' ) -z `pwd`/vmkernel-zdump

To extract the kernel log (vmkernel-log.1) from an existing kernel dump (vm-support:/var/core/vmkernel-zdump.1):

esxcfg-dumppart -L vmkernel-zdump.1

The core dumps are also collected as part of the vm-support tool collection:

vm-support

vmkernel dump version mismatch

If the esxcfg-dumppart version doesn't match the vmkernel dump:

Error running command. Unable to extract log. Error: vmkernel dump version mismatch!
Expected version: 196648, this dump file: 196647

vmkernel dump versions:

131106 - VMware ESXi 5.0.0 GA
131106 - VMware ESXi 5.0.0 Update 1
131106 - VMware ESXi 5.0.0 Update 2
131106 - VMware ESXi 5.0.0 Update 3
196647 - VMware ESXi 5.1.0 GA
196647 - VMware ESXi 5.1.0 Update 1
196648 - VMware ESXi 5.1.0 Update 2
262193 - VMware ESXi 5.5.0 GA
262194 - VMware ESXi 5.5.0 Update 1
262194 - VMware ESXi 5.5.0 Update 2
52 - VMware ESXi 6.0.0 GA (rc, may actually change with actual release)
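The table above can be turned into a small lookup, which makes a mismatch error easier to act on. This is a minimal sketch that uses only the 5.x version numbers listed above (the 6.0.0 entry is left out because it is marked as a release candidate); the function name dump_version_release is made up for illustration.

```shell
# Map a vmkernel dump version number to the ESXi release(s) listed for it
# in the table above. Anything not in the table is reported as unknown.
dump_version_release() {
    case "$1" in
        131106) echo "ESXi 5.0.0 GA through Update 3" ;;
        196647) echo "ESXi 5.1.0 GA or Update 1" ;;
        196648) echo "ESXi 5.1.0 Update 2" ;;
        262193) echo "ESXi 5.5.0 GA" ;;
        262194) echo "ESXi 5.5.0 Update 1 or Update 2" ;;
        *)      echo "unknown" ;;
    esac
}

# The error above expected 196648 but the dump file was 196647.
dump_version_release 196647
dump_version_release 196648
```

For the error shown earlier, this suggests the dump was written by 5.1.0 GA or Update 1 while the tool came from 5.1.0 Update 2, so extracting on a host with a matching release is probably the way forward.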


Commands

List core dump partitions:

esxcfg-dumppart --list
esxcfg-dumppart --get-config

List active core dump partitions:

esxcfg-dumppart --get-active

Quick Log and Dump extract:

# output: vmkernel-log.1 and vmkernel-zdump.1
esxcfg-dumppart -L $( esxcfg-dumppart --get-active | awk '{print $2}' )
esxcfg-dumppart -C -D /vmfs/devices/disks/$( esxcfg-dumppart --get-active | awk '{print $1}' )
# ESXi sometimes dumps to /scratch/core/vmkernel-zdump.1
mv /scratch/core/vmkernel-zdump.1 .
# or
esxcfg-dumppart -C -D /vmfs/devices/disks/$( esxcfg-dumppart --get-active | awk '{print $1}' ) -z /scratch/dump_out

Extract Log File from PSOD

Get core dump partition:

esxcfg-dumppart --list
esxcfg-dumppart --get-active  # second column

Extract kernel log:

esxcfg-dumppart -L [CORE_DUMP_PARTITION]
esxcfg-dumppart -L /dev/sda2    # esx
esxcfg-dumppart -L /vmfs/devices/disks/naa.60024e8073ba3100138d088b03c89bbf:7    # esxi

Tricky automatic kernel log extract:

esxcfg-dumppart -L $( esxcfg-dumppart --get-active | awk '{print $2}' )
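The one-liner above relies on the second column of --get-active being the full device path. A slightly more defensive sketch of the same idea follows; the function name and the sample line in the usage comment are made up, and the column layout is assumed to match the "second column" note above.

```shell
# Print the second whitespace-separated column of the first stdin line whose
# second column looks like an absolute path. Exits non-zero when no such
# line exists (for example, when no dump partition is active).
active_dump_path() {
    awk 'NF >= 2 && $2 ~ /^\// { print $2; found = 1; exit }
         END { exit !found }'
}

# Usage on a host (not run here):
#   dev=$(esxcfg-dumppart --get-active | active_dump_path) \
#       && esxcfg-dumppart -L "$dev"
#
# Offline check with a made-up sample line:
printf '%s\n' 'naa.xxxxx:7   /vmfs/devices/disks/naa.xxxxx:7' | active_dump_path
```

The non-zero exit status keeps esxcfg-dumppart -L from being run with an empty argument.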

VMware KB: Extracting the log file after an ESX or ESXi host fails with a purple screen error - http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1006796

This article provides steps to extract a log from a vmkernel-zdump file after a purple diagnostic screen error. This log contains similar information to that seen on the purple diagnostic screen and can be used in further troubleshooting.

Extract the log file from a vmkernel-zdump file using a command line utility on the ESX or ESXi host. This utility differs for different versions of ESX or ESXi.

For ESX 3.0 and 3.5, use the vmkdump utility:

# vmkdump -l <vmkernel-zdump-filename>

For ESXi 3.5, ESX and ESXi 4.x, use the esxcfg-dumppart utility:

# esxcfg-dumppart -L <vmkernel-zdump-filename>

To extract the log file from a vmkernel-zdump file:

Find the vmkernel-zdump file in the /root/ or /var/core/ directory:

# ls /root/vmkernel* /var/core/vmkernel*
/var/core/vmkernel-zdump-073108.09.16.1

Use the vmkdump or esxcfg-dumppart utility to extract the log. For example:

# vmkdump -l /var/core/vmkernel-zdump-073108.09.16.1
created file vmkernel-log.1
# esxcfg-dumppart -L /var/core/vmkernel-zdump-073108.09.16.1
created file vmkernel-log.1

The vmkernel-log.1 file is plain text, though it may start with null characters. Focus on the end of the log, which looks similar to:

VMware ESX Server [Releasebuild-98103]
PCPU 1 locked up. Failed to ack TLB invalidate.
frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c
es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff
eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff
...

Note: The file name created for the log in this example is vmkernel-log.1. If another file with the same name already exists, the new file is created with the number suffix incremented.
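Because the suffix increments, the most recent extract is the file with the highest number. The following is a minimal sketch for finding it and viewing the end of the log with null characters stripped; both function names are made up, and it assumes the logs are named vmkernel-log.N in the given directory.

```shell
# Print the vmkernel-log.N file with the highest numeric suffix in a directory
# (default: current directory). Prints nothing if there is none.
latest_vmkernel_log() {
    dir=${1:-.}
    best=""; best_n=-1
    for f in "$dir"/vmkernel-log.*; do
        [ -f "$f" ] || continue
        n=${f##*.}
        case "$n" in ''|*[!0-9]*) continue ;; esac   # skip non-numeric suffixes
        if [ "$n" -gt "$best_n" ]; then best=$f; best_n=$n; fi
    done
    [ -n "$best" ] && echo "$best"
}

# Show the last lines of a log (default 20), dropping NUL bytes first.
log_tail() {
    tr -d '\000' < "$1" | tail -n "${2:-20}"
}

# Usage (not run here):
#   log_tail "$(latest_vmkernel_log /var/core)" 40
```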

Copy Core Dump

Use -C or --copy:

esxcfg-dumppart -C
esxcfg-dumppart --copy --devname /vmfs/devices/disks/naa.xxxxx:x \
              --newonly --zdumpname esxdump   # copies only a new zdump

Tricky automatic core dump:

esxcfg-dumppart -C -D /vmfs/devices/disks/$( esxcfg-dumppart --get-active | awk '{print $1}' )

Deactivate Core Dump

Deactivate the active partition:

esxcfg-dumppart --deactivate

Reactivate the configured partition:

esxcfg-dumppart --activate

Wipe dump partition: (must be deactivated)

esxcfg-dumppart --deactivate
#  dd if=/dev/zero of=$( esxcfg-dumppart --get-config | awk '{print $2}' ) conv=notrunc  # takes longer needlessly
dd if=/dev/zero of=$( esxcfg-dumppart --get-config | awk '{print $2}' ) count=512 conv=notrunc
esxcfg-dumppart --activate
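With dd's default 512-byte block size, count=512 overwrites only the first 256 KiB, and conv=notrunc leaves the rest of the target in place. The sketch below demonstrates that on a scratch file rather than a real partition; the function name and the scratch file are made up for illustration.

```shell
# Zero the first 512 blocks (512 bytes each, 262144 bytes in total) of a
# target without truncating it. Same dd options as the wipe command above.
zero_head() {
    dd if=/dev/zero of="$1" count=512 conv=notrunc 2>/dev/null
}

# Demonstration on a 1 MiB scratch file filled with 0xFF bytes.
tmp=$(mktemp)
dd if=/dev/zero bs=1024 count=1024 2>/dev/null | tr '\000' '\377' > "$tmp"
zero_head "$tmp"
wc -c < "$tmp"                                  # file size is unchanged
head -c 262144 "$tmp" | tr -d '\000' | wc -c    # non-zero bytes left in the head
rm -f "$tmp"
```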

Set and activate a partition:

# esxcfg-dumppart --set [DEVICE]:[PARTITION]
# OR # esxcfg-dumppart --set /vmfs/devices/disks/[DEVICE]:[PARTITION]
esxcfg-dumppart --set $( esxcfg-dumppart --get-config | awk '{print $1}' )

Dump Partition Example

In this example, partition 7 is the dump partition (VMware ESXi 5.0.0 build-623860).

For kicks, each partition line in the partedUtil get output has the format:

# partNum startSector endSector type attr

Get Partitions: (interestingly, the type for partition 7 doesn't indicate that it is a dump partition here)

# partedUtil get $( esxcfg-dumppart --get-config | awk '{print $2}' | sed 's/:[0-9]*$//' )
36468 255 63 585871964
1 64 8191 0 128
5 8224 520191 0 0
6 520224 1032191 0 0
7 1032224 1257471 0 0
8 1257504 1843199 0 0
2 1843200 10229759 0 0
3 10229760 585871930 0 0

Get Partitions with named types and ugly-type-id:

# partedUtil getptbl $( esxcfg-dumppart --get-config | awk '{print $2}' | sed 's/:[0-9]*$//' )
gpt
36468 255 63 585871964
1 64 8191 C12A7328F81F11D2BA4B00A0C93EC93B systemPartition 128      # root (4MB)
5 8224 520191 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0        # /bootbank (260MB)
6 520224 1032191 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0     # /altbootbank (260MB)
7 1032224 1257471 9D27538040AD11DBBF97000C2911D1B8 vmkDiagnostic 0  # core dump partition (115MB)
8 1257504 1843199 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0    # /store (300MB)
2 1843200 10229759 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0   # /scratch (4.2GB)
3 10229760 585871930 AA31E02A400F11DB9590000C2911D1B8 vmfs 0        # datastore1 (remaining space)
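The sizes in the comments can be checked from the start and end sectors. This is a minimal sketch with made-up function names. It assumes 512-byte sectors, which is not read from the disk but is consistent with the df figures for this host.

```shell
# Size in bytes of a partition given its first and last sector (inclusive).
# The 512-byte sector size is an assumption.
part_size_bytes() {
    echo $(( ($2 - $1 + 1) * 512 ))
}

# Print "partNum sizeMB type" for each partition line of getptbl-style
# output on stdin. The "gpt" line and the geometry line are skipped because
# they have fewer than six fields.
ptbl_sizes() {
    awk 'NF >= 6 && $1 ~ /^[0-9]+$/ {
        printf "%s %d %s\n", $1, ($3 - $2 + 1) * 512 / 1000000, $5
    }'
}

part_size_bytes 1032224 1257471    # partition 7: prints 115326976

# Usage on a host (not run here):
#   partedUtil getptbl /vmfs/devices/disks/naa.xxxxx | ptbl_sizes
```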

Get usable first and last sectors:

# partedUtil getUsableSectors $( esxcfg-dumppart --get-config | awk '{print $2}' | sed 's/:[0-9]*$//' )
34 585871930

Also interesting:

 # df
Filesystem         Bytes          Used    Available Use% Mounted on
VMFS-5      294473695232    1326448640 293147246592   0% /vmfs/volumes/datastore1
vfat          4293591040      98828288   4194762752   2% /vmfs/volumes/4fac472e-d1ddf2c4-a597-6431504f5534 (/scratch)
vfat           261853184     134205440    127647744  51% /vmfs/volumes/95ec6872-9e0d6b2c-3537-b1c307ab1cf4 (/bootbank)
vfat           261853184     147767296    114085888  56% /vmfs/volumes/b9d1ea75-466c705d-94db-eecb9f72749b (/altbootbank)
vfat           299712512     188481536    111230976  63% /vmfs/volumes/4fac4727-706fb5d8-d1cf-6431504f5534 (/store)
  • /vmfs/volumes/datastore1 matches partition 3 (off by ~250MB)
  • /vmfs/volumes/4fac472e-d1ddf2c4-a597-6431504f5534 (/scratch) matches partition 2 (off by ~300KB)
  • /vmfs/volumes/95ec6872-9e0d6b2c-3537-b1c307ab1cf4 (/bootbank) matches partition 5 or 6 (off by ~300KB)
  • /vmfs/volumes/b9d1ea75-466c705d-94db-eecb9f72749b (/altbootbank) matches partition 5 or 6 (off by ~300KB)
  • /vmfs/volumes/4fac4727-706fb5d8-d1cf-6431504f5534 (/store) matches partition 8 (off by ~160KB)

ESXi 5.0 Dump Collector

UDP port 6500

ESXi Network Dump Collector in VMware vSphere 5.0 - http://kb.vmware.com/kb/1032051

"The netdump protocol is used for sending coredumps from a failed ESXi host to the Dump Collector service. This service only supports IPv4. By default, this service listens on UDP port 6500. The network traffic is not encrypted, and there is no authentication or authorization mechanism to ensure the integrity or validity of any data received by the Dump Collector service. It is recommended that the VMkernel network used for network coredump collection be physically or logically segmented (such as a separate LAN/VLAN) to ensure that the traffic is not intercepted." [1]

---

Enable ESXi 5.x Dump Collector: (example for esxlogger)

# NOTE: Have to specify IP address!
# esxcli system coredump network set --interface-name vmk0 --server-ipv4 10.50.47.100 --server-port 6500
esxcli system coredump network set --interface-name vmk0 --server-ipv4 10.50.47.97 --server-port 6500
esxcli system coredump network set --enable true
auto-backup.sh

# (Optional) Check that ESXi Dump Collector is configured correctly:
esxcli system coredump network get

Test by triggering a core dump (this deliberately crashes the host):

vsish -e set /reliability/crashMe/Panic

---

Disable coredump:

esxcli system coredump network set --enable false

---

Unconfigured:

# esxcli system coredump network get
   Enabled: false
   Host VNic:
   Network Server IP:
   Network Server Port: 0

Configured:

# esxcli system coredump network get
   Enabled: true
   Host VNic: vmk0
   Network Server IP: 10.50.47.97
   Network Server Port: 6500
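The two outputs above differ in the Enabled, Host VNic and Network Server IP fields, so a script can tell them apart. Below is a minimal sketch that reads the get output on stdin; the function name is made up, and the field labels are taken from the sample output above.

```shell
# Succeed only if the input reports Enabled: true together with a non-empty
# Host VNic and Network Server IP.
netdump_configured() {
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1); sub(/[ \t]+$/, "", $2) }
        $1 == "Enabled"           { enabled = $2 }
        $1 == "Host VNic"         { vnic = $2 }
        $1 == "Network Server IP" { ip = $2 }
        END { exit !(enabled == "true" && vnic != "" && ip != "") }
    '
}

# Usage on a host (not run here):
#   esxcli system coredump network get | netdump_configured \
#       && echo "network coredump looks configured"
```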

---

Check Network Dump Server Service:

Check services under VCSA management web interface:

https://VCSA:5480/

To check if the NetDumper service is running in VCSA: [2]

/etc/init.d/vmware-netdumper status

Check the network dump client and its ability to talk to the server (ESXi 5.1+):

esxcli system coredump network check

---

VMware KB: Troubleshooting the ESXi Dump Collector service in vSphere 5.0 - http://kb.vmware.com/kb/2003042

Send test traffic from the ESXi host to the Dump Collector service, using the IP address and port configured for the collector (step 5 in the KB):

Example from KB:

# from an ESXi server...
nc -z -w 1 -s VMkernelIPAddress -u DumpCollectorIPAddress DumpCollectorPortNumber
# example:
nc -z -s 10.55.66.77 -u 10.11.12.13 6500

FIO Example:

virt-01# nc -z -s 10.50.48.38 -u 10.50.47.97 6500
Connection to 10.50.47.97 6500 port [udp/*] succeeded!
virt-01:/ # nc -z -u 10.50.47.97 6500
Connection to 10.50.47.97 6500 port [udp/*] succeeded!

Note: The nc command reports a successful connection regardless of whether the remote Netdump Server receives the traffic.

Review the logs from the receiving Dump Collector service for messages indicating that the connection was established.

For example, the vCenter 5.0 Dump Collector logs report the unknown client connection with a message similar to:

yyyy-mm-ddTHH:MM:SS.nnnZ| netdumper| Bad magic:0xa656761. Expected:0xadeca1bf
yyyy-mm-ddTHH:MM:SS.nnnZ| netdumper| Skipping bad packet.
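Since nc cannot confirm delivery, the collector log is the place to verify that the test packet arrived. The sketch below counts "Bad magic" entries in a log file. The function name is made up, and the path in the usage comment is the VCSA log location used elsewhere on this page.

```shell
# Count netdumper log lines that report a packet with a bad magic number,
# which is what an nc test packet produces according to the sample above.
# -F treats the pattern as a fixed string, so the "|" is matched literally.
count_bad_magic() {
    grep -F -c 'netdumper| Bad magic:' "$1"
}

# Usage on the collector (not run here):
#   count_bad_magic /var/log/vmware/netdumper/netdumper.log
```

Comparing the count before and after running nc gives a rough indication of whether the packet reached the service.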

---

Troubleshooting "Couldn't attach to dump server"

Error:

Starting network coredump from HostIP to DumpCollectorIP.
Netdump: FAILED: Couldn't attach to dump server at IP DumpCollectorIP.
Stopping Netdump.

Dump: nnn: ARP timed out for IP DumpCollectorIP.

VMware KB: Troubleshooting the ESXi Dump Collector service in VMware vSphere 5.x - http://kb.vmware.com/kb/2003042

---

If you move the storage:

mkdir /var/log/vmware/netdumper
chown netdumper:netdumper /var/log/vmware/netdumper

cat /etc/fstab

vi /etc/sysconfig/netdumper
   NETDUMPER_DIR="/storage/nfs.core.QzJe1rjk/vc/core/netdumps"

chown netdumper:netdumper /storage/nfs.core.QzJe1rjk/vc/core/netdumps

/etc/init.d/vmware-netdumper status
/etc/init.d/vmware-netdumper start

cat /var/log/vmware/netdumper/netdumper.log

Trigger PSOD

Crash system:

vsish -e set /reliability/crashMe/Panic

Interactively:

vsish
  cd /reliability/crashMe/
  set /reliability/crashMe/Panic

PSOD timeout:

 /sbin/vsish -e set /config/Misc/intOpts/BlueScreenTimeout 10

How IOVP calls it:

 sleep 5;/sbin/vsish -e set /reliability/crashMe/Panic 1


KB Articles

VMware KB: Manually regenerating core dump files in VMware ESX and ESXi - http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1002769

This article provides instructions to extract a core dump file from the VMKCore partition following a purple screen error.

VMware KB: Collecting diagnostic information from an ESX or ESXi host that experiences a purple diagnostic screen - http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004128

This article provides instruction for collecting support diagnostic information when troubleshooting a purple screen fault in VMware ESX or ESXi.

VMware KB: Configuring an ESX/ESXi host to capture a VMkernel coredump from a purple diagnostic screen - http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1000328

This article provides an overview of configuring VMware ESX/ESXi with a location for storing diagnostic information during a purple diagnostic screen and host failure.

VMware KB: Configuring an ESXi 5.0 host to capture a VMkernel coredump from a purple diagnostic screen to a diagnostic partition - http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2004299

This article provides steps for adding a VMKcore diagnostic partition on a local or shared disk post-installation using the esxcli command line utility. A diagnostic partition can also be created using the vSphere Client.
  • several commands for ESXi 5

VMware KB: Configuring an ESX/ESXi 3.0-4.1 host to capture a VMkernel coredump from a purple diagnostic screen to a diagnostic partition - http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2004297

This article provides steps for adding a VMKcore diagnostic partition on a local or shared disk post-installation.

esxcfg-dumppart --list
esxcfg-dumppart --set "<VM Kernel Name>"
esxcfg-dumppart --set "mpx.vmhba2:C0:T0:L0:7"
esxcfg-dumppart --smart-activate
esxcfg-dumppart --get-active

VMware KB: Interpreting an ESX host purple diagnostic screen - http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1004250

This article provides information to decode ESX host purple screen errors.

VMware KB: Extracting the log file after an ESX or ESXi host fails with a purple screen error - http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1006796

This article provides steps to extract a log from a vmkernel-zdump file after a purple diagnostic screen error. This log contains similar information to that seen on the purple diagnostic screen and can be used in further troubleshooting.

References:

  • "purple diagnostic (PSOD) screen" [3]

Issues

Out of space

Error:

DiskDump: Partial Dump: Out of space o=0x63ff200 I

Cause:

  • "This issue occurs because the default slot size for the core dump partition cannot accommodate a complete core dump of a host that is using large amounts of memory."

Solution:

  • Select another partition for core dumps
  • Use ESXi 5.0 Dump Collector (generally preferred)
