Decoding the PSOD (Purple Screen of Death) of ESXi to find the initial cause of a crash

What is the VMkernel?

The VMkernel is the operating system core of ESX/ESXi. The kernel handles resource scheduling and device I/O. Device I/O is handled by the VMware network and storage stacks, which serve as a layer between the virtual file system, network devices, and the device drivers that control physical devices.

Interpreting the purple diagnostic screen:

  • The first line shows the ESXi version along with the build number. Both are needed to identify the cause correctly.
  • The second line shows the error message. In this case the error message is:

CPU 52 / World 68776 tried to re-acquire lock

  • Next come the CPU registers: the values held in the physical CPU when the error occurred.
  • The PCPU entry (in ESXi 6.5) and the *<n> marker, where n = 0, 1, 2, ..., identify the physical CPU of the ESXi host that was running the instruction when the error occurred.
  • VMK Uptime shows the uptime of the host. It can appear at the start or at the end of the trace. It indicates how long the server has been running since the last boot. In this example, the ESX host had been running for 7 days, 15 hours, 54 minutes and 58.089 seconds. In older ESXi versions, the TSC value is the number of CPU clock cycles that have elapsed since the server was started.
  • Error stack trace: the stack represents what the VMkernel was doing at the time of the error.

The lines beginning with 0x________ make up the error stack trace.

  • The Core Dump:

coredump to disk. slot 1 of 1 on device

This section of the purple diagnostic screen indicates that the contents of the VMkernel memory are being copied to the vmkcore partition.
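When the screen text has been transcribed from a photo or a remote console capture, the error line and the uptime can be pulled out with a short script. The following is a minimal Python sketch, not a VMware tool. The function names are hypothetical, and the uptime layout `VMK uptime: D:HH:MM:SS.mmm` is an assumption about how the field is printed; adjust the patterns if your build formats these lines differently.

```python
import re
from datetime import timedelta

# Assumed layouts of the two fields; real screens may differ between ESXi builds.
ERROR_RE = re.compile(r"CPU (\d+) / World (\d+) (.+)")
UPTIME_RE = re.compile(r"VMK uptime:\s*(\d+):(\d+):(\d+):(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_error_line(line):
    """Split an error line such as 'CPU 52 / World 68776 tried to ...' into its parts."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match.group(1)),
        "world": int(match.group(2)),
        "message": match.group(3).strip(),
    }


def parse_uptime(line):
    """Convert an uptime field (days:hours:minutes:seconds) into a timedelta."""
    match = UPTIME_RE.search(line)
    if match is None:
        return None
    days, hours, minutes = (int(match.group(i)) for i in (1, 2, 3))
    seconds = float(match.group(4))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


if __name__ == "__main__":
    print(parse_error_line("CPU 52 / World 68776 tried to re-acquire lock"))
    print(parse_uptime("VMK uptime: 7:15:54:58.089"))
```

Both functions return None when the line does not match, so a caller can tell "field not found" apart from a parsed value.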

Using the error message of the purple diagnostic screen to troubleshoot a VMkernel error:

Most common error messages and their causes:

Type: Console Oops
Example Error: COS Error: Oops
Description: An ESX host fails and causes a purple screen when there is a Service Console oops. Unlike most purple screen errors, it is not triggered by the VMkernel. Instead the error is triggered by the Service Console and occurs at the Linux level. These purple screen errors contain additional information from the Linux kernel. For more information about Console Oops, see Understanding an “Oops” purple diagnostic screen (1006802)

Type: Lost Heartbeat
Example Error: Lost Heartbeat
Description: The ESX VMkernel and the Service Console Linux kernel run at the same time on ESX. The Service Console Linux kernel runs a process called vmnixhbd, which heartbeats the VMkernel as long as it is able to allocate and free a page of memory. If no heartbeats are received before a timeout period of 30 minutes, the VMkernel triggers a COS Panic and a purple diagnostic screen that mentions a Lost Heartbeat. For more information on Lost Heartbeats, see Understanding a “Lost Heartbeat” purple diagnostic screen (1009525).

Type: Assert
Example Error: ASSERT bora/vmkernel/main/pframe_int.h:527 
Description: Assert errors are software errors, because they are related to assumptions on which the program is based. This type of purple screen error is primarily caused by software issues. For more information on the assert error message, see Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956)

Type: Not Implemented
Example Error: NOT_IMPLEMENTED /build/mts/release/bora-84374/bora/vmkernel/main/util.c:83
Description: A not implemented error message occurs when the code encounters a situation that it was not designed to handle. For more information, see Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956).

Type: Spin count exceeded / Possible deadlock
Example Error: Spin count exceeded (iplLock) – possible deadlock
Description: A VMware ESX host may report a Spin count exceeded and possible deadlock in a purple diagnostic screen when a thread is attempting to execute a critical section of code. To enter the critical section, the thread must first obtain a lock, which it does by polling a mutex in a spinlock operation. The thread keeps polling the mutex during the spinlock operation, but there is a limit on how many times it polls; the error name indicates that this limit was exceeded. For more information on Spin count exceeded errors, see Understanding a “Spin count exceeded” purple diagnostic screen (1020105)

Type: Failed to ack TLB invalidate
Example Error: PCPU 1 locked up. Failed to ack TLB invalidate.
Description: Physical CPUs fail when trying to clear memory page tables. For more information, see Understanding a Failed to ack TLB invalidate purple diagnostic screen (1020214).
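
The polling limit described under Spin count exceeded can be illustrated with a small user-space analogy in Python. This is only a sketch of the idea of a bounded spin, not VMkernel code; the function name, the exception class and the `max_spins` parameter are hypothetical.

```python
import threading


class SpinCountExceeded(RuntimeError):
    """Raised when the lock could not be obtained within the allowed polls."""


def acquire_with_spin_limit(lock, max_spins):
    """Poll a lock without blocking, giving up after max_spins attempts."""
    for _ in range(max_spins):
        # acquire(blocking=False) returns immediately: True if the lock was taken.
        if lock.acquire(blocking=False):
            return True
    raise SpinCountExceeded("Spin count exceeded - possible deadlock")


if __name__ == "__main__":
    lock = threading.Lock()
    acquire_with_spin_limit(lock, 1000)      # succeeds on the first poll
    try:
        acquire_with_spin_limit(lock, 1000)  # lock is already held, never released
    except SpinCountExceeded as exc:
        print(exc)
```

The second call fails because the same thread tries to take a non-reentrant lock it already holds, so no number of polls can succeed.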

The VMkernel error message generated by the purple screen can be used to identify the cause of the issue. The number of error messages that can be produced is finite. The list above covers the known VMkernel error messages.
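
Because the set of messages is finite, matching a message against the types listed above can be scripted. The sketch below is one possible approach in Python; the patterns are assumptions derived only from the example strings in this post, and the `classify` function is hypothetical.

```python
import re

# (type name, pattern) pairs; patterns are assumptions based on the examples above.
KNOWN_ERRORS = [
    ("Console Oops", re.compile(r"COS Error: Oops")),
    ("Lost Heartbeat", re.compile(r"Lost Heartbeat")),
    ("Assert", re.compile(r"\bASSERT\b")),
    ("Not Implemented", re.compile(r"\bNOT_IMPLEMENTED\b")),
    ("Spin count exceeded / Possible deadlock", re.compile(r"Spin count exceeded")),
    ("Failed to ack TLB invalidate", re.compile(r"Failed to ack TLB invalidate")),
]


def classify(message):
    """Return the first matching error type, or None if the message is not listed."""
    text = message.strip()
    for name, pattern in KNOWN_ERRORS:
        if pattern.search(text):
            return name
    return None


if __name__ == "__main__":
    for msg in (
        "PCPU 1 locked up. Failed to ack TLB invalidate.",
        "CPU 52 / World 68776 tried to re-acquire lock",
    ):
        print(msg, "->", classify(msg))
```

A result of None means the message is not one of the listed types and needs an online search.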

If the error is not in this list, search for it online together with the ESXi version; the results will usually point to the root cause.

In this example, the crash was caused by a known issue in the hp-ilo driver 650.10.0.1-24.

Reference VMware Article: https://kb.vmware.com/s/article/2148123

Happy Troubleshooting.
