First published on MSDN on Nov 13, 2013
In the System event log you may find an event similar to the following:
Event ID: 1001
Description: The computer has rebooted from a bugcheck. The bugcheck was: 0x0000009e (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP.
Let’s start by discussing what a STOP 0x9e is. Failover Clustering actively monitors the health of many components at different layers of a server. One attribute of a highly available system is having health detection mechanisms in place to detect when something goes wrong and to react. Under some conditions, when an extreme failure occurs, the cluster service may intentionally bugcheck the server in an attempt to recover. The bugcheck will be a USER_MODE_HEALTH_MONITOR (9e), invoked by the Failover Cluster kernel mode driver NetFT.sys.
The first and most important thing to understand is that this is normal cluster health detection and recovery; it is intended behavior. It is not a “bug” in clustering, nor is it a bug in NetFT.sys… it is a feature, not a flaw. I say this because the most common first troubleshooting step I see is customers applying the latest hotfix for NetFT.sys… and that won’t help.
By far the most common reason for a 0x9e is the health monitoring that Failover Clustering conducts between the NetFT kernel mode driver and the user mode cluster service. If NetFT stops receiving heartbeats, user mode is considered non-responsive and clustering will bugcheck the box in an effort to force a recovery.
So the next question is: what caused user mode to become unresponsive? In general, you can troubleshoot this like any other user mode hang… you can set up perfmon and look for memory leaks, etc. The most valuable diagnostic tool is the dump captured when clustering bugchecks the box, which can be analyzed to reach root cause. This will involve a call to Microsoft support to help debug the dump.
One of the questions we receive is what type of memory dump the system should be configured for (small, kernel, active, or complete). This is a good question, and not everyone takes it into consideration. As discussed above, the blue screen is caused by user mode processes. User mode processes are not contained in a small or kernel memory dump, but they are contained in an active or complete dump. A few cases, though very few, can be diagnosed from a kernel dump. To properly follow the path to the user mode hang causing the blue screen, the investigation needs to start in the user mode Resource Hosting Subsystem (RHS), which requires an active or complete dump.
There are several different conditions that can invoke a bugcheck 0x9e. In this blog I will discuss the parameters logged in the Event ID 1001 and what they mean.
Decoding STOP 0x0000009E
The bugcheck code will have the following format with the following parameters.
Stop 0x0000009E (Parameter1, Parameter2, Parameter3, Parameter4)
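As a rough illustration of the format above, the following Python sketch pulls the bugcheck code and its four parameters out of an Event ID 1001 description string. The function name `parse_bugcheck` and the regular expression are assumptions made for this example; they are not part of any Windows or clustering tooling, and the pattern only covers description text shaped like the sample event shown earlier.

```python
import re

# Hypothetical pattern: matches "bugcheck was: <code> (<p1>, <p2>, <p3>, <p4>)"
# as it appears in the Event ID 1001 description text.
BUGCHECK_RE = re.compile(
    r"bugcheck was:\s*(0x[0-9a-fA-F]+)\s*"
    r"\(\s*(0x[0-9a-fA-F]+)\s*,\s*(0x[0-9a-fA-F]+)\s*,"
    r"\s*(0x[0-9a-fA-F]+)\s*,\s*(0x[0-9a-fA-F]+)\s*\)"
)


def parse_bugcheck(description):
    """Return (code, [p1, p2, p3, p4]) as integers, or None if no match."""
    match = BUGCHECK_RE.search(description)
    if match is None:
        return None
    values = [int(group, 16) for group in match.groups()]
    return values[0], values[1:]


# Sample text copied from the event shown at the top of this post.
sample = (
    "The computer has rebooted from a bugcheck. The bugcheck was: "
    "0x0000009e (0x0000000000000000, 0x0000000000000000, "
    "0x0000000000000000, 0x0000000000000000). A dump was saved in: "
    r"C:\Windows\MEMORY.DMP."
)

code, params = parse_bugcheck(sample)
print(hex(code), [hex(p) for p in params])
```

Returning plain integers keeps the later steps simple: each parameter can be compared or looked up directly without worrying about zero padding or letter case in the hex text.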
Parameter1 value meaning:
Process that failed to satisfy a health check within the configured timeout
Parameter2 value meaning:
Hex value that gives the timeout, in seconds, that was hit. This indicates how long it took for the bugcheck to be invoked.
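Because Parameter2 is logged in hex, it can help to convert it to decimal seconds before comparing it with a configured timeout. A minimal Python sketch follows; the helper name `timeout_seconds` and the input value are hypothetical and chosen only to illustrate the conversion.

```python
def timeout_seconds(parameter2):
    """Convert the Parameter2 hex string to an integer number of seconds."""
    return int(parameter2, 16)


# Hypothetical value for illustration: 0x4B0 is 1200 seconds (20 minutes).
seconds = timeout_seconds("0x00000000000004B0")
print(f"{seconds} seconds ({seconds / 60:g} minutes)")
```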
Parameter3 value meaning:
| Value | Meaning |
|---|---|
| 0x0000000000000000 | The source of the reason for the bugcheck was not specified. In OS versions prior to Win2012 R2 this will always be the value. |
| 0x0000000000000001 | The node has been bugchecked because the RHS process was attempting to gracefully close and did not complete successfully. |
| 0x0000000000000002 | The node has been bugchecked because a resource did not respond to a resource entry point call within the configured ‘DeadlockTimeout’ timeout. The node was configured to bugcheck by the ‘DebugBreakOnDeadlock’ registry key being set to a value of 3. |
| 0x0000000000000003 | The node has been bugchecked because of an unhandled exception with one of the cluster resources, and when attempting to recover, the RHS process did not terminate successfully within 20 minutes. |
| 0x0000000000000004 | The node has been bugchecked because of an unhandled exception with the Resource Hosting Subsystem (RHS), and when attempting to recover, the RHS process did not terminate successfully within 20 minutes. |
| 0x0000000000000005 | The node has been bugchecked because a resource did not respond to a resource entry point call within the ‘DeadlockTimeout’ timeout (5 minutes by default) and an attempt was made to terminate the RHS process to recover. However, the RHS process did not terminate successfully within the timeout, which is four times the ‘DeadlockTimeout’ timeout (20 minutes by default). |
| 0x0000000000000006 | The node has been bugchecked because a resource type did not respond to a resource entry point call within the ‘DeadlockTimeout’ timeout and an attempt was made to terminate the RHS process to recover. However, the RHS process did not terminate successfully. |
| 0x0000000000000007 | The node has been bugchecked because of an unhandled exception with the Cluster Service (ClusSvc), and when attempting to recover, the ClusSvc process did not terminate successfully within 20 minutes. |
| 0x0000000000000008 | The node has been bugchecked at the request of another node in the Failover Cluster. |
| 0x0000000000000009 | The node has been bugchecked because the cluster service detected that one of its internal subcomponents was unresponsive. The system was configured to bugcheck by the ‘HangRecoveryAction’ setting being set to a value of 4. |
| 0x000000000000000A | The node has been bugchecked because the kernel mode NetFT driver did not receive a heartbeat from the user mode Cluster Service within the configured ‘ClusSvcHangTimeout’ timeout. The recovery action was configured to bugcheck by the ‘HangRecoveryAction’ cluster common property being set to a value of 3 (default) or 4. |
Note: Parameter3 is a new value introduced in Windows Server 2012 R2 and will always be 0x0000000000000000 in previous releases.
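The Parameter3 values listed above lend themselves to a simple lookup. The Python sketch below is illustrative only: the names `PARAMETER3_REASONS` and `describe_parameter3` are made up for this example, and the descriptions are shortened paraphrases of the meanings given in this post, not official strings.

```python
# Shortened paraphrases of the Parameter3 meanings described in this post.
PARAMETER3_REASONS = {
    0x0: "Source not specified (always the case before Windows Server 2012 R2)",
    0x1: "RHS process did not complete a graceful close",
    0x2: "Resource exceeded DeadlockTimeout; DebugBreakOnDeadlock set to 3",
    0x3: "Unhandled exception in a cluster resource; RHS did not terminate within 20 minutes",
    0x4: "Unhandled exception in RHS; RHS did not terminate within 20 minutes",
    0x5: "Resource exceeded DeadlockTimeout; RHS did not terminate within 4x DeadlockTimeout",
    0x6: "Resource type exceeded DeadlockTimeout; RHS did not terminate",
    0x7: "Unhandled exception in ClusSvc; ClusSvc did not terminate within 20 minutes",
    0x8: "Bugcheck requested by another node in the cluster",
    0x9: "Cluster service subcomponent unresponsive; HangRecoveryAction set to 4",
    0xA: "NetFT received no heartbeat from ClusSvc within ClusSvcHangTimeout; HangRecoveryAction 3 or 4",
}


def describe_parameter3(parameter3):
    """Map a Parameter3 hex string to a short description of the cause."""
    value = int(parameter3, 16)
    return PARAMETER3_REASONS.get(value, f"Unknown value 0x{value:X}")


print(describe_parameter3("0x000000000000000A"))
```

Values outside the known set fall through to an "Unknown value" message rather than raising an error, which is a design choice for this sketch so that unexpected values are still reported.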
Parameter4 value meaning:
Currently unused and reserved for future use; it will always be 0x0000000000000000.
Principal PM Manager
Clustering & High-Availability