In this blog post https://blogs.msdn.microsoft.com/clustering/2014/12/08/troubleshooting-cluster-shared-volume-auto-pauses-event-5120/ we discussed what a Cluster Shared Volumes (CSV) event ID 5120 means and how to troubleshoot it. In particular, we discussed auto-pause due to STATUS_IO_TIMEOUT (c00000b5) and some options for troubleshooting it. In this post we will discuss how to troubleshoot it using LiveDumps, which let you debug the system with no downtime.
First, let’s discuss what a LiveDump is. Some of you are probably familiar with kernel crash dumps https://support.microsoft.com/en-us/kb/927069. Kernel dumps present at least two challenges:
- Bugcheck halts the system resulting in downtime
- Entire contents of memory are dumped to a file. On a system with a lot of memory, you might not have enough space on your system drive for the OS to save the dump
The good news is that LiveDump solves both of these issues. LiveDump is a feature that was added in Windows Server 2012 R2. For the purpose of this discussion, you can think of LiveDump as an OS feature that lets you create a consistent snapshot of kernel memory and save it to a dump file for later analysis. Taking this snapshot does NOT bugcheck the system, so there is no downtime.
LiveDump does not include all of kernel memory; it excludes information that is not valuable for debugging. It does not include pages from the standby list or file caches. The kind of LiveDump that the cluster collects for you also omits pages consumed by the hypervisor, and in Windows Server 2016 the cluster also excludes the CSV cache from the LiveDump. As a result, a LiveDump file is much smaller than the dump you would get by bugchecking the server, and it does not require as much space on your system drive. Windows Server 2016 also adds a new bugcheck option called “Active Dump”, which similarly excludes unnecessary information to create a smaller dump file during bugchecks.
You can create a LiveDump manually using LiveKD from Windows Sysinternals (https://technet.microsoft.com/en-us/sysinternals/bb897415.aspx). To generate a LiveDump, run the command “livekd -ml -o <path to a dump file>” from an elevated command prompt. The dump file does not have to be on the system drive; you can save it to any location. Here is an example of creating a live dump on a Windows 10 desktop with 12 GB of RAM, which resulted in a dump file of only about 2.8 GB (2,773,164,032 bytes).
    D:\>livekd -ml -o d1.dmp

    LiveKd v5.40 - Execute kd/windbg on a live system
    Sysinternals - www.sysinternals.com
    Copyright (C) 2000-2015 Mark Russinovich and Ken Johnson

    Saving live dump to D:\d1.dmp... done.

    D:\>dir *.dmp
     Directory of D:\

    02/25/2016  12:05 PM     2,773,164,032 d1.dmp
                   1 File(s)  2,773,164,032 bytes
                   0 Dir(s)  3,706,838,417,408 bytes free
If you are wondering how much disk space a LiveDump needs on your system, you can generate one using LiveKD and check its size.
You might wonder what is so great about LiveDump for troubleshooting. Logs and traces work well when something fails, because hopefully the log contains a record where some component admits that it is failing operations and blames whoever caused that. LiveDump is great when we need to troubleshoot a problem where something is taking a long time and nothing is technically failing. If we start a watchdog when an operation starts, and the watchdog expires before the operation completes, then we can take a dump of the system in the hope that we can walk the wait chain for that operation and see who owns it and why it is not completing.
Looking at a LiveDump is just like looking at a kernel dump. It requires some skill and an understanding of Windows internals. It has a steep learning curve for customers, but it is a great tool for Microsoft support and product teams, who already have that expertise. If you reach out to Microsoft support with an issue where something is stuck in the kernel, and you have a LiveDump taken while it was stuck, then the chances of promptly root-causing the issue are much higher.
Windows Server Failover Clustering has many watchdogs that control how long it should wait for cluster resources to execute calls such as resource online or offline, or how long it should wait for CSVFS to complete a state transition. From experience we know that in most cases when these scenarios get stuck, they are stuck in the kernel, so we automatically ask Windows Error Reporting (WER) to generate a LiveDump. It is important to note that LiveKD uses a different API, which produces a LiveDump without checking any other conditions. The cluster uses Windows Error Reporting, and Windows Error Reporting throttles LiveDump creation. We use WER because it manages disk space consumption for us, and because it also sends telemetry about the incident to Microsoft, where we can see which issues are affecting customers. This helps us prioritize and strategize fixes. Starting with Windows Server 2016 you can control WER telemetry through the common telemetry settings; before that, there was a separate Control Panel applet to control what WER is allowed to share with Microsoft.
By default, Windows Error Reporting allows only one LiveDump per report type per 7 days and only one LiveDump per machine per 5 days. You can change that by setting the following registry values:
    reg add "HKLM\Software\Microsoft\Windows\Windows Error Reporting\FullLiveKernelReports" /v SystemThrottleThreshold /t REG_DWORD /d 0 /f
    reg add "HKLM\Software\Microsoft\Windows\Windows Error Reporting\FullLiveKernelReports" /v ComponentThrottleThreshold /t REG_DWORD /d 0 /f
Once the LiveDump is created, WER launches a user-mode process that creates a minidump from the LiveDump and then immediately deletes the LiveDump. The minidump is only a couple of hundred kilobytes, but unfortunately it is not helpful, because it contains the call stack of only the thread that invoked LiveDump creation, and we need all the other kernel threads to track down where we are stuck. You can tell WER to keep the original LiveDumps using these two registry values:
    reg add "HKLM\Software\Microsoft\Windows\Windows Error Reporting\FullLiveKernelReports" /v FullLiveReportsMax /t REG_DWORD /d 10 /f
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\CrashControl" /v AlwaysKeepMemoryDump /t REG_DWORD /d 1 /f
Set FullLiveReportsMax to the number of dumps you want to keep. How many to keep depends on how much free space you have and on the size of a LiveDump.
You need to reboot the machine for the Windows Error Reporting registry changes to take effect.
LiveDumps created by Windows Error Reporting are located in %SystemDrive%\Windows\LiveKernelReports.
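The trade-off between FullLiveReportsMax, free disk space, and LiveDump size can be worked out with simple arithmetic. The following Python sketch is only an illustration: the function name, the 60 GiB of free space, and the 20 GiB reserve are assumptions made up for the example, and the dump size is the one from the LiveKD run above. Measure a LiveDump on your own system before picking a value.

```python
def max_dumps_that_fit(free_bytes: int, dump_bytes: int, reserve_bytes: int = 0) -> int:
    """Return how many dumps of dump_bytes fit in free_bytes
    while leaving reserve_bytes untouched."""
    if dump_bytes <= 0:
        raise ValueError("dump_bytes must be positive")
    usable = max(free_bytes - reserve_bytes, 0)
    return usable // dump_bytes


GIB = 1024 ** 3

# Size of d1.dmp from the LiveKD example earlier in this post.
DUMP_BYTES = 2_773_164_032

# Hypothetical system drive: 60 GiB free, and we want to keep 20 GiB in reserve.
suggested_max = max_dumps_that_fit(60 * GIB, DUMP_BYTES, reserve_bytes=20 * GIB)
print(suggested_max)  # 15
```

A value computed this way is only an upper bound for FullLiveReportsMax; dump sizes vary with what the kernel is doing at the time the dump is taken.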
Windows Server 2016
In Windows Server 2016, failover cluster LiveDump creation is on by default. You can turn it on or off by changing the lowest bit of the cluster DumpPolicy public property. By default, this bit is set, which means the cluster is allowed to generate LiveDumps.
    PS C:\Windows\system32> (get-cluster).DumpPolicy
    1118489
If you set this bit to 0, the cluster will stop generating LiveDumps.
    PS C:\Windows\system32> (get-cluster).DumpPolicy=1118488
You can set the bit back to 1 to enable LiveDumps again.
    PS C:\Windows\system32> (get-cluster).DumpPolicy=1118489
Changes take effect immediately on all cluster nodes. You do NOT need to reboot cluster nodes.
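The two values above differ only in the lowest bit. As a minimal sketch of that arithmetic (the constant and function names are hypothetical, chosen for this example; the meaning of the other DumpPolicy bits is not covered here), the new value can be derived from the current one instead of being typed by hand:

```python
# Lowest bit of DumpPolicy controls LiveDump creation, as described above.
LIVEDUMP_BIT = 0x1


def livedump_enabled(dump_policy: int) -> bool:
    """True if the lowest bit of the DumpPolicy value is set."""
    return bool(dump_policy & LIVEDUMP_BIT)


def with_livedump(dump_policy: int, enabled: bool) -> int:
    """Return dump_policy with only the lowest bit changed."""
    if enabled:
        return dump_policy | LIVEDUMP_BIT
    return dump_policy & ~LIVEDUMP_BIT


default_policy = 1118489                          # value reported by (get-cluster).DumpPolicy
disabled_policy = with_livedump(default_policy, False)
print(disabled_policy)                            # 1118488
print(with_livedump(disabled_policy, True))       # 1118489
```

Changing only that one bit leaves the rest of the property exactly as the cluster reported it.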
Here is the list of LiveDump report types generated by the cluster. Dump file names have the report type string as a prefix.
| Report type | Description |
|-------------|-------------|
| CsvIoT      | A CSV volume AutoPaused due to STATUS_IO_TIMEOUT and cluster on the coordinating node created LiveDump |
| CsvStateIT  | CSV state transition to Init state is taking too long |
| CsvStatePT  | CSV state transition to Paused state is taking too long |
| CsvStateDT  | CSV state transition to Draining state is taking too long |
| CsvStateST  | CSV state transition to SetDownLevel state is taking too long |
| CsvStateAT  | CSV state transition to Active state is taking too long |
You can learn more about CSV state transition in this blog post:
The following LiveDump report types are generated when a cluster resource call is taking too long:
| Report type | Description |
|-------------|-------------|
| ClusResCO   | Cluster resource Open call is taking too long |
| ClusResCC   | Cluster resource Close call is taking too long |
| ClusResCU   | Cluster resource Online call is taking too long |
| ClusResCD   | Cluster resource Offline call is taking too long |
| ClusResCK   | Cluster resource Terminate call is taking too long |
| ClusResCA   | Cluster resource Arbitrate call is taking too long |
| ClusResCR   | Cluster resource Control call is taking too long |
| ClusResCT   | Cluster resource Type Control call is taking too long |
| ClusResCI   | Cluster resource IsAlive call is taking too long |
| ClusResCL   | Cluster resource LooksAlive call is taking too long |
| ClusResCF   | Cluster resource Fail call is taking too long |
You can learn more about cluster resource state machine in these two blog posts:
You can control which resource types generate LiveDumps by changing the lowest bit of the resource type DumpPolicy public property. Here are the default values:
    C:\> Get-ClusterResourceType | ft Name,DumpPolicy

    Name                                  DumpPolicy
    ----                                  ----------
    Cloud Witness                         5225058576
    DFS Replicated Folder                 5225058576
    DHCP Service                          5225058576
    Disjoint IPv4 Address                 5225058576
    Disjoint IPv6 Address                 5225058576
    Distributed File System               5225058576
    Distributed Network Name              5225058576
    Distributed Transaction Coordinator   5225058576
    File Server                           5225058576
    File Share Witness                    5225058576
    Generic Application                   5225058576
    Generic Script                        5225058576
    Generic Service                       5225058576
    Health Service                        5225058576
    IP Address                            5225058576
    IPv6 Address                          5225058576
    IPv6 Tunnel Address                   5225058576
    iSCSI Target Server                   5225058576
    Microsoft iSNS                        5225058576
    MSMQ                                  5225058576
    MSMQTriggers                          5225058576
    Nat                                   5225058576
    Network File System                   5225058577
    Network Name                          5225058576
    Physical Disk                         5225058577
    Provider Address                      5225058576
    Scale Out File Server                 5225058577
    Storage Pool                          5225058577
    Storage QoS Policy Manager            5225058577
    Storage Replica                       5225058577
    Task Scheduler                        5225058576
    Virtual Machine                       5225058576
    Virtual Machine Cluster WMI           5225058576
    Virtual Machine Configuration         5225058576
    Virtual Machine Replication Broker    5225058576
    Virtual Machine Replication Coor...   5225058576
    WINS Service                          5225058576
By default, Physical Disk resources produce LiveDumps. You can disable that by setting the lowest bit to 0. Here is an example of how to do that for the Physical Disk resource type:
    (Get-ClusterResourceType -Name "Physical Disk").DumpPolicy=5225058576
You can re-enable it later:
    (Get-ClusterResourceType -Name "Physical Disk").DumpPolicy=5225058577
Changes take effect immediately on all new calls; there is no need to take the resource offline and online or to restart the cluster.
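To see at a glance which resource types have LiveDump creation enabled, you can test the lowest bit of each DumpPolicy value. The Python sketch below does that for a handful of rows copied from the default listing above; the dictionary is just sample data typed in for the example, not something read from a live cluster.

```python
# A few rows copied from the default Get-ClusterResourceType listing above.
default_policies = {
    "File Server": 5225058576,
    "Network File System": 5225058577,
    "Physical Disk": 5225058577,
    "Storage Pool": 5225058577,
    "Virtual Machine": 5225058576,
}

# A resource type generates LiveDumps when the lowest bit of its DumpPolicy is set.
enabled_types = sorted(
    name for name, policy in default_policies.items() if policy & 1
)
print(enabled_types)  # ['Network File System', 'Physical Disk', 'Storage Pool']
```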
The last group contains the report types that the cluster service generates when it observes that some operations are taking too long.
| Report type  | Description |
|--------------|-------------|
| ClusWatchDog | Cluster service watchdog |
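Because dump file names start with the report type, the tables above can be turned into a small lookup that tells you what triggered a given dump. The Python sketch below is one way to do that. The function name and the example file name are hypothetical (the exact naming pattern after the prefix is not described here), and the match is made case-insensitive as an assumption.

```python
from typing import Optional

# Report type prefixes and descriptions, taken from the tables above.
REPORT_TYPES = {
    "CsvIoT": "CSV volume AutoPaused due to STATUS_IO_TIMEOUT",
    "CsvStateIT": "CSV state transition to Init state is taking too long",
    "CsvStatePT": "CSV state transition to Paused state is taking too long",
    "CsvStateDT": "CSV state transition to Draining state is taking too long",
    "CsvStateST": "CSV state transition to SetDownLevel state is taking too long",
    "CsvStateAT": "CSV state transition to Active state is taking too long",
    "ClusResCO": "Cluster resource Open call is taking too long",
    "ClusResCC": "Cluster resource Close call is taking too long",
    "ClusResCU": "Cluster resource Online call is taking too long",
    "ClusResCD": "Cluster resource Offline call is taking too long",
    "ClusResCK": "Cluster resource Terminate call is taking too long",
    "ClusResCA": "Cluster resource Arbitrate call is taking too long",
    "ClusResCR": "Cluster resource Control call is taking too long",
    "ClusResCT": "Cluster resource Type Control call is taking too long",
    "ClusResCI": "Cluster resource IsAlive call is taking too long",
    "ClusResCL": "Cluster resource LooksAlive call is taking too long",
    "ClusResCF": "Cluster resource Fail call is taking too long",
    "ClusWatchDog": "Cluster service watchdog",
}


def describe_dump(file_name: str) -> Optional[str]:
    """Return the description for a dump file based on its report type
    prefix, or None if the prefix is not a known cluster report type."""
    lowered = file_name.lower()
    for prefix, description in REPORT_TYPES.items():
        if lowered.startswith(prefix.lower()):
            return description
    return None


# Hypothetical file name; only the "CsvIoT" prefix is significant here.
print(describe_dump("CsvIoT-20160225-1205.dmp"))
```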
Windows Server 2012 R2
We had such a positive experience troubleshooting issues using LiveDump on Windows Server 2016 that we’ve backported a subset of that functionality to Windows Server 2012 R2. You need to make sure that you have all the recommended patches outlined here. On Windows Server 2012 R2 LiveDumps are not generated by default. They can be enabled using the following PowerShell command:
    Get-Cluster | Set-ClusterParameter -create LiveDumpEnabled -value 1
LiveDump can be disabled using the following command:
    Get-Cluster | Set-ClusterParameter -create LiveDumpEnabled -value 0
Only the CSV report types were backported, so you will not see LiveDumps from cluster resource calls or the cluster service watchdog. Windows Error Reporting throttling also needs to be adjusted as discussed above.
CSV AutoPause due to STATUS_IO_TIMEOUT (c00000b5)
Let’s see how LiveDump helps with troubleshooting this issue. In the blog post https://blogs.msdn.microsoft.com/clustering/2014/12/08/troubleshooting-cluster-shared-volume-auto-pauses-event-5120/ we discussed that it is usually caused by an IO on the coordinating node taking a long time. As a result, CSVFS on a non-coordinating node gets the error STATUS_IO_TIMEOUT and notifies the cluster service about that event. The cluster service then creates a LiveDump with report type CsvIoT on the coordinating node, where the IO is taking time. If we are lucky, and the IO has not completed before the LiveDump is generated, then we can load the dump in WinDbg to try to find the IO that is taking a long time and see who owns it.
Principal Software Engineer
High-Availability & Storage
To learn more, here are others in the Cluster Shared Volume (CSV) blog series:
Cluster Shared Volume (CSV) Inside Out
Cluster Shared Volume Diagnostics
Cluster Shared Volume Performance Counters
Cluster Shared Volume Failure Handling
Troubleshooting Cluster Shared Volume Auto-Pauses – Event 5120
Troubleshooting Cluster Shared Volume Recovery Failure – System Event 5142
Cluster Shared Volume – A Systematic Approach to Finding Bottlenecks