In today’s cloud-scale environments, which commonly comprise commodity hardware, transient failures have become more common than hard failures. In these circumstances, reacting aggressively to transient failures can cause more downtime than it prevents. Windows Server 2016 therefore introduces increased Virtual Machine (VM) resiliency to intra-cluster communication failures in your compute cluster.
Interesting Transient Failure Scenarios
The following are some potentially transient scenarios where it would be beneficial for your VM to be more resilient to intra-cluster communication failures:
- Node disconnected: The cluster service attempts to connect to all active nodes. The disconnected (Isolated) node cannot talk to any node in an active cluster membership.
- Cluster Service crash: The Cluster Service on a node is down. The node is not communicating with any other node.
- Asymmetric disconnect: The Cluster Service is attempting to connect to all active nodes. The isolated node can talk to at least one node in active cluster membership.
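The three scenarios above differ only in whether the Cluster Service is running and how many active nodes the affected node can still reach. A minimal sketch of that distinction follows; the function name, its parameters, and the "Healthy" fallback are illustrative assumptions, not part of any cluster API.

```python
def classify_disconnect(service_running: bool, reachable_peers: int, active_peers: int) -> str:
    """Map a node's view of the cluster onto the scenarios listed above.

    reachable_peers: active nodes this node can still talk to.
    active_peers:    nodes currently in active cluster membership (excluding this node).
    """
    if not service_running:
        # The Cluster Service is down, so the node talks to no one.
        return "Cluster Service crash"
    if reachable_peers == 0:
        # Service is up but no active node can be reached.
        return "Node disconnected"
    if reachable_peers < active_peers:
        # At least one, but not all, active nodes can be reached.
        return "Asymmetric disconnect"
    # Assumption: reaching every active peer means no communication failure.
    return "Healthy"
```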
New Failover Clustering States
In Windows Server 2016, to reflect the new Failover Cluster workflow in the event of transient failures, three new states have been introduced:
- A new VM state, Unmonitored, has been introduced in Failover Cluster Manager to reflect a VM that is no longer monitored by the cluster service.
- Two new cluster node states have been introduced to reflect nodes that are not in active membership but were hosting VM role(s) before being removed from active membership:
  - Isolated:
    - The node is no longer in an active membership
    - The node continues to host the VM role
  - Quarantine:
    - The node is no longer allowed to join the cluster for a fixed time period (default: 2 hours)
    - This action prevents flapping nodes from negatively impacting other nodes and the overall cluster health
    - By default, a node is quarantined if it ungracefully leaves the cluster three times within an hour
    - VMs hosted by the node are gracefully drained once quarantined
    - No more than 25% of nodes can be quarantined at any given time
    - The node can be brought out of quarantine by running the Failover Clustering PowerShell cmdlet Start-ClusterNode with the -CQ or -ClearQuarantine flag.
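The quarantine rule described above (three ungraceful exits within an hour, capped at 25% of nodes) can be sketched as a sliding-window counter. This is an illustrative model only: the class and function names are hypothetical, and how the cluster rounds the 25% cap is an assumption made here for the sake of the example.

```python
from collections import deque

QUARANTINE_THRESHOLD = 3          # ungraceful exits before quarantine (default from the text)
WINDOW_SECONDS = 60 * 60          # "within an hour"
MAX_QUARANTINED_FRACTION = 0.25   # "no more than 25% of nodes"


class NodeHistory:
    """Tracks ungraceful cluster exits for a single node."""

    def __init__(self) -> None:
        self.exits = deque()  # timestamps (seconds) of recent ungraceful exits

    def record_ungraceful_exit(self, now: float) -> bool:
        """Record an exit at time `now`; return True if the threshold is reached."""
        self.exits.append(now)
        # Drop exits that fell out of the one-hour window.
        while self.exits and now - self.exits[0] > WINDOW_SECONDS:
            self.exits.popleft()
        return len(self.exits) >= QUARANTINE_THRESHOLD


def may_quarantine(currently_quarantined: int, total_nodes: int) -> bool:
    """Assumption: the cap is checked counting the node about to be quarantined."""
    return (currently_quarantined + 1) / total_nodes <= MAX_QUARANTINED_FRACTION
```

Under these assumptions, a four-node cluster can hold at most one node in quarantine at a time, and exits spread over more than an hour never accumulate to the threshold.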
VM Compute Resiliency Workflow in Windows Server 2016
The VM resiliency workflow in a compute cluster is as follows:
- In the event of a “transient” intra-cluster communication failure on a node hosting VMs, the node is placed into an Isolated state and removed from its active cluster membership. The cluster service now considers the VMs on the node to be in an Unmonitored state. What happens to each VM depends on its storage:
  - File Storage backed (SMB): The VM continues to run in the Online state.
  - Block Storage backed (FC / FCoE / iSCSI / SAS): The VM is placed in the Paused Critical state. This is because the isolated node no longer has access to the Cluster Shared Volumes in the cluster.
  - Note that you can monitor the “true” state of the VM using the same tools as you would for a stand-alone VM (such as Hyper-V Manager).
- If the isolated node continues to experience intra-cluster communication failures for longer than a configurable period (default: 4 minutes), the VM is failed over to a suitable node in the cluster and the isolated node is moved to a Down state.
- If a node is isolated a configurable number of times (default: three) within an hour, it is placed into a Quarantine state for a configurable period (default: two hours) and all the VMs from the node are drained to a suitable node in the cluster.
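The first two steps of the workflow above can be summarized as two small lookups: what a VM does while its host is isolated, and when the host moves from Isolated to Down. The sketch below uses the defaults stated in the text; the function names are hypothetical, and treating exactly four minutes as already expired is an assumption.

```python
ISOLATION_TIMEOUT_SECONDS = 4 * 60  # default from the text: 4 minutes

FILE_STORAGE = {"SMB"}
BLOCK_STORAGE = {"FC", "FCoE", "iSCSI", "SAS"}


def vm_state_while_isolated(storage: str) -> str:
    """'True' VM state on an isolated node (the cluster itself reports Unmonitored)."""
    if storage in FILE_STORAGE:
        return "Online"
    if storage in BLOCK_STORAGE:
        # The isolated node loses access to Cluster Shared Volumes.
        return "Paused Critical"
    raise ValueError(f"unknown storage type: {storage}")


def node_state(seconds_disconnected: float,
               timeout: float = ISOLATION_TIMEOUT_SECONDS) -> str:
    """Node state after a communication failure lasting `seconds_disconnected`."""
    if seconds_disconnected < timeout:
        return "Isolated"  # VMs stay on the node, unmonitored; no failover yet
    return "Down"          # VMs are failed over to another node
```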
Configuring Node Isolation and Quarantine settings
To achieve the desired Service Level Agreement guarantees for your environment, you can configure the following cluster settings, which control how a node is placed into isolation or quarantine:
Defines how unknown failures are handled
1 – Allow the node to be in Isolated
2 – Always let a node go to an Isolated state and give it time before taking over ownership of the VMs.
Duration to allow VM to run isolated (in seconds)
0 – Reverts to pre-Windows Server 2016 behavior
Group common property for granular control:
Note: A value of -1 for the group property causes the cluster property to be used.
Number of failures before a node is Quarantined.
Duration to disallow cluster node join (in seconds)
0xFFFFFFFF – Never allow node to join (in seconds)
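Two of the settings above use sentinel values: -1 on the group property defers to the cluster-wide value, and 0xFFFFFFFF on the quarantine duration means the node is never allowed to rejoin. A minimal sketch of how those sentinels resolve follows; the function and constant names are illustrative and are not actual cluster property names.

```python
from typing import Optional

USE_CLUSTER_VALUE = -1       # group property: fall back to the cluster property
NEVER_REJOIN = 0xFFFFFFFF    # quarantine duration: never allow the node to join


def effective_isolation_period(group_period: int, cluster_period: int) -> int:
    """Seconds a VM may run isolated, after applying the -1 fallback.

    A result of 0 corresponds to pre-Windows Server 2016 behavior.
    """
    if group_period == USE_CLUSTER_VALUE:
        return cluster_period
    return group_period


def rejoin_delay(quarantine_duration: int) -> Optional[int]:
    """Seconds until a quarantined node may rejoin; None means never."""
    if quarantine_duration == NEVER_REJOIN:
        return None
    return quarantine_duration
```

For example, with a cluster-wide value of 240 seconds, a group set to -1 resolves to 240, while a group set to 60 keeps its own shorter period.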