- 1 Overview
- 2 Node Drain
- 3 Node Resume with Failback
- 4 Additional Information:
- 5 Conclusion:
First published on MSDN on Apr 03, 2012
Windows Server 2012 Failover Clusters are easier to manage and maintain with the new “Node Drain” and “Resume with Failback” features. These features enable nodes to be gracefully drained for planned maintenance. This functionality is part of the infrastructure that enables “Cluster Aware Updating” (CAU) for patching nodes in a cluster.
Bringing an individual node down for planned maintenance is a common administrative task, for example to install a Service Pack or perform hardware upgrades.
On a Windows Server 2008 R2 Failover Cluster, this is a manual process where you place a cluster node in PAUSED state, and then move individual Roles (workloads) to the other nodes in the cluster as outlined in this KB article.
In Windows Server 2012, conducting planned maintenance on Failover Clusters is dramatically simplified, as these steps are automated in the Node Drain (or Node Maintenance Mode) feature.
Using Node Drain you can automate moving the Roles (workloads) off of a cluster node. Think of Node Drain as an enhanced, workload-aware Node Pause.
Steps automated by Node Drain:
1) The cluster node is put in a PAUSED state, which prevents other workloads hosted on other nodes from moving to the node.
2) The Roles (workloads) currently owned by the cluster node are sorted according to their Priority order. (Priority of Roles is another new Failover Clustering functionality in Windows Server 2012.)
3) The Roles are then distributed to the other active nodes in the cluster in priority order. Node Drain works with all workloads running on the cluster. For virtual machines, it leverages live migrations and memory-aware intelligent placement.
4) When all the Roles are moved off of the cluster node, the Node Drain operation is complete.
Initiating Node Drain through Failover Cluster Manager:
Initiating Node Drain through the Failover Cluster Manager snap-in is a simple one-click operation:
- Open Failover Cluster Manager
- On the left hand pane navigate to Nodes
- Right-click on the node you wish to drain
- Select “Pause”, then “Drain Roles”
Note: If you select “Do Not Drain Roles”, then it would simply “PAUSE” the node similar to Windows Server 2008 R2.
Initiating Node Drain through PowerShell:
You can initiate Node Drain using the “Suspend-ClusterNode” PowerShell command.
There are additional advanced options available through PowerShell to manage draining nodes, including:
- Drain: Initiates Node Drain.
- TargetNode: The destination node to which all drained roles will be moved or live migrated.
- ForceDrain: Moves the roles off of the draining node even if a Group cannot move, either because no other node can host the group or because it is in a locked state.
- Wait: Defines an amount of time to wait for the Node Drain operation to begin.
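As an illustration of the command described above, here is a minimal sketch of starting a drain from PowerShell. The node names `Node1` and `Node2` are hypothetical placeholders, and the sketch assumes the FailoverClusters module is available on the machine where it runs.

```powershell
# Load the failover clustering cmdlets (assumes the feature's PowerShell tools are installed).
Import-Module FailoverClusters

# Hypothetical node names; replace with nodes from your own cluster.
$nodeToDrain = 'Node1'
$targetNode  = 'Node2'

# Pause the node and drain its roles to the other active nodes.
Suspend-ClusterNode -Name $nodeToDrain -Drain

# Alternative: send every drained role to one specific node.
# Suspend-ClusterNode -Name $nodeToDrain -Drain -TargetNode $targetNode
```

Omitting `-Drain` only pauses the node without moving any roles, which corresponds to the “Do Not Drain Roles” choice in Failover Cluster Manager.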
Status of Drained Node:
When a Node Drain is initiated, the command returns the NodeDrainStatus property, indicating that the cluster node has begun the node drain operation. You can track the status of the ongoing node drain operation using these two cluster node common properties:
NodeDrainStatus: Indicates the current status of the Node Drain.
- 0 – Not Initiated
- 1 – In Progress
- 2 – Completed
- 3 – Failed
NodeDrainTarget: The cluster node ID of the node to which all the workloads will be moved. This ID is set when you use the TargetNode parameter.
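The status values above can be checked from PowerShell. The following is a rough sketch that reads the two properties (NodeDrainStatus and NodeDrainTarget) and polls until the drain is no longer in progress. The node name `Node1` and the five-second polling interval are assumptions chosen for illustration.

```powershell
Import-Module FailoverClusters

# Hypothetical node name; replace with the node being drained.
$nodeName = 'Node1'

# Show the pause state together with the two drain-related properties.
Get-ClusterNode -Name $nodeName |
    Format-List Name, State, NodeDrainStatus, NodeDrainTarget

# Poll while the drain is still in progress (status value 1).
do {
    Start-Sleep -Seconds 5
    $node = Get-ClusterNode -Name $nodeName
} while ([int]$node.NodeDrainStatus -eq 1)

# 2 means Completed and 3 means Failed, per the list above.
Write-Output "Drain finished with status $([int]$node.NodeDrainStatus)"
```

Casting the property to `[int]` keeps the comparison aligned with the numeric values listed above, regardless of how the value is displayed.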
Node Drain Failure:
Node Drain will fail if a virtual machine’s Live Migration fails for any reason, or if a Role cannot be moved because the node being drained is the last possible owner node for the Role.
Upon encountering an error with an individual role, the node drain operation will continue to drain the remaining roles hosted on the node. The status of the node drain is set to “3” (Failed) only after the remaining roles are drained from the cluster node.
After a failure you can restart the Node Drain, and you can optionally specify the “-ForceDrain” parameter to override any errors encountered during the initial node drain.
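A hedged sketch of retrying a failed drain is shown below. It assumes the node name `Node1` (hypothetical) and assumes that re-running `Suspend-ClusterNode` against the already paused node is how the drain is restarted; only the `-ForceDrain` parameter itself is named in this article.

```powershell
Import-Module FailoverClusters

# Hypothetical node name; replace with the node whose drain failed.
$nodeName = 'Node1'

$node = Get-ClusterNode -Name $nodeName

# Status 3 means the previous drain attempt failed.
if ([int]$node.NodeDrainStatus -eq 3) {
    # Retry the drain and override the errors hit during the first attempt.
    Suspend-ClusterNode -Name $nodeName -Drain -ForceDrain
}
```

Because a forced drain moves roles even when a group could not otherwise move, it is worth reviewing which roles failed before forcing the operation.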
Rebooting a Drained Node:
Once a node is drained, it will remain in the PAUSED state across reboots to prevent any roles from moving to that node, until the node is resumed. This keeps the node drained for the duration of the maintenance window.
Node Resume with Failback
When a node is drained, the cluster will remember the workload(s) that were moved off of the node. When resuming the node after maintenance, you have the option of moving back all the workload(s) to the cluster node. This will restore the cluster back to the original state it was in before the maintenance.
Steps automated by Node Resume with Failback:
1) The cluster node is removed from PAUSED state – this enables workload(s) to move to this node.
2) The workload(s) that were originally drained from the node are moved back using Failback.
- If a failback policy is configured to fail back only during a specific failback window, resume will honor the setting and the failback of the roles will be delayed until that window.
Resuming Node through Failover Cluster Manager:
- Open Failover Cluster Manager
- On the left hand pane navigate to Nodes
- Right-click on the node you wish to resume
- Select “Resume”, then “Fail Roles Back”
Note: If you select “Do Not Fail Roles Back”, then it would simply “RESUME” the node similar to Windows Server 2008 R2.
Resuming Node through PowerShell:
You can resume a node using the Resume-ClusterNode PowerShell command.
There are additional advanced options available through PowerShell to manage resuming nodes, including:
- Failback: Defines the type of failback to perform after the node is resumed.
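For illustration, a minimal sketch of resuming a drained node follows. The node name `Node1` is hypothetical; the values shown for `-Failback` (`Immediate`, `NoFailback`, `Policy`) are the ones accepted by `Resume-ClusterNode` in released builds and are not listed in this article.

```powershell
Import-Module FailoverClusters

# Hypothetical node name; replace with the node that was drained.
$nodeName = 'Node1'

# Resume the node and move the drained roles back right away.
Resume-ClusterNode -Name $nodeName -Failback Immediate

# Alternatives:
# Resume-ClusterNode -Name $nodeName -Failback NoFailback   # resume only, leave roles where they are
# Resume-ClusterNode -Name $nodeName -Failback Policy       # defer to each role's configured failback policy
```

`NoFailback` corresponds to the “Do Not Fail Roles Back” choice in Failover Cluster Manager.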
Cancelling Node Drain:
Draining a node may be a long running operation. A Node Drain that is in progress can be cancelled by initiating a Node Resume. This will cause the Node Drain operation to stop, and if Fail Roles Back is specified, the drained workloads which were moved will be moved back to the cluster node.
Configuring the Move Type for a Virtual Machine
Node Drain and Node Resume with Failback will leverage Live Migration for virtual machines so that a node can be drained with no downtime. Live Migration may at times be a long running operation, and there may be scenarios where you wish to quickly drain a node. Node draining provides the flexibility to allow configuration of how VMs should be moved, using either Live Migration or Quick Migration.
You also have granular control over the move type used, based on the priority setting of the VM. This is configured with the Virtual Machine resource type private property NodeDrainMoveTypeThreshold:
Priority of Virtual Machines
Virtual Machines with Priority equal to or higher than the specified priority will be moved using Live Migration.
Virtual Machines with Priority lower than the specified priority will be moved using Quick Migration.
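The threshold described above could be inspected and changed along these lines. This is a sketch under assumptions: the property name is taken from this article as written (released builds may list it under a different name, such as MoveTypeThreshold, so list the parameters first to confirm), and the value 2000 assumes the usual role priority scale of 3000 (High), 2000 (Medium), and 1000 (Low).

```powershell
Import-Module FailoverClusters

# List the private properties of the Virtual Machine resource type
# to confirm the exact property name on your build.
Get-ClusterResourceType -Name 'Virtual Machine' | Get-ClusterParameter

# Assumed property name (from this article) and an example threshold of 2000 (Medium):
# VMs at Medium priority or higher are live migrated, lower-priority VMs are quick migrated.
Get-ClusterResourceType -Name 'Virtual Machine' |
    Set-ClusterParameter -Name NodeDrainMoveTypeThreshold -Value 2000
```

Raising the threshold shifts more virtual machines to Quick Migration, which drains the node faster at the cost of brief downtime for those VMs.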
Node Drain is a great new time-saving feature in Windows Server 2012 Failover Clustering for conducting planned maintenance. Using this feature, you can easily drain the workload(s) off of a cluster node in a single click, and easily restore them when maintenance operations are completed on the cluster node.
Amitabh Tamhane
Program Manager II
Clustering & High Availability

Lokesh Koppolu
Principal Development Lead
Clustering & High Availability