This post was written by Don Stanwyck, Senior Program Manager, Windows Core Networking.
Remote DMA (RDMA) is an incredible technology that allows networked hosts to exchange information with virtually no CPU overhead and with very little latency in the end system. Microsoft has been shipping support for RDMA in Windows Server since Windows Server 2012 and in Windows 10 (some SKUs) since its first release. With the release of Windows Server 1709, Windows Server supports RDMA in the guest. RDMA is presented over the SR-IOV path, i.e., with direct hardware access from the guest to the RDMA engine in the NIC hardware, and with essentially the same latency (and low CPU utilization) as seen in the host.
This week we published a how-to guide (https://gallery.technet.microsoft.com/RDMA-configuration-425bcdf2) on deploying RDMA on native hosts, on virtual NICs in the host partition (Converged NIC), and in Hyper-V guests. This guide is intended to help reduce the amount of time our customers spend trying to get their RDMA networks deployed and working.
As many of my readers are aware, in Windows Server 2012 we shipped the first version of RDMA on Windows. It supported only native interfaces, i.e., direct binding of the SMB protocol to the RDMA capabilities offered by the physical NIC. Today we refer to that mode of operation as Network Direct Kernel Provider Interface (NDKPI) Mode 1, or more simply, Native RDMA.
Native RDMA had a drawback: a customer who wanted both RDMA and Hyper-V could not run them on the same NICs. With Windows Server 2016 came the solution: Converged NIC operation. Now a customer who wanted to use RDMA and Hyper-V at the same time could do so on the same NICs, and even have them in a team for bandwidth aggregation and failover protection. The ability to use a host vNIC for both host TCP traffic and RDMA traffic, while sharing the physical NIC with guest traffic, is called NDKPI Mode 2.
That wasn’t enough. Customers told us they wanted RDMA access from within VMs. They wanted the same low latency, low CPU utilization path that the host gets from using RDMA to be available from inside the guest. We heard them.
Windows Server 1709 supports RDMA in the guest. RDMA is presented over the SR-IOV path, i.e., with direct hardware access from the guest to the RDMA engine in the NIC hardware. (This is NDKPI Mode 3.) This means that the latency between a guest and the network is essentially the same as between a native host and the network. Today this is only available on Windows Server 1709 with guests that are also Windows Server 1709. Watch for support in other guests to be announced in upcoming releases.
This means that trusted applications in guests can now use any technology written to our kernel RDMA interface, e.g., SMB Direct, S2D, or 3rd party technologies, to communicate using RDMA with any other network entity.
Yes, there is that word “trusted” in the previous statement. What does that mean? It means that for today, just like with any other SR-IOV connected VM, the Hyper-V switch can’t apply ACLs, QoS policies, etc., so the VM may do some things that could cause some level of discomfort for other guests or even the host. For example, the VM may attempt to transmit a large quantity of data that would compete with the other traffic from the host (including TCP/IP traffic from non-SR-IOV guests).
So how can that be managed? There are two answers to that question, one present and one future. In the present, Windows allows the system administrator to affinitize VMs to specific physical NICs, so a concerned administrator could affinitize the RDMA-enabled VM to a separate physical NIC from the other guests in the system (the Switch Embedded Team can support up to 8 physical NICs). In the future, at a time yet to be announced, Windows Server expects to provide bandwidth management (reservations and limits) of SR-IOV-connected VMs for both their RDMA and non-RDMA traffic, and enforcement of ACLs that are programmed by the host administrator and apply to SR-IOV traffic (IP-based and RDMA). Our hardware partners are busy implementing the new interfaces that support these capabilities.
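To make the present-day workaround concrete, here is a minimal planning sketch in Python. It only models the idea of keeping RDMA guests on their own physical NICs within a team of at most 8 members; the `Guest` class, the `plan_nic_affinity` function, and the VM and NIC names are all hypothetical, and the sketch does not talk to Hyper-V at all.

```python
from dataclasses import dataclass

# From the text: a Switch Embedded Team can support up to 8 physical NICs.
MAX_SET_TEAM_MEMBERS = 8


@dataclass(frozen=True)
class Guest:
    """Hypothetical description of a VM for planning purposes."""
    name: str
    uses_rdma: bool


def plan_nic_affinity(guests, team_nics, rdma_nic_count=1):
    """Return {guest name: physical NIC name}.

    The first `rdma_nic_count` team members are set aside for RDMA guests;
    all other guests are spread round-robin over the remaining NICs.
    """
    if not 2 <= len(team_nics) <= MAX_SET_TEAM_MEMBERS:
        raise ValueError("need between 2 and 8 team members to separate traffic")
    if not 1 <= rdma_nic_count < len(team_nics):
        raise ValueError("rdma_nic_count must leave at least one NIC for other guests")

    rdma_nics = team_nics[:rdma_nic_count]
    shared_nics = team_nics[rdma_nic_count:]

    plan = {}
    rdma_seen = 0
    other_seen = 0
    for guest in guests:
        if guest.uses_rdma:
            plan[guest.name] = rdma_nics[rdma_seen % len(rdma_nics)]
            rdma_seen += 1
        else:
            plan[guest.name] = shared_nics[other_seen % len(shared_nics)]
            other_seen += 1
    return plan


if __name__ == "__main__":
    guests = [Guest("sql-vm", True), Guest("web-vm-1", False), Guest("web-vm-2", False)]
    for name, nic in plan_nic_affinity(guests, ["NIC1", "NIC2", "NIC3"]).items():
        print(f"{name} -> {nic}")
```

Applying such a plan is left to the Hyper-V management tooling (for example the Set-VMNetworkAdapterTeamMapping PowerShell cmdlet; check the documentation for your release before relying on it).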
What scenarios might want to use Guest RDMA today? There are several that come to mind, and they all share the following characteristics:
- They want low-latency access to network storage;
- They don’t want to waste CPU cycles on storage networking; and
- They are using SMB or one of the 3rd party solutions that runs on Windows Kernel RDMA.
So whether you are using SMB storage directly from the guest, or you are running an application that uses SMB (e.g., SQL) in a guest and want faster storage access, or you are using a 3rd party NVMe or other RDMA-based technology, you can use them with our Guest RDMA capability.
Finally, while High Performance Computing (HPC) applications rarely run in Guest OSs, some of our hardware partners are exposing the Network Direct Service Provider Interface (NDSPI), Microsoft’s user-space RDMA interface, in guests as well. So if your hardware vendor supports NDSPI (MPI), you can use that from a guest as well.
RDMA and DCB
RDMA is a great technology that uses very little CPU and has very low latency. Some RDMA technologies rely heavily on Data Center Bridging (DCB). DCB has proven to be difficult for many customers to deploy successfully. As a result, the view of RDMA as a technology has been affected by the experiences customers have had with DCB, and that’s sad. The product teams at Microsoft are starting to say more clearly what we’ve said in quieter terms in the past:
Microsoft Recommendation: While the Microsoft RDMA interface is RDMA-technology agnostic, in our experience with customers and partners we find that RoCE/RoCEv2 installations are difficult to get configured correctly and are problematic at any scale above a single rack. If you intend to deploy RoCE/RoCEv2, you should a) have a small scale (single rack) installation, and b) have an expert network administrator who is intimately familiar with Data Center Bridging (DCB), especially the Enhanced Transmission Selection (ETS) and Priority Flow Control (PFC) components of DCB. If you are deploying in any other context, iWarp is the safer alternative. iWarp does not require any configuration of DCB on network hosts or network switches and can operate over the same distances as any other TCP connection. RoCE, even when enhanced with Explicit Congestion Notification (ECN) detection, requires DCB/ETS/PFC and/or ECN to be configured on the network, especially if the scale of deployment exceeds a single rack. Tuning of these settings, i.e., the settings required to make DCB and/or ECN work, is an art not mastered by every network engineer.
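The recommendation above boils down to a two-condition check, sketched here in Python. The `Deployment` class and `recommend_rdma_transport` function are hypothetical names used only for illustration; this is a reading aid for the recommendation, not a Microsoft tool or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical summary of a planned RDMA deployment."""
    single_rack: bool     # True if the whole installation fits in one rack
    has_dcb_expert: bool  # True if an admin intimately familiar with DCB (ETS, PFC) is available


def recommend_rdma_transport(deployment):
    """Return (transport, reason) following the recommendation in the text."""
    # RoCE/RoCEv2 is only suggested when BOTH conditions hold.
    if deployment.single_rack and deployment.has_dcb_expert:
        return (
            "RoCE or iWarp",
            "single-rack scale with DCB expertise: RoCE is workable, iWarp needs no DCB",
        )

    # In any other context, iWarp is the safer alternative.
    reasons = []
    if not deployment.single_rack:
        reasons.append("deployment spans more than one rack")
    if not deployment.has_dcb_expert:
        reasons.append("no DCB expert available")
    return ("iWarp", "; ".join(reasons))


if __name__ == "__main__":
    transport, reason = recommend_rdma_transport(
        Deployment(single_rack=False, has_dcb_expert=True)
    )
    print(f"{transport}: {reason}")
```

Note that the check is deliberately conservative: missing either condition is enough to fall back to iWarp, which mirrors the "any other context" wording of the recommendation.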
RoCE vendors have been actively working to reduce the complexity associated with RoCE deployments. See the list of resources (below) for more information about vendor-specific solutions. Check with your NIC vendor for their recommended tools and deployment guidance.
- Jose Barreto’s 100Gb/s RDMA demo
- Claus Jorgensen’s “S2D on Cavium 41000” Blog (iWarp – RoCE comparison)
- Microsoft’s sample switch DCB configurations for RoCE
- Mellanox’s RDMA/RoCE Community page
- Your vendor’s User Guides and Release Notes for your specific network adapter