Synthetic Accelerations in a Nutshell – Windows Server 2019

Hi folks,

Dan Cuomo here for our final installment in this blog series on synthetic accelerations, covering Windows Server 2019.  In Server 2019, we took what we learned and expanded on the work that began in Server 2012 R2 with Dynamic VMQ and in Server 2016 with VMMQ to bring you Dynamic VMMQ (d.VMMQ).

The multi-release journey is designed to achieve one primary goal: improving your (and your tenants') networking experience in the Software Defined Data Center.  This may come in the form of reducing CPU processing for traffic and/or ensuring a smooth and consistent experience for the virtual machines on your host, which ultimately means happy tenants running more virtual machines (and no midnight calls to troubleshoot the all-too-common “slow-down”).

Public Service Announcement: Most of what you see below will not apply if you're using an LBFO team.  Microsoft recommends using Switch Embedded Teaming (SET) as the default teaming mechanism whenever possible, particularly when using Hyper-V.

Before we get to the good stuff, here are the pointers to the previous blogs:

Dynamic VMMQ

As a quick refresher, Virtual Receive Side Scaling (on the host) creates an indirection table which enables packets to be processed by multiple, separate processors.  The distribution of these packets to separate processors can be done in the OS, or offloaded to the NIC.  While the indirection table is always established by the OS, we can offload the packet distribution to the NIC; when offloaded to the NIC, we call this VMMQ.
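To make the indirection table idea concrete, here is a minimal Python sketch.  It is a toy model, not how Windows or any NIC implements VRSS or VMMQ: the table size of 16, the use of zlib.crc32 in place of the real RSS hash, and the function names build_indirection_table and processor_for_flow are all assumptions made for illustration.

```python
import zlib

def build_indirection_table(processors, table_size=16):
    """Spread the given processors round-robin over the table entries."""
    return [processors[i % len(processors)] for i in range(table_size)]

def processor_for_flow(flow, table):
    """Hash a flow tuple and index the table with the hash modulo its size."""
    key = repr(flow).encode()
    # crc32 stands in for the real RSS hash (an assumption for this sketch).
    return table[zlib.crc32(key) % len(table)]

# Hypothetical processor list and flow, chosen only for the example.
table = build_indirection_table([2, 4, 6, 8])
flow = ("10.0.0.5", 49152, "10.0.0.9", 445)
cpu = processor_for_flow(flow, table)
print(cpu)
```

Packets from the same flow always hash to the same table entry, so they land on the same processor, while different flows spread across the processors listed in the table.  Whether that lookup runs in the OS or on the NIC is the difference the paragraph above describes.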

Originally, we enabled the dynamic updating of the indirection table, called Dynamic VMQ, in 2012 R2.  However, in part due to the rearchitected design in 2016 to bring VMMQ, Dynamic VMQ was not available in Windows Server 2016.

Now in Windows Server 2019 we can dynamically remap VMMQ's placement of packets onto different processors.  We had three primary goals:

  • Optimize host efficiency
  • Automatic tuning of the indirection table (so the VM can meet and maintain the desired throughput)
  • Handling of bursty workloads

I'm starting to think those midnight network slow-downs may be a thing of the past!

Optimizing Host Efficiency

When network throughput is low, Dynamic VMMQ enables the system to coalesce traffic received on a virtual NIC onto as few CPUs as possible; we call this queue packing because we're packing the queues onto as few CPU cores as are necessary to sustain the workload.  Queue packing is more efficient for the host because the system would otherwise need to manage the distribution of packets across more CPUs; the more CPUs are engaged, the more the system must work to ensure all packets are properly handled.
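The packing decision can be sketched in a few lines of Python.  This is only a toy model of the idea: the per-core capacity figure, the load numbers, and the names cpus_needed and pack_queues are hypothetical, and the actual Dynamic VMMQ algorithm is not described in this article.

```python
import math

def cpus_needed(load_mbps, per_cpu_capacity_mbps, max_cpus):
    """Smallest number of cores that can sustain the load (at least one)."""
    needed = math.ceil(load_mbps / per_cpu_capacity_mbps)
    return max(1, min(needed, max_cpus))

def pack_queues(num_queues, processors, load_mbps, per_cpu_capacity_mbps):
    """Map every queue onto only as many processors as the load requires."""
    count = cpus_needed(load_mbps, per_cpu_capacity_mbps, len(processors))
    active = processors[:count]
    return [active[q % len(active)] for q in range(num_queues)]

# Assumed capacity of 5000 Mbps per core, purely for illustration.
low = pack_queues(4, [2, 4, 6, 8], load_mbps=500, per_cpu_capacity_mbps=5000)
high = pack_queues(4, [2, 4, 6, 8], load_mbps=18000, per_cpu_capacity_mbps=5000)
print(low)   # all four queues packed onto a single core
print(high)  # queues spread across all four cores
```

At low load every queue shares one core, so the host manages a single busy processor; as load grows, the same queues fan out across more cores.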

The picture below shows a virtual NIC receiving a low amount of network traffic.  You can see we're using the performance counter Virtual Switch Processor > Packets from External/sec and there is one bar for each CPU core engaged.  Only one CPU core (the green bar) is processing packets destined for a virtual NIC.  The system has coalesced or packed all the queues onto one CPU core as was necessary to sustain the workload.


Here's a video showing the Dynamic Coalescing.  Note, the video is sped up to show the process occurring a bit quicker than normal.

Automatic Tuning of the System

After a hard day's work, you head home for the day.  Little did you know, your CIO is a night-owl and a few hours later begins working right as some backups begin on the file servers hosting the user profile.

I think we all know the story that's about to unfold.  Your CIO calls in the support team after-hours because of the terrible performance.  The following day, you'll be asked to root cause what happened and develop an action plan to ensure the CIO never has this experience again.  You think to yourself:

“This would be about the best place in the entire world to work, if it weren't for all these complainers…” 😉

One of the challenges with VMMQ in Windows Server 2016 (Static VMMQ) is that the indirection table – the assignment of a VMQ to be processed by a specific processor – cannot be updated once established.

If another workload (for example VM B) starts receiving more traffic and one of its queues is mapped to the same processor as a queue from VM A, one of them may suffer.  This is what happened to your CIO: the queues for the file server hosting his/her user profile were on the same processors as another workload performing backups.

Note: I've seen folks try to avoid this by preventing a NIC from using the same processors used by other NICs (overlapping).  In practice, we've seen this provide very little value, if any, with SET teams.  First, most people misconfigure this.  Second, even if they have it configured correctly, they're forced into constraining their adapters to fewer processors, which only compounds the original problem.  We do not recommend changing the default RSS Processor Array (which governs the indirection table creation) unless directed by Microsoft Support.

With Windows Server 2019 and Dynamic VMMQ, we can now automatically move queues on an overburdened processor to other processors that aren't doing as much work.  Now workloads will have a more consistent and performant experience.

In the following video (sorry, no sound), we show a running network workload.  Eventually we start a new process that competes for and consumes the CPU that is processing packets.  In Windows Server 2016, the virtual machine would start receiving fewer packets, affecting the throughput into the VM and your sleep patterns as your CIO calls you into the office to troubleshoot.

However, in this video you can see that the system dynamically updates the indirection table and moves the processing of network traffic from CPU3 to an available processor (CPU1) when another workload starts consuming the CPU cycles.  This allows the VM to continue receiving the same amount of traffic despite having a competing workload.
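The behavior in the video can be approximated with a small Python sketch.  Treat it as a toy model under stated assumptions: the 80% utilization threshold, the queue names, and the rebalance function are hypothetical, and the sketch does not recompute CPU load after a move, which a real scheduler would have to do.

```python
def rebalance(queue_to_cpu, cpu_load, threshold=0.8):
    """Move queues off any CPU above the threshold to the least-loaded CPU.

    queue_to_cpu: dict of queue name -> CPU number (the toy indirection table)
    cpu_load: dict of CPU number -> utilization between 0.0 and 1.0
    """
    updated = dict(queue_to_cpu)
    for queue, cpu in queue_to_cpu.items():
        if cpu_load[cpu] > threshold:
            target = min(cpu_load, key=cpu_load.get)
            # Only move if the least-loaded CPU actually has headroom.
            if cpu_load[target] <= threshold:
                updated[queue] = target
    return updated

# CPU3 is saturated by a competing workload; CPU1 is nearly idle.
table = {"vmA-q0": 3, "vmA-q1": 5}
load = {1: 0.10, 3: 0.95, 5: 0.40}
result = rebalance(table, load)
print(result)
```

In this example the queue on CPU3 moves to CPU1 and the queue on CPU5 stays put, mirroring the CPU3-to-CPU1 move described above.  If every processor is above the threshold, the table is left unchanged.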

Optimizing for Bursty Workloads

When a virtual NIC is idle, it doesn't need any receive queues.  However, if no queues are allocated (or perhaps only a bare minimum), and a burst of traffic comes in destined for that virtual NIC, it won't be able to process all the data because we can't just allocate queues all willy-nilly.  Willy-nilly is bad…

To ensure that we can meet an immediate burst of traffic, we pre-allocate queues for an idle workload.  We call this queue parking (not to be confused with core parking).

You can see the allocation of queues across a receive processor for a particular virtual NIC using the perfmon counter Virtual Network Adapter VRSS > Instance (per virtual NIC) > Receive Processor


It's important to note that there are always 16 entries shown and if you look closely, you'll note that there are two bars of the same height.  You can control how many receive queues per processor for all virtual NICs (although we recommend that you stick with the defaults) by modifying the MaxProcessors on the physical adapter.


The setting on the physical adapters caps the processors to be used by a virtual NIC.

If you only want to cap certain virtual NICs, then instead of setting the value on the physical adapters, just set it on the virtual NIC using Set-VMNetworkAdapter -VRSSMaxQueuePairs.

Then review the updates to the vNIC as shown below.

Summary of Requirements

As you can see, the requirements to implement and manage the feature are greatly reduced.

  • Install latest drivers and firmware – Dynamic VMMQ is available on Premium Certified devices with non-inbox drivers.
  • Processor Array engaged by default – CPU0.  This was originally changed in 2012 R2 to enable VRSS (on the host) and you are no longer required to change the processor array (as is also the case in 2016).
  • Configure the system to avoid CPU0 on non-hyperthreaded systems and CPU0 and CPU1 on hyperthreaded systems (e.g. BaseProcessorNumber should be 1 or 2 depending on hyperthreading).  While not explicitly required any longer, as the dynamic algorithm will move workloads away from a burdened core, it is still a best practice to do this in case of a driver bug.
  • Configure the MaxProcessorNumber to establish that an adapter cannot use a processor higher than this.  We recommend you let the system manage this now.
  • Configure MaxProcessors to establish how many processors out of the available list a NIC can spread VMQs across simultaneously.  This is unnecessary due to the enhancements in the default queue implemented in Windows Server 2016.  You may still choose to do this if you're limiting the queues as a rudimentary QoS mechanism as noted earlier.
  • Test customer workload

Summary of Advantages

  • All the benefits of VMMQ from Windows Server 2016 (highlighted in the previous article)
  • Host efficiency is optimized – Through queue packing
  • Automatic tuning of the indirection table allows a VM to maintain stable throughput by reallocating queues to available processors
  • Handling of bursty workloads – Through queue parking

Summary of Disadvantages

  • Requires a driver update and Premium Certified device

I hope you have enjoyed this series on synthetic accelerations and found it useful.  As you can see, we've steadily worked towards reducing the setup complexity, improving the stability, and increasing the performance for your virtualized workloads.  Previously you had to set up complicated adapter schemes, tune the system, avoid processors, and more…  Now you simply install Windows, test, and monitor.

Please let us know in the comments if you have any questions!



This article was originally published by Microsoft's ITOps Talk Blog. You can find the original article here.