In our first article we introduced Network HUD as a new feature that proactively identifies and remediates operational networking issues on Azure Stack HCI. We also discussed Network HUD’s unique on-premises cloud-service model, which enables us to bring new features and capabilities (more than just bug fixes) rapidly through what we call “content updates.”
Well, it’s official. The November content update has arrived! So, in this article, we’ll dive into the new capabilities that Network HUD gains with the November content update.
This content update (version 1.0.0 and later) includes:
- Detection of PCIe bandwidth oversubscription
- Detection of an unstable adapter that is frequently disconnecting
- Detection of an unstable adapter that is frequently resetting
- Detection of inbox drivers or out-of-date drivers
- Detection of missing Network ATC intent types
All alerts can be found anywhere cluster health faults light up, including Insights in the Azure Portal or Windows Admin Center, as shown here.
How to get the latest content
Getting the latest content is easy. First, make sure Network HUD is installed on each node in the cluster with:
Install-WindowsFeature -Name NetworkHUD
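If you manage several nodes, a remoting loop is one way to install the feature everywhere at once (a minimal sketch; assumes the cluster already exists and PowerShell remoting is enabled):

```powershell
# Install the Network HUD feature on every node in the cluster.
# Assumes you have admin rights on each node and remoting is enabled.
$nodes = Get-ClusterNode | Select-Object -ExpandProperty Name

Invoke-Command -ComputerName $nodes -ScriptBlock {
    Install-WindowsFeature -Name NetworkHUD
}
```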
Next, run the command to download the November content:
Install-Module -Name Az.StackHCI.NetworkHUD -Force
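To confirm what landed, one way to check the content version on every node is to query the installed module (a sketch; assumes the module was installed via PowerShellGet as shown above):

```powershell
# Check the Network HUD content version installed on each cluster node.
$nodes = Get-ClusterNode | Select-Object -ExpandProperty Name

Invoke-Command -ComputerName $nodes -ScriptBlock {
    Get-InstalledModule -Name Az.StackHCI.NetworkHUD |
        Select-Object Name, Version
}
```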
As mentioned in our last blog, we do expect to improve this experience in the future, so you won’t need to manually update content.
Here’s a bit more on each of the detection capabilities included in the content update, including one video sneak peek!
Detecting PCIe bandwidth oversubscription
If you have built your own PC, you know that PCI Express slots come in varying sizes, each able to transfer a specific amount of data at one time. With adapter speeds continuing to increase and multiple ports on the same card, it’s becoming commonplace to have network cards where the combined link speed of the ports exceeds the PCIe bandwidth available to the card.
Take for instance the following system configuration:
- pNIC01 and pNIC02 are in Slot 6 (dual-port NIC in the same slot)
- pNIC11 and pNIC12 are in Slot 7 (dual-port NIC in the same slot)
- Both adapters are 100 Gbps with an aggregate bandwidth of 200 Gbps in each PCIe slot
- Slot 6 is a PCIe x16
- Slot 7 is a PCIe x8
If you do the math, slot 7 can only transfer 128 Gbps (assuming PCIe 4.0 at 16 GT/s per lane; the realistic maximum is significantly lower given various protocol overhead, but what’s a few gigabits between PCIe friends? 😊). Not using 100 Gbps adapters? You’re not in the clear. This can happen in a variety of configurations and speeds, even with 25 or 50 Gbps adapters.
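The arithmetic above can be sketched as a quick back-of-the-napkin check (a hedged sketch: the 16 Gbps-per-lane figure assumes PCIe 4.0 raw signaling, and the slot and NIC values mirror the example configuration):

```powershell
# Rough PCIe oversubscription check.
# Assumes PCIe 4.0 at ~16 Gbps raw per lane; real throughput is lower.
$gbpsPerLane = 16

$slots = @(
    @{ Name = 'Slot 6'; Lanes = 16; NicPortsGbps = @(100, 100) }  # pNIC01, pNIC02
    @{ Name = 'Slot 7'; Lanes = 8;  NicPortsGbps = @(100, 100) }  # pNIC11, pNIC12
)

foreach ($slot in $slots) {
    $slotGbps = $slot.Lanes * $gbpsPerLane
    $nicGbps  = ($slot.NicPortsGbps | Measure-Object -Sum).Sum

    if ($nicGbps -gt $slotGbps) {
        "$($slot.Name): OVERSUBSCRIBED - NIC ports total $nicGbps Gbps but the slot provides only $slotGbps Gbps"
    }
    else {
        "$($slot.Name): OK - $slotGbps Gbps available for $nicGbps Gbps of NIC ports"
    }
}
```

In this example, slot 6 (16 lanes, 256 Gbps) comfortably covers its 200 Gbps of NIC ports, while slot 7 (8 lanes, 128 Gbps) is flagged.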
Here’s a video that demonstrates the issue detected by Network HUD:
Detecting an unstable adapter that is resetting or disconnecting
It happens. You see some strange behavior on your servers, but you don’t think too much about it. Next the support calls start flowing in with claims like, “my app is slower than it used to be…”, “sometimes my VMs seem to stop sending traffic…”, or other anecdotal reports.
When you start to investigate, you might run Get-NetAdapter and see all adapters are connected.
Leaving no stone unturned, you recheck your adapters a little while later. Now one of the adapters in the team shows a disconnected status.
To your disbelief, a minute later the status reports Up again!
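If you want to catch the flap in the act, a simple polling loop is one option (a hedged sketch; the 10-second interval and 30 iterations are arbitrary choices, not a recommendation):

```powershell
# Poll adapter status for ~5 minutes to catch a flapping NIC in the act.
1..30 | ForEach-Object {
    Get-NetAdapter |
        Where-Object Status -ne 'Up' |
        Select-Object @{ n = 'Time'; e = { Get-Date } }, Name, Status, LinkSpeed

    Start-Sleep -Seconds 10
}
```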
This is a classic “flapping NIC” scenario (which is unfair to the NIC because it might not be its fault!). If the physical link disconnects for any reason (cable, switchport, or NIC issue), the symptoms may be the same as the reports mentioned above. A NIC failover (due to instability) can manifest to users as higher latency (slower VMs/containers), intermittent connectivity, or inconsistent application performance as the VMs on that host ask for more throughput than the remaining adapters can provide.
To troubleshoot the issue, you crawl through the mire of event logs, only to find that this has been going on for quite some time!
While this may only happen intermittently, Network HUD tracks these events (disconnections and resets). If it determines they are occurring too frequently, it fires a cluster health fault to alert you to the issue. If enough stable adapters remain, Network HUD will attempt to improve the stability of the server by removing the unstable adapter from service.
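You can also query these cluster health faults directly from PowerShell via the Health Service (a sketch; the selected property names may vary slightly by OS version):

```powershell
# List current cluster health faults, including those raised by Network HUD.
Get-HealthFault |
    Select-Object FaultType, Severity, Reason, RecommendedActions
```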
Here’s a video that demonstrates this issue detected by Network HUD:
Detecting inbox or out-of-date drivers
Network drivers can be a real challenge to manage. Get the right driver for your adapter, and it brings performance, stability, and feature improvements that make your applications fly. Get the wrong driver, and it’s the support queues following an all-nighter trying to bring the cluster back up.
Here are a couple of ways Network HUD helps you identify the right drivers.
Inbox drivers are ONLY intended to provide basic connectivity to the internet so you can download an updated driver if needed. They do not include the advanced functionality that improves the customer experience. For this reason, inbox drivers are NOT supported for production use.
Network HUD will automatically identify and alert you when your adapters are using an inbox driver rather than you having to check all of your cluster nodes.
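If you’d like to spot-check this yourself, one approach is to look at the driver provider each adapter reports; inbox drivers are typically published by Microsoft (a hedged sketch, not Network HUD’s exact detection logic):

```powershell
# Physical adapters whose driver was published by Microsoft are
# typically running the inbox driver.
Get-NetAdapter -Physical |
    Where-Object DriverProvider -eq 'Microsoft' |
    Select-Object Name, InterfaceDescription, DriverProvider, DriverVersionString
```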
Over time, stable drivers become less stable. As strange as this may sound, it’s caused by updates to other software on your system. As Microsoft updates the operating system to address security or stability bugs, new drivers must be released in coordination.
How many times have you heard the phrase “update your driver” to solve a hardware problem? This IS NOT because the previous driver was inherently unstable; updates included in the new driver allow the device to function properly with the most recent codebase.
Rest assured, Network HUD will remind you when it’s time to update your drivers. Rather than silently allow your systems to head towards an outage, it will inform you as your driver begins to age. First, we’ll send you a warning and give you some time to make the necessary upgrades. After a while, we’ll sound the alarm bells to ensure you know the risks.
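If you want a rough self-audit in the meantime, you could compare each adapter’s driver date against a threshold (a sketch; the two-year cutoff is purely illustrative, not Network HUD’s actual policy):

```powershell
# Flag physical adapters whose driver is older than roughly two years.
# DriverDate is reported as a date string, so cast before comparing.
$cutoff = (Get-Date).AddYears(-2)

Get-NetAdapter -Physical |
    Where-Object { [datetime]$_.DriverDate -lt $cutoff } |
    Select-Object Name, DriverDate, DriverVersionString
```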
Detection of missing Network ATC intent types
In our introduction blog, we highlighted that Network HUD is cluster aware and can manage network stability across the entire cluster. In other words, nodes in the cluster understand what issues other nodes are experiencing and can compensate for them.
This also lets us run only the needed tests on your adapters. Rather than run every test on every adapter, Network HUD leverages Network ATC to understand contextually the intent types (types of traffic) those adapters were configured for, and runs only those tests. This conserves resources on your system and provides clearer alerts.
If you’re missing one or more intent types, Network HUD will let you know so you can correct the problem and better protect your system.
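To review which intent types are configured today, Network ATC’s own cmdlet can help (a sketch; the selected properties assume a recent Network ATC version):

```powershell
# List configured Network ATC intents and the traffic types they cover.
Get-NetIntent |
    Select-Object IntentName, IsManagementIntentSet, IsComputeIntentSet, IsStorageIntentSet
```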
This month’s content update focused on identification of performance and stability issues that affect the workloads running on Azure Stack HCI. I hope you’ve found this blog and videos useful. Give the new content a try and let us know if you have any questions!
Thanks again for reading!
Dan “support case eliminator” Cuomo