It’s 5 PM on Friday – the weekend will soon be here. You do one last sweep of your inbox before signing off when your cellphone (the bat phone) rings. Someone didn’t get the memo about the unwritten operational rule of IT Administration: Never make changes on a Friday. The phone itself seems terrified with every ring. A panicked voice on the other end says, “I can’t ping my VM.” Pandemonium ensues…
Today we’re going to talk about a new, free, downloadable tool that can help.
Networks are complex. There are many different vendors, with many different configurations – even your network team might be different from your Server/HCI team. In the scenario above, everything may look the same on your hosts, but it’s hard to know whether the issue is caused by the host or the physical network without being able to see the physical network configuration.
If LLDP is enabled on your switchports, it can be an easy task to quickly validate some of the physical network settings. LLDP, or Link Layer Discovery Protocol, is an IEEE standard (802.1AB) that allows networked devices to advertise their configuration (among other things) to neighboring devices. To Windows and Azure Stack HCI, the neighboring device is the physical switchport to which the host is connected (via the NIC). LLDP’s Wikipedia page has a nice intro where you can learn a bit more.
With LLDP, switchports can advertise the VLAN, MTU, and DCB configuration, among other information, all of which can be critical for Azure Stack HCI systems. However, not all switches support advertisement of the same information. Without getting into the details (which you can read more about on the Wikipedia page linked above), the switch determines how much information you can view.
To improve Azure Stack HCI reliability, where we have a purpose-built OS, we have begun to require that switches support LLDP. Most importantly, we require that they support some of the “organizationally specific Custom TLVs.” That is a fair amount of jargon, but it boils down to advertising settings like VLAN, MTU, etc. In the picture below, you can see the Organizationally Specific TLVs (type 127) along with the MTU and PFC configuration of the switchport to which this NIC is attached.
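To make the jargon a little more concrete, here is a minimal Python sketch of how an LLDP payload breaks down into TLVs. It is not how the DataCenterBridging module parses the data; it only illustrates the wire format from IEEE 802.1AB: each TLV starts with a 16-bit header (7-bit type, 9-bit length), and a type 127 TLV carries a 3-byte OUI plus a 1-byte subtype. The helper names (`iter_tlvs`, `decode_org_specific`, `make_tlv`) and the sample bytes are ours, and mapping the IEEE 802.1 Port VLAN ID TLV to a `NativeVLAN` label is an assumption made for illustration.

```python
import struct

IEEE_8021_OUI = bytes.fromhex("0080c2")  # IEEE 802.1 organizationally specific TLVs
IEEE_8023_OUI = bytes.fromhex("00120f")  # IEEE 802.3 organizationally specific TLVs


def iter_tlvs(payload: bytes):
    """Yield (tlv_type, value) pairs until the End of LLDPDU TLV (type 0)."""
    offset = 0
    while offset + 2 <= len(payload):
        (header,) = struct.unpack_from("!H", payload, offset)
        tlv_type = header >> 9      # upper 7 bits
        length = header & 0x01FF    # lower 9 bits
        value = payload[offset + 2 : offset + 2 + length]
        offset += 2 + length
        if tlv_type == 0:
            return
        yield tlv_type, value


def decode_org_specific(value: bytes):
    """Decode two well-known type 127 TLVs; return None for anything else."""
    if len(value) < 4:
        return None
    oui, subtype, info = value[:3], value[3], value[4:]
    if oui == IEEE_8021_OUI and subtype == 1 and len(info) == 2:
        # Port VLAN ID; labeled NativeVLAN here as an assumption
        return ("NativeVLAN", struct.unpack("!H", info)[0])
    if oui == IEEE_8023_OUI and subtype == 4 and len(info) == 2:
        # Maximum Frame Size
        return ("FrameSize", struct.unpack("!H", info)[0])
    return None


def make_tlv(tlv_type: int, value: bytes) -> bytes:
    """Build a TLV so the sketch has something to parse (illustrative helper)."""
    return struct.pack("!H", (tlv_type << 9) | len(value)) + value


# Hand-built sample payload, not a capture from a real switch.
sample = (
    make_tlv(127, IEEE_8021_OUI + bytes([1]) + struct.pack("!H", 1133))
    + make_tlv(127, IEEE_8023_OUI + bytes([4]) + struct.pack("!H", 9236))
    + make_tlv(0, b"")
)

decoded = {}
for tlv_type, value in iter_tlvs(sample):
    if tlv_type == 127:
        field = decode_org_specific(value)
        if field is not None:
            decoded[field[0]] = field[1]

print(decoded)  # {'NativeVLAN': 1133, 'FrameSize': 9236}
```

A switch that does not send a given TLV simply leaves that key out, which is why the amount of information you see depends on the switch.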
Note: We intend to grow the list of required TLVs over time as we work with network vendors. Check the Custom TLVs documentation link just above for updates.
Help! I need to buy a switch for Azure Stack HCI!
We document some Network Switches for Azure Stack HCI that the vendor has verified meet the requirements – the list will grow as we hear from the various switch vendors. Talk to your Network Vendor to see if your switch meets the requirements.
Having this information at your disposal can help you answer several critical questions, particularly when you want to get started on your weekend:
- Did you misconfigure your host or is it the physical network?
- Did the network engineer add the necessary configuration to the correct switchport?
- Is the switchport configuration the same on each team member?
- Is the switchport configuration the same between each cluster node?
Help! My Network Admin says LLDP is insecure!
LLDP does not require credentials to receive information, but that doesn’t mean it’s insecure. LLDP allows the administrator of the network device to choose which information (TLVs), if any, is sent to neighbors, with the intention that this information can be used for diagnostic purposes.
Back to our IT hero for a moment. How can you quickly determine whether the issue is on the switch or you missed some settings on your host?
An LLDP enabled switchport will periodically (typically every 30 seconds) send messages to its neighbors, including the juicy information you may want as an IT Administrator to determine whether your physical host configuration matches that of the physical network.
Retrieving this information is traditionally a bit of a challenge; however, there is a tool that makes this simple.
Note: If you’re not in control of your network switches, make sure you ask your network team to enable LLDP and any “organizationally specific TLVs” that the switch supports.
Install the Module
First install the DataCenterBridging module from the PowerShell gallery. This module contains a few goodies and has been updated to include the functions to parse the LLDP data from the switch.
There are four available commands at the time of writing:
Getting the Physical Switch Information
Let’s start off by trying to get the LLDP information using Get-FabricInfo. With each of the commands you can specify the SET Switch or individual interface names (using the InterfaceName parameter). In this case, we are specifying the SET Switch whose name starts with Converged. The cmdlet finds all the physical NICs attached to the switch and looks for available LLDP messages on each interface.
At first run, it probably will not find anything. The cmdlet tells you to run Test-FabricInfo to help identify the problem.
Running Test-FabricInfo identifies a few problems that we need to resolve.
You can use Enable-FabricInfo to resolve all the problems in one shot. This will install the feature and ensure that the LLDP agent is enabled on the underlying interfaces, etc.
Note: Want to know everything this is doing? Look at the code on GitHub!
Next, run Test-FabricInfo again to determine if all the requirements are met. You can see things got a little better. Only two issues remain: we didn’t find any LLDP packets for the interfaces in the SET switch.
Wait about 30 seconds – the typical interval at which a switchport sends LLDP packets – and try again. If the test still fails with the messages above, contact your network administrator and ensure that LLDP is enabled on the switchports connected to your team members.
If LLDP is enabled on your physical switch, you will see the output below, which indicates that Test-FabricInfo found an LLDP message from the physical switch for each member of the Converged team.
Now we are ready to run Get-FabricInfo. Make sure you put the output into a variable, so you can inspect it. In this case, we add everything to the $FabricInfo variable which has an object for every team member.
You can walk the individual team members to see information collected on the Windows or HCI host (under InterfaceDetails) or the physical switch (Fabric) to which the NICs are connected.
For example, here’s a look at the IP and Subnet information on pNIC01. We collect this so it’s easy to compare to the information collected from the switch. As you can see, we have the IP Address, Subnet, VLAN, etc.
In this case, we have a virtual switch on the host and, as part of the storage configuration on this system, we have a team-mapped host vNIC. The IP, Subnet, etc. are being displayed from that team-mapped host vNIC. If the team member isn’t part of a virtual switch, we’ll display the configuration on the physical NIC.
Now let’s take a look at what the switch sent us and what we can learn about the physical network (as mentioned before, the information will vary based on what the switch supports):
- NativeVLAN: 1133 – Untagged traffic will be sent over this VLAN.
- VLANID: Info Not Provided… This field lists the trunked VLANs that can be carried on this switchport. The switch below did not include this information in the packet sent to the host.
- FrameSize: 9236 – The MTU configuration of the physical NIC and virtual NICs should not exceed the switch’s value or traffic will be fragmented (or in some cases dropped).
- PFC is enabled on Priority3 – Data needing lossless communication (e.g. RoCE-based RDMA) should use Priority 3.
From this information, we can determine that VLAN 711 (on the storage vNIC) is not using the native VLAN, and the switch is not showing the trunked VLANs in LLDP either. This leads to two conclusions:
- We should check the switch configuration or contact our network administrator if network connectivity is not available on pNIC01, because we could not confirm that VLAN 711 is carried on this switchport.
- We should ask our network administrator to find us a switch that does advertise this information so that we can identify this problem ourselves (and without ruining their weekend).
Here’s the same view but from another switch. This switch did not send the PFC information, but it does show the VLAN IDs available to the host (1, 11, 12, and 40).
From here, we can tell that VLAN 711 is not available on the physical network which is at least one obvious reason why there may not be network connectivity on this link.
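The reasoning applied to both switches can be written down as a small decision function. This is only a sketch of the logic described in this post, not something Get-FabricInfo does for you; the function name `check_vlan` and its return labels are ours, and it treats a host VLAN that equals the native VLAN as reachable, which is a simplification.

```python
from typing import Iterable, Optional


def check_vlan(
    host_vlan: int,
    native_vlan: Optional[int],
    trunked_vlans: Optional[Iterable[int]],
) -> str:
    """Classify whether the switchport is known to carry the host's VLAN.

    Pass None for any value the switch did not advertise over LLDP.
    """
    if trunked_vlans is not None and host_vlan in set(trunked_vlans):
        return "advertised"  # the switch explicitly trunks this VLAN
    if native_vlan is not None and host_vlan == native_vlan:
        return "native"      # matches the switchport's native VLAN
    if trunked_vlans is None:
        return "unknown"     # the switch gave us nothing to compare against
    return "missing"         # the switch listed its VLANs and ours is not there


# First switch: native VLAN 1133, trunked VLANs not provided.
first = check_vlan(711, native_vlan=1133, trunked_vlans=None)

# Second switch: advertises VLANs 1, 11, 12, and 40.
# Its native VLAN is not stated in this post, so we pass None.
second = check_vlan(711, native_vlan=None, trunked_vlans=[1, 11, 12, 40])

print(first, second)  # unknown missing
```

An "unknown" result means you still need the network administrator; a "missing" result tells you exactly what to ask them for.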
Some of the other problems on the physical network that you can easily identify:
- Missing VLANs
- Misconfigured Jumbo Frames
- Misconfigured PFC settings
- Topology problems e.g. cabled to the wrong switch (check ChassisGroups for this information)
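As a rough illustration of how a couple of these checks could be automated once the switch data is in hand, the Python sketch below compares the advertised settings of each team member and checks a host jumbo frame setting against the advertised frame size. The function names, the dictionary layout, the pNIC02 values, and the host MTU of 9014 are all hypothetical; only the pNIC01 values come from the example earlier in this post.

```python
def find_mismatches(members: dict) -> dict:
    """Return each advertised field whose value differs between team members."""
    field_names = sorted({name for info in members.values() for name in info})
    mismatches = {}
    for name in field_names:
        values = {nic: info.get(name) for nic, info in members.items()}
        if len(set(values.values())) > 1:
            mismatches[name] = values
    return mismatches


def frame_size_ok(host_mtu: int, switch_frame_size: int) -> bool:
    """The host MTU should not exceed the frame size the switchport advertises."""
    return host_mtu <= switch_frame_size


team = {
    "pNIC01": {"NativeVLAN": 1133, "FrameSize": 9236},
    # Hypothetical second team member whose switchport lacks jumbo frames.
    "pNIC02": {"NativeVLAN": 1133, "FrameSize": 1514},
}

mismatches = find_mismatches(team)
print(mismatches)  # {'FrameSize': {'pNIC01': 9236, 'pNIC02': 1514}}

# Hypothetical host jumbo frame setting of 9014 bytes.
jumbo_ok = {nic: frame_size_ok(9014, info["FrameSize"]) for nic, info in team.items()}
print(jumbo_ok)  # {'pNIC01': True, 'pNIC02': False}
```

The same comparison works between cluster nodes: collect the advertised values per node and look for any field that differs.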
Reminder: The information displayed is dependent on the switch’s capabilities. If the switch is unable to provide us with a certain TLV, we display the text “Information Not Provided By Switch.” If you see this message, you should work with your network administrator to identify if the information can be included.
Get-FabricInfo allows you to answer several questions about the physical network configuration that may come in handy when troubleshooting. Is the physical network set up as I expected? Is the configuration the same between cluster nodes? All of this and more can be answered if your switch supports LLDP and you’re running Windows or Azure Stack HCI.
Hopefully that Friday afternoon call isn’t quite so scary anymore!
Thanks for reading,
Dan “weekend warrior” Cuomo
Thank you for these valuable insights and for sharing this. Honestly, though, performing these tasks and troubleshooting would normally keep the IT Administrator in the office until 8:00 PM on Friday, because from experience you find one issue, then another, and so on. The good news is that with this tool, the IT Administrator might be able to leave the office by 8:00 PM instead of being forced to work over the weekend.
I hope that as a next step there will be a nice GUI-based tool, something in the style of Operations Manager, where the administrator could quickly identify and fix issues.
© Microsoft. This article was originally published by Microsoft's Networking Blog. You can find the original article here.