Windows Server 2025 Storage Performance with Diskspd

Windows Server 2025 is the most secure and performant release yet! Download the evaluation now!

Looking to migrate from VMware to Windows Server 2025? Contact your Microsoft account team!

Hi Folks – Dan Cuomo here to talk about some improvements in Diskspd measurement and the improvements you'll see in Windows Server 2025 performance.

If you manage on-premises servers, you know that one of the final tests you run before going to production is a performance test. You want to ensure that when you migrate to that host or install on that machine, you're going to get the expected IOPS, the expected latency, or whatever other metrics you deem important for your business's workloads.

So, after all the group policies have been applied, rules are set, agents are installed and configured (or anything else you do in your deployment playbook), you download Diskspd, NTTTCP, and other performance testing tools you use to test this server compared to your baseline (if you don't do this, you should be!).

Having this performance baseline allows you to answer questions like, “Is this ready for production?” or “Is my VM performing as expected on this hardware?” Without a solid performance baseline, you simply cannot answer these questions with confidence. In Azure, we operate some of the most performance-demanding workloads in the world, so it is equally important for Microsoft to understand the performance of our servers. To do this, teams across Microsoft use Diskspd, our in-house developed and publicly available storage measurement tool. We continually improve Diskspd's measurement capability so both you and our internal Microsoft teams can be confident and informed as you're running your Windows Server workloads.

In this article, we'll discuss two significant improvements (known as Batched Completions and Look-a-sides) in Diskspd measurement and what you need to know as a result. But before we begin, let's put your mind at ease. Nothing is getting worse!

To that end, you may be wondering about the genesis of these improvements. Diskspd is being updated to handle modern workloads and hardware like NVMe. Our storage stack in Windows Server 2025 was also updated to leverage advances in NVMe storage (you can hear more about the storage performance improvements in Windows Server 2025 here and here)! During our testing of these capabilities, we improved our methods of latency measurement and found that we were now hitting the disk device limits when using Windows Server 2025!

The changes outlined in this article are available in Diskspd 2.2 and later. Download now!

New: Batched Completions

First, some background. When you start Diskspd, you specify the -o parameter, which sets the number of outstanding I/O requests to keep “in-flight.” If you specify -o 1, for example, Diskspd issues one I/O, waits for its completion, then issues another. The higher the number of outstanding I/Os, the more demanding the test is on the physical hardware.

Let's use an analogy to understand how Diskspd measurement accuracy is improved with batched completions.

It's that time of the day again – time to check the mailbox. You walk to the mailbox and find that there are 16 letters ready for you to pick up before you return to your home. Unless you're counting steps for fitness-tracking, you'll grab all the mail in the mailbox at one time before returning. How inefficient would it be to retrieve only one piece of mail from the mailbox, return to your home, read it, then go and get the next piece of mail from the mailbox again?! But that's how Diskspd historically worked without batched completions.

Previously, Diskspd would issue the requested number of I/Os (T0), then receive and record one completed I/O (T1) and reissue it (T2) before receiving and recording the next completed I/O (T3), even though both had completed at the same time. This is the equivalent of taking one letter out of the mailbox, walking back to the house, reading it and writing a response, then walking back to the mailbox to pick up the next letter. Historically, this wasn't a big problem because disks simply weren't fast enough for the effect to be observable.

The processing of completed I/Os one at a time caused Diskspd to report higher storage latency than you could actually achieve on your system. Simply put, as disks have become faster, Diskspd needed a new way to track, record, and reissue completed I/Os.

[Figure: Diskspd I/O timeline without batched completions (T0 to T3)]

Diskspd with Batched Completions

Now, with batched completions, Diskspd will receive all completed I/Os (letters in the mailbox) and record them as soon as they complete (T1). This reflects the actual time that I/Os completed and prevents Diskspd from inflating the storage latency.

To continue the mailbox example, now we walk to the mailbox once, pick up all the mail and return back inside the house. We still respond to the mail (reissue I/Os) one at a time.
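To make the effect concrete, here is a minimal Python sketch of the two recording strategies. It is a toy model, not Diskspd's actual code: the function name, the 100 µs device latency, and the 5 µs per-I/O handling cost are all assumptions chosen for illustration, and the batched case is idealized so that timestamping a batch costs nothing.

```python
def recorded_latencies(true_latency_us, count, handling_us, batched):
    """Latency a measurement loop would record for `count` I/Os that all
    complete at the same instant (`true_latency_us` after being issued).

    `handling_us` is a hypothetical cost of recording and reissuing one I/O.
    """
    recorded = []
    clock = true_latency_us  # the loop wakes up and finds `count` completions
    for _ in range(count):
        if batched:
            # Every completion is timestamped before any reissue happens.
            recorded.append(true_latency_us)
        else:
            # This completion is only timestamped when the loop reaches it,
            # after the earlier ones were recorded and reissued.
            recorded.append(clock)
            clock += handling_us
    return recorded


one_at_a_time = recorded_latencies(100, 4, 5, batched=False)
batched = recorded_latencies(100, 4, 5, batched=True)

print(one_at_a_time)                            # [100, 105, 110, 115]
print(sum(one_at_a_time) / len(one_at_a_time))  # 107.5
print(sum(batched) / len(batched))              # 100.0
```

In this model the device behaves identically in both runs; only the moment at which each completion is timestamped changes, which is why the one-at-a-time average comes out higher.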

[Figure: Diskspd I/O timeline with batched completions]

New: Look-a-sides

Now let's imagine you're moving into a new home and have several new household items being delivered to the house. To simplify your move-in-day, you order some pizza for dinner as well.

The doorbell rings so you open your door and see the delivery truck with household items and the pizza delivery in front of your house. You take the box with all the household items, ignoring the pizza which is now sitting on your front porch getting cold, and begin to unbox everything in it. Once the box has been unpacked, you reopen your front door and pick up the pizza. For those of you that really enjoy cold pizza, this analogy might not seem like a big problem!

Diskspd recently implemented functionality called “look-a-sides” intended to address a scenario similar to the analogy above.

To understand the challenge, imagine there are 16 I/Os issued (T0) and 2 of those I/Os complete shortly after. Next, Diskspd receives I/O 1 and 2 (T1 using batched completions). While Diskspd is receiving the first set of completed I/Os, more I/Os (3 and 4) complete.

[Figure: I/Os 3 and 4 completing while Diskspd receives I/Os 1 and 2]

But Diskspd doesn't record I/Os 3 and 4 as having completed yet. Instead, it continues reissuing I/Os 1 and 2. This delay in receiving and recording completed I/Os unnecessarily inflates the latency measured by Diskspd. The more I/Os kept in-flight (the larger the value of the -o parameter), the more prominent this issue becomes.

Diskspd with Look-a-sides

Now, with look-a-sides, Diskspd will receive I/Os 1 and 2 (T1) and begin to reissue I/O 1 (T2). At the earliest possible opportunity, Diskspd will look at the completion queue to see if there are more I/Os that it can receive and record as completed (T3).

Note: If there are no I/Os to receive, Diskspd simply moves on. In either event Diskspd continues reissuing any I/Os it has received (T4).
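The sequence above can be sketched as a small Python simulation. This is an illustrative model under stated assumptions, not Diskspd's implementation: the `simulate` function, the completion times, and the 5 µs reissue cost are hypothetical, and the CPU cost of the look-a-side itself is ignored.

```python
def simulate(completions_us, reissue_us, look_aside):
    """Return {io_id: recorded_latency_us} for I/Os all issued at t=0.

    `completions_us` maps an I/O id to the time its completion lands in the
    queue. `reissue_us` is a hypothetical cost of reissuing one I/O.
    """
    waiting = sorted(completions_us.items(), key=lambda kv: kv[1])
    recorded = {}
    to_reissue = []
    clock = 0

    def receive():
        # Timestamp everything that has already completed by "now".
        while waiting and waiting[0][1] <= clock:
            io_id, _ = waiting.pop(0)
            recorded[io_id] = clock
            to_reissue.append(io_id)

    while waiting or to_reissue:
        if not to_reissue:
            # Nothing left to reissue: wait for the next completion.
            clock = max(clock, waiting[0][1])
            receive()
        to_reissue.pop(0)
        clock += reissue_us
        if look_aside:
            # Peek at the completion queue between reissues.
            receive()
    return recorded


completions = {1: 100, 2: 100, 3: 102, 4: 103}
print(simulate(completions, 5, look_aside=False))
# {1: 100, 2: 100, 3: 110, 4: 110}
print(simulate(completions, 5, look_aside=True))
# {1: 100, 2: 100, 3: 105, 4: 105}
```

In this toy run, I/Os 3 and 4 truly completed at 102 µs and 103 µs. Without the look-a-side they are recorded only after both earlier I/Os were reissued; with it, they are recorded right after the first reissue.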

[Figure: Diskspd I/O timeline with look-a-sides (T0 to T4)]

Recommendation #1: Re-baseline your storage performance

Since these changes can be so dramatic, you should re-baseline your storage performance using the latest version of Diskspd. Here are comparisons we ran using some representative hardware.

[Figure: latency comparisons across queue depths on representative hardware]

The numbers reinforce two things. First, the latency reduction is fairly dramatic regardless of the drive you use; the example on the right includes enterprise-grade hardware. Second, the more I/Os Diskspd is told to keep in flight (Queue Depth), the more dramatic the measurement improvement.

Recommendation #2: Test IOPS and Latency Separately

There is a chance that when Diskspd performs a look-a-side, it will find no additional completed I/Os. This is sort of a “Schrödinger's cat” situation: Diskspd cannot know whether any I/Os are waiting without looking in the completion queue (the look-a-side), and that check uses a small amount of CPU resources.

Each time Diskspd performs a latency test, the extra CPU used to perform the look-a-sides affects the overall amount of I/O that can be pushed and lowers the reported IOPS on the system. In a simple test using a single thread and random 4K reads on a consumer disk, we found that IOPS dropped by nearly 6% (from 59.5K to 56.1K IOPS) when testing latency with look-a-sides.
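For reference, the "nearly 6%" figure follows directly from the two IOPS numbers quoted above. The helper name below is just for illustration, and you could reuse it to compare your own runs with and without -L:

```python
def percent_drop(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# IOPS (in thousands) without and with latency measurement, as quoted above.
drop = percent_drop(59.5, 56.1)
print(f"{drop:.1f}%")  # 5.7%
```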

So, you might be asking yourself, “can I turn look-a-sides off if I just want to test IOPS?” The good news is that look-a-sides are only enabled once you specify the latency parameter (-L) with Diskspd. Therefore we recommend you perform two separate performance tests: one for IOPS (without -L) and one for latency (with -L). When using -L, your IOPS measurements will be a bit lower than the maximum achievable on the system.

Here are some example Diskspd commands for Latency and IOPS testing:

  • IOPS Testing
    Diskspd.exe -t8 -o8 -b4k -r -w0 -Suw
    Note: This is only an example. You may need to try various values for -o to find the maximum.
  • Latency Testing
    Diskspd.exe -t1 -o1 -b4k -r -w0 -Suw -L
    Note: With the fixes here, you could also try small increases like -o2 or -o4.
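If you script your baselines, the two runs can be generated from one helper so that the only differences are the thread count, the queue depth, and -L. The following is a minimal Python sketch rather than an official tool. The function name, the target path `D:\diskspd-test.dat` (assumed to exist already), and the 60-second duration passed with -d are assumptions added for illustration; the remaining flags are copied from the examples above.

```python
def diskspd_args(threads, outstanding, target, duration_s=60, latency=False):
    """Build a Diskspd argument list for a random 4K read test."""
    args = [
        "Diskspd.exe",
        f"-t{threads}",      # threads per target
        f"-o{outstanding}",  # outstanding I/Os to keep in flight
        "-b4k", "-r", "-w0", "-Suw",
        f"-d{duration_s}",   # test duration in seconds (assumed value)
    ]
    if latency:
        # -L enables latency measurement, and with it look-a-sides.
        args.append("-L")
    args.append(target)
    return args


TARGET = r"D:\diskspd-test.dat"  # hypothetical test file

iops_cmd = diskspd_args(8, 8, TARGET)
latency_cmd = diskspd_args(1, 1, TARGET, latency=True)

print(" ".join(iops_cmd))
print(" ".join(latency_cmd))
```

Each list can be passed to `subprocess.run` on a machine where Diskspd is on the PATH. Keeping the IOPS run free of -L matches the recommendation to measure IOPS and latency separately.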

Summary

To keep pace with the advances in disk speeds and the improvements in Windows Server 2025, we've made investments in our storage performance benchmark tool to get you an accurate measure of latency. These improvements were so drastic that we recommend that you re-baseline the server performance in your environment and run separate performance tests for latency and IOPS. Remember to download the latest version of Diskspd along with the Windows Server 2025 evaluation.

As always, we'd love to hear your feedback below as we continue to improve these tools.

Dan “Latency Reducer” Cuomo

 

This article was originally published by Storage at Microsoft. You can find the original article here.