Load Testing RAG based Generative AI Applications

When developing applications for Language Models (LLMs), we usually spend a lot of time on both the development and evaluation phases to ensure the app delivers high-quality responses that are not only accurate but also safe for users. However, a great user experience with an LLM application isn't just about the quality of responses—it's also about how quickly these responses are provided. So, in this discussion, we'll focus on how to evaluate LLM applications for their response times.
The aim of performance evaluation is to proactively test the application to identify and address performance issues before they impact end-users. In the subsequent sections, we will explore performance evaluation in detail. We will discuss building an effective strategy, mastering evaluation techniques, and provide practical guides. Here's what you can expect:

 

Building an Effective Strategy

Each application has unique characteristics, such as user count, transaction volume, and expected response time. Therefore, it's crucial for you to establish an effective evaluation strategy tailored to the specific application you're evaluating.

 

Before initiating the tests, you need to outline your strategy, which includes determining the aspects to test and the methods to use. This section provides a detailed discussion on these considerations.

 

Identifying What to Evaluate

 

Let's start by defining what you are going to test. For example, if the application is fully implemented and running in an environment similar to production, you can conduct a comprehensive load test. This allows you to measure performance and anticipate the user experience before the application is released to end users.

 

Testing the entire application is a good idea as it provides a measure of response times that closely mirrors what a user will experience when interacting with the application. However, a user's interaction with a Large Language Model (LLM) App involves several elements. These include the application frontend, backend, networking, the LLM model, and other cloud services like databases and AI services.
With that in mind, you have the chance to conduct performance evaluations on particular services even before the whole application is finalized and prepared for deployment. For instance, you can proactively assess the performance of the Large Language Model (LLM) that you plan to incorporate in the application, even if it isn't fully ready yet. In the Mastering Evaluation Techniques section, you will see how you can test the performance of a model deployed in the Azure OpenAI service.
Let's examine an example of an application structure where multiple services cooperate to deliver a user response. One frequently used architectural design in Large Language Model (LLM) Applications like ChatGPT is the Retrieval Augmented Generation (RAG). This architecture involves a retrieval step before invoking the LLM to generate content, which is essential for providing grounding data.
The Enterprise RAG Solution Accelerator architecture offers a practical example of the RAG pattern implemented in an enterprise setting. In the How-To Guides section, you can find an example of load testing on an LLM application that uses the RAG pattern. The following image illustrates the orchestration flow within an LLM application based on RAG. For simplicity, we have not depicted the return messages in the diagram.

 

perftest-GPT-RAG-Basic-communication.png
Example of communication between the components of an LLM App based on the RAG pattern.

 

Here's how it works:
  1. The user interacts with the frontend UI to pose a question.
  2. The frontend service forwards the user's question to the Orchestrator.
  3. The Orchestrator retrieves the user's conversation history from the database.
  4. The Orchestrator accesses the AI Search key stored in the Key Vault.
  5. The Orchestrator retrieves relevant documents from the AI Search index.
  6. The Orchestrator uses Azure OpenAI to generate a user response.
Optional connections (dashed arrows):
  • The connection from the App Service to Storage Account indicates the scenario when the user wants to view the document that grounds the provided answer.
  • The connection from the App Service to Speech Services indicates the cases when the user wishes to interact with the application through audio.
Each step in the process involves data transfer and processing across various services, all contributing to the total response time. In such scenarios, you can evaluate not just the overall application response time, but also the performance of individual components, like the response times of the Azure OpenAI model deployment.
Ultimately, the scope of testing depends on each application's specific requirements. For instance, an internal application and a public-facing application may have different performance needs. While a response time of 15 seconds might be acceptable in an internal HR application used to view paychecks, a contact center app with hundreds of users might need to respond much faster due to its high user demand and SLAs. Make sure you know the requirements of your application before starting your performance evaluation.

Test Scenario

After determining what you will evaluate in your application, it's essential to define your test scenario. During the LLM App development, we often start our performance tests in a simple way, with a single request. This first test might include measuring the time it takes for completions to appear in the playground, gauging the time a Prompt flow takes to execute in AI Studio, or calculating the time necessary for code execution in VS Code. The data from this single run can help detect potential performance issues in specific components early on.
However, for a genuine perspective on the application's performance, you will need to establish a test scenario that mirrors the actual user load on the application. This entails conducting tests under circumstances that closely resemble real-world usage. By doing so, your assessments will accurately depict how the application would perform under standard operational conditions. We will offer some suggestions on how to accomplish this shortly.
First, we need to determine the load that will be placed on the application. This load is defined in terms of throughput, which is the number of requests the application will receive within a specific time frame, such as Requests per Minute (RPM).
There are multiple ways to estimate the expected throughput. If the application is already operational, you can use its current usage data, gathered from monitoring tools, as a reference.  The subsequent figure illustrates this approach. If you foresee an increase in usage due to the integration of LLM into the solution, you should adjust your throughput estimation to accommodate this anticipated growth.
perftest-users-per-hour.png
Example of usage scenario, see the peak load from 10h to 13h hours.
When dealing with a new application, estimating the expected throughput can be approached through benchmarking or usage modeling. Benchmarking involves comparing your application with similar ones that serve the same target audience. By studying their usage patterns, you can get a rough estimate of the expected throughput for your application.
Usage modeling, on the other hand, requires you to create a model of the expected usage patterns of your application. This can be achieved by interacting with stakeholders or potential users in the specific field for which the application is being developed. Their insights can provide a better understanding of how the application might be used, which can assist in estimating the Requests Per Minute (RPM).
One approach to model your application usage is starting by identifying the total number of users. This should encompass all registered, or potential users of your application. Then, identify the number of these users who are active during peak usage times. Focusing on peak usage times is crucial as off-peak data may not accurately reflect system performance, particularly for systems with distinct high usage periods.
Next, estimate the average number of times a user will use the application during peak times. This is referred to as sessions. Also, estimate the number of actions or interactions a user makes during a session. Each interaction corresponds to a request made to the application.
For example, consider a mobile app for a shopping mall. Users are likely to have multiple interactions in a single session. They might first search for recommended restaurants, then ask about the operating hours of a specific restaurant. Each of these actions is an interaction.
perftest-sample-sequence-diagram2.png
Example of a user session.
Once you have the total number of users (u), the percentage (p) of them that will use the application during the peak usage hours (n), the average number of user sessions (s), and the typical number of interactions (i) each user has per session, you can use the following formula to derive the RPM to run your load test.
RPM = (u * p * s * i) / n / 60
Taking the previous example of the Mall App, let's consider a set of 10,000 registered users on the App. We expect that during peak hours, 10% of the users will interact at least once with the application to obtain information, such as store locations or product details.

In this case, we have:
  • u=10000 (total users)
  • p=0.1 (percentage of active users during peaktime)
  • s=1 (sessions per user)
  • i=2 (interactions per session)
  • n=1 (peaktime duration in hours)
Therefore, the expected throughput for the peak hours is approximately 17 RPM.

Note: During testing, you may want to reproduce a load that is about 10% higher than estimated to be more conservative.

Defining the scenario for larger applications can become complex, requiring the identification of distinct user roles and their usage behavior. However, if exact modeling is challenging due to lack of information, just keep things simple, make an educated guess, and validate it with the application's stakeholders.

Another factor to consider when defining your test scenario is that LLM response times depend on the sizes of prompts and completions. Accurate testing requires replicating real usage scenarios, matching these sizes. For instance, RAG prompts are typically larger due to context text, while proofreading apps usually have similar prompt and completion sizes.
It's worth noting that the examples provided here are primarily based on the perspective of a real-time application user. However, it's also possible to design other performance test scenarios, such as for applications operating in batch mode. In this case, for example, one would need to consider the volume of data to be processed within a specific time window to determine if the application can handle the load within the given time frame.
Whether your application is real-time or batch-oriented, it's crucial to accurately define your test scenario to reflect actual usage conditions. This will ensure that your performance tests yield meaningful results that can help optimize your application's performance.

Test Data

Performance testing heavily relies on the data used during the test execution. It's crucial to use data that closely mirrors real-world scenarios. For instance, if we're testing an application like a copilot, the test results would be more accurate if each user asks different questions. Even if users ask the same questions, they should phrase them differently.

Consider a scenario where each virtual user asks the exact same question during the test. This could lead to results that don't accurately represent real-world usage. Certain components of the application might leverage caching mechanisms to deliver faster responses, skewing the results. Furthermore, the final metric, typically an average or a percentile, will be biased towards the repeated question.

Having experts in the App domain contribute to the creation of the test data set can greatly enhance its quality and relevance. Their knowledge can help shape more realistic and relevant examples. Alternatively, a Large Language Model (LLM) can be utilized to generate a synthetic dataset. This approach can be particularly useful for running tests, as it allows for the creation of diverse and comprehensive data scenarios. This practice not only enhances the quality of the tests but also ensures that they cover a wide range of potential real-world situations.

Test Measurements

Performance testing requires identifying key metrics. A common one is Response Time, the total time from sending a request to receiving a response. Performance requirements are typically determined by this metric, such as needing an average response time under ten seconds, or 95% of responses within ten seconds.

However, response time is not the only metric of interest. To gain a holistic understanding of the application's performance and the factors affecting it, we can categorize the metrics into two groups.

The first group comprises metrics that can be measured from the client's perspective – what the client can observe and capture. The second group consists of metrics that are measured by monitoring the server's performance. Let's explore each of these groups in detail:

Client metrics

When testing an LLM App, we usually obtain the following client metrics:
Metric Description
Number of Virtual Users This metric shows the virtual user count during a load test, helping assess application performance under different user loads.
Requests per Second This is the rate at which requests are sent to the LLM App during the load test. It's a measure of the load your application can handle.
Response Time This refers to the duration between sending a request and receiving the full response. It does not include any time spent on client-side response processing or rendering.
Latency The latency of an individual request is the total time from just before sending the request to just after the first response is received. 
Number of Failed Requests This is the count of requests that failed during the load test. It helps identify the reliability of your application under stress.
Note: Response time, often referred to as “end-to-end response time,” includes all processing time within the system. However, this measurement only covers the interval between sending a request and receiving the response. It does not account for any client-side processing time, such as rendering a webpage or executing JavaScript in a web application.
Note: Latency and response time are terms that are often used interchangeably, and their definitions can vary depending on the context. In the Azure OpenAI documentation, latency is defined as “the amount of time it takes to get a response back from the model.” For consistency, we will use the definition provided in the previous table, which is based on the concepts from Azure Load Testing.
The diagram below illustrates how processing and communication times add up to the total response time. In the figure, each Tn marks a specific time during processing. T1 is when the user initiates a request through a client, such as a browser or MS Teams. T10 is when the user gets the full response. Note that the total response time (from T1 to T10) depends on the processing and response times of all components involved in the request.

perftest-response-time.png

Simplified example of the breakdown of request response time.

Performance Metrics for a LLM

When conducting performance testing directly on a specific service, we can collect specific client-side metrics for the target service. In the context of performance testing a Language Model (LLM), we should consider metrics related to prompt tokens and response tokens. For instance, consider the deployment of an OpenAI model on Azure. The following table presents some of these metrics, which offer valuable insights into the client's interaction with the model deployment and its performance under load.

Metric Description
Number Prompt Tokens per Minute Rate at which the client sends prompts to the OpenAI model.
Number Generated Tokens per Min Rate at which the OpenAI model generates response tokens.
Time to First Token (TTFT) The time interval between the start of the client's request and the arrival of the first response token.
Time Between Tokens (TBT) Time interval between consecutive response tokens being generated.
Note: To examine the time intervals between tokens in Azure OpenAI's responses, you can utilize its streaming feature. Unlike conventional API calls that deliver the entire response at once, streaming sends each token or a set of tokens to the client as soon as they are produced. This allows for real-time performance monitoring and detailed analysis of the dynamics of response generation.

The diagram below provides a simplified view of a client's interaction with a model endpoint. The interaction commences at the moment (T0) when the client sends a request to the model's endpoint. The model responds in streaming mode, with T1, T2, and TN representing the moments when the first, second, and last tokens are received, respectively.

perftest-aoai-response-time.png
AOAI deployment response in streaming mode.
In this scenario, we define several key metrics: Time to First Token (TTFT) is T1 – T0, Time Between Tokens (TBT) is T2 – T1, and the end-to-end response time is TN – T0. It's important to note that in streaming mode, the model's responses can arrive in multiple parts, each with several tokens. This makes both the diagram and the metrics an approximate representation of real-world scenarios.
Server metrics
During performance testing, we focus on two types of metrics. The first type is client metrics, which directly affect the user experience. The second type is server metrics, which give us insights into the performance of server components.

Server metrics encompass a wide range of measurements. For instance, we might look at the CPU and memory usage of the application service running the frontend. We could also monitor the utilization of resources like the Azure OpenAI PTU deployment. These are just a few examples; there are many other server metrics we could potentially examine.

By collecting these measurements, we can create a detailed performance profile of the entire solution. This profile helps us identify any bottlenecks and tune any components that are not performing optimally. LLM Apps consist of various services, and the server metrics we utilize will vary based on these services. To give you an idea, here are some examples of the metrics we might gather, depending on the specific service in use:

Service Name Metric Description
Azure OpenAI Azure OpenAI Requests Total calls to Azure OpenAI API.
Azure OpenAI Generated Completion Tokens Output tokens from Azure OpenAI model.
Azure OpenAI Processed Inference Tokens The number of input and output tokens that are processed by the Azure OpenAI model.
Azure OpenAI Provision-managed Utilization V2 The percentage of the provisioned-managed deployment that is currently being used.
Azure App Service CPU Percentage The percentage of CPU used by the App backend services.
Azure App Service Memory Percentage The percentage of memory used by the App backend services.
Azure Total Requests Number of requests made to .
Azure Provisioned Throughput The amount of throughput that has been provisioned for a container or database.
Azure Cosmos DB Normalized RU Consumption The normalized request unit consumption based on the provisioned throughput.
Azure API Management Total Requests Total number of requests made to APIM.
Azure API Management Capacity Percentage of resource and queue usage in APIM instance.

When should I evaluate performance?

You might be wondering when to execute performance tests. To help us in this discussion, let's take a look at the Enterprise LLM Lifecycle, illustrated in the following image.

perftest-llmlifecycle.png

Enterprise LLM Lifecycle.

Performance testing is crucial and should start as early as possible during the development process. This early start provides enough time for making necessary adjustments and optimizations. The exact timing, however, depends on what aspects of the application you're testing.

If your goal is to evaluate the performance of the entire LLM App before it's used by end-users, the application must be fully developed and deployed to a staging environment. Typically, this load testing of the LLM App occurs during the initial iterations of the Operationalization loop in the Enterprise LLM Lifecycle.

Keep in mind that there are scenarios where performance evaluations can be conducted before Operationalization. For instance, during the Experimenting and Ideating phase, you might be exploring various LLMs for use. If you're considering using one of the models available on Azure OpenAI, this could be an excellent time to conduct a performance benchmark test using the Azure OpenAI benchmarking tool.

The following figure illustrates the moments in the LLM lifecycle where the two types of performance tests mentioned earlier are usually conducted.

perftest-llmlifecycle-with-tests.png
Performance tests in the LLM Lifecycle.

Mastering Evaluation Techniques

Great job on your journey so far in learning the essentials of your testing strategy! As we proceed in this section, we will be examining two distinct evaluation techniques. The first technique will concentrate on the performance testing of the entire LLM application, while the second will be primarily focused on testing the deployed LLM. It's important to remember that these are just two popular instances from a wide-ranging list. Depending on your unique performance requirements, integrating other techniques into your testing strategy may prove beneficial.

LLM App Load Testing

Azure Load Testing is a fully managed load-testing service that enables you to generate high-scale LLM App load testing. The service simulates traffic for your applications, regardless of where they're hosted. You can use it to test and optimize application performance, scalability, or capacity of your application. You have the flexibility to create and execute load tests either through the Azure portal or via the Azure Command Line Interface (CLI), managing and running your tests in the way that suits you best.

Azure Load Testing helps you simulate a large number of users sending requests to a server to measure how well an application or service performs under heavy load. You can use Apache JMeter to set up and run these tests. These can act like real users, doing things like interacting with the service, waiting, and using data. In the How-To Guides section, you will find a guide on how you can test your LLM App with a practical example.

The diagram below shows the high-level architecture of Azure Load Testing. It uses JMeter to simulate heavy server loads and provides detailed performance metrics. You can adjust the number of test engine instances to meet your load test requirements, making the system scalable and robust.

perftest-azure-load-testing.png
Azure Load Testing Overview.

LLM App load testing is crucial for identifying performance issues and ensuring that your application and its Azure dependencies (like the App Service, Function App, and Cosmos DB) can handle peak loads efficiently.

The following table offers an explanation of important concepts associated with Azure Load Testing. Grasping these concepts is essential for effectively using Azure's load testing features to evaluate the performance of the LLM App under various load scenarios.

Concept Description
Test Refers to a performance evaluation setup that assesses system behavior under simulated loads by configuring load parameters, test , and target environments.
Test Run Represents the execution of a Test.
Test Engine Engine that runs the JMeter test scripts. Adjust load test scale by configuring test engine instances.
Threads Are parallel threads in JMeter that represent virtual users. They are limited to a maximum of 250.
Virtual Users (VUs) Simulate concurrent users. Calculated as threads * engine instances.
Ramp-up Time Is the time required to reach the maximum number of VUs for the load test.
Latency The latency of an individual request is the total time from just before sending the request to just after the first response is received. 
Response Time This refers to the duration between sending a request and receiving the full response. It does not include any time spent on client-side response processing or rendering.
Azure Load Testing allows for the definition of parameters, which include environment variables, secrets, and certificates. Among its features are test scaling, the setting of failure criteria, and the monitoring of server metrics for application components. Additionally, you can use CSV files to define your test data and upload JMeter configurations for flexible, customizable test scripts.

You can securely store keys and credentials used during the test as Azure Key Vault secrets, and Azure Load Testing can also have its managed identity for access to Azure resources. When deployed within your virtual , it can generate load directed at your application's private endpoint. Application authentication through access tokens, user credentials, or client certificates is also supported, depending on your application's requirements.

Monitoring Application Resources

With Azure Load Testing, you can monitor your server-side performance during load tests. You can specify which Azure application components to monitor in the test configuration. You can view these server-side metrics both during the test and afterwards on the load testing dashboard. The following figure shows an example of server-side metrics obtained from an App Service after running a test. You can see the Azure services from which you can obtain server-side metrics in this link.

perftest-server-metrics.png
Azure Load Testing Server-side Performance Metrics.

Load Testing Automation

Integrating Azure Load Testing into your CI/CD pipeline is a key step in enhancing your organization's adoption of LLMOps practices. This integration enables automated load testing, ensuring consistent performance checks at crucial points in the development lifecycle. You can trigger Azure Load Testing directly from Azure DevOps Pipelines or GitHub Actions workflows, providing a simplified and efficient approach to performance testing. Below are some examples of commands to automate the creation and execution of a load test.
# Sample command to create a load test
az loadtest create 
  --name $loadTestResource 
  --resource-group $resourceGroup 
  --location $location 
  --test-file @path-to-your-jmeter-test-file.jmx 
  --configuration-file @path-to-your-load-test-config.yaml
# Sample command to run the load test
az loadtest run 
  --name $loadTestResource 
  --resource-group $resourceGroup 
  --test-id $testId

 

 
Key Metrics to Monitor During Load Tests
When conducting load tests, it's crucial to monitor certain key metrics to understand how your application performs under stress. These metrics will help you identify any potential bottlenecks or areas that need optimization. Here are some of the most important ones to keep an eye on:
  •    Request Rate: Monitor the request rate during load testing. Ensure that the LLM application can handle the expected number of requests per second.
  •    Response Time: Analyze response times under different loads. Identify bottlenecks and optimize slow components.
  •    Throughput: Measure the number of successful requests per unit of time. Optimize for higher throughput.
  •    Resource Utilization: Monitor CPU, memory, and disk usage. Ensure efficient resource utilization.
Best Practices for Executing Load Tests
To ensure your load tests are effective and yield meaningful insights, it's worthwhile to review the following recommendations. Here are some key strategies to consider:
  •    Test Scenarios: Create realistic test scenarios that mimic actual user behavior
  •    Ramp-Up Strategy: Gradually increase the load to simulate real-world traffic patterns. The warm-up period typically lasts between 20 to 60 seconds. After the warm-up, the actual load test begins
  •    Think Time: Include think time between requests to simulate user interactions.
  •    Geographical Distribution: Test from different Azure regions to assess global performance.
Performance Tuning Strategies for LLM Apps
This section discusses performance tuning for LLM Apps. Application performance is heavily influenced by design and architecture. Effective structures can manage high loads, while poor ones may struggle. We'll cover various performance tuning aspects, not all of which may be universally applicable.
Application Design
  • Optimize Application Code: Examine and refine the algorithms and backend systems of your LLM application to increase efficiency. Utilize asynchronous processing methods, such as Python's async/await, to elevate application performance. This method allows data processing without interrupting other tasks.
  • Batch Processing: Batch LLM requests whenever possible to reduce overhead. Grouping multiple requests for simultaneous processing improves throughput and efficiency by allowing the model to better leverage parallel processing capabilities, thereby optimizing overall performance.
  • Implement Caching: Use caching for repetitive queries to reduce the application's load and speed up response times. This is especially beneficial in LLM applications where similar questions are frequently asked. Caching answers to common questions minimizes the need to run the model repeatedly for the same inputs, saving both time and computational resources. Some examples of how you can implement this include using Redis as a semantic cache or Azure APIM policies.
  • Revisit your Retry Logic: LLM model deployments might start to operate at their capacity, which can lead to 429 errors. A well-designed retry mechanism can help maintain application responsiveness. With the OpenAI Python SDK, you can opt for an exponential backoff algorithm. This algorithm gradually increases the wait time between retries, helping to prevent service overload. Additionally, consider the option of falling back on another model deployment. For more information, refer to the load balance item in the Solution Architecture section.
Prompt Design
  • Generate Less Tokens: To reduce model latency, create concise prompts and limit token output. According to the OpenAI latency optimization guide, cutting 50% of your output tokens can reduce latency by approximately 50%. Utilizing the ‘max_tokens' parameter can also expedite response time.
  • Optimize Your Prompt: If dealing with large amounts of context data, consider prompt compression methods. Approaches like those offered by LLMLingua-2, fine-tuning the model to reduce lengthy prompts, eliminating superfluous RAG responses, and removing extraneous HTML can be efficient. Trimming your prompt by 50% might only yield a latency reduction of 1-5%, but these strategies can lead to more substantial improvements in performance.
  • Refine Your Prompt: Optimize the prompt text by placing dynamic elements, such as RAG results or historical data, toward the end of your prompt. This enhances compatibility with the KV cache system commonly used by most large language model providers. As a result, fewer input tokens need processing with each request, increasing efficiency.
  • Use Smaller Models: Whenever possible, pick smaller models because they are faster and more cost-effective. You can improve their responses by using detailed prompts, a few examples, or by fine-tuning.
Solution Architecture

  • Provisioned Throughput Deployments: When using Azure OpenAI use provisioned throughput in scenarios requiring stable latency and predictable performance, avoiding the ‘noisy neighbor' issue in regular standard deployments.
  • Load Balancing LLM Endpoints: Implement load balancing for LLM deployment endpoints. Distribute the workload dynamically to enhance performance based on endpoint latency. Establish suitable rate limits to prevent resource exhaustion and ensure stable latency.
  • Resource Scaling: If services show strain under increased load, consider scaling up resources. Azure allows seamless scaling of CPU, RAM, and storage to meet growing demands.
  • Network Latency: Position Azure resources, like the Azure OpenAI service, near your users geographically to minimize network latency during data transmission to and from the service.

Azure OpenAI Benchmarking

The Improved Azure OpenAI Benchmarking Tool enables you to assess the performance of Azure OpenAI deployments and choose the ideal model and deployment approach (PTU vs. pay-as-you-go) for your specific needs. It simulates various traffic patterns and provides detailed latency metrics. This tool is particularly useful during model selection and experimentation in the initial phases of a project, as it assists developers in determining whether the model deployment is appropriately sized for your project needs.

The Benchmarking tool works by creating traffic patterns that mirror the expected test load. The tool can either automatically generate test requests with random words in the prompt (simulating a workload with a certain number of context tokens), or it can be used with pre-generated messages data, such as data captured or generated from an existing production application. By default, each request is also given a random prefix to ensure that the serving engine processes each request, which is designed to avoid overly positive results that might come from server-side engine optimizations like caching. When conducting the test, it is important to make sure each test runs for long enough for the throughput to reach a stable state, especially when the utilization is close to or at 100%. 180 seconds is generally long enough for most tests.

Test Parameters

The benchmarking tool contains a number of configuration parameters to configure the test, as well as two script entry points. The benchmark.bench entry point is the basic script point, while the benchmark.contrib.batch_runner entry point can run batches of multiple workload configurations, and will automatically warm up the model endpoint prior to each test workload. It is recommended to use the batch_runner entry point to ensure accurate results and a much simpler testing process, especially when running tests for multiple workload profiles or when testing with PTU model deployments.

The README details the many different options for configuring and running benchmark tests, but some of the key parameters are as follows:

Parameter Description
rate Controls the frequency of requests in Requests Per Minute (RPM), allowing for detailed management of test intensity.
clients Enables you to specify the number of parallel clients that will send requests simultaneously, providing a way to simulate varying levels of user interaction.
context-generation-method Allows you to select whether to automatically generate the context data for the test (–context-generation-method generate), or whether to use existing messages data for the test (–context-generation-method replay)
shape-profile Adjusts the request characteristics based on the number of context and generated tokens, enabling precise testing scenarios that reflect different usage patterns. Options include “balanced”, “context”, “custom” or “generation”.
context-tokens (for custom shape-profile) When context-generation-method = generate and shape-profile = custom, this allows you to specify the number of context tokens in the request.
max-tokens (for custom shape-profile) This allows you to specify the maximum number of tokens that should be generated in the response.
aggregation-window Defines the duration, in seconds, for which the data aggregation window spans. Before the test hits the aggregation-window duration, all stats are computed over a flexible window, equivalent to the elapsed time. This ensures accurate RPM/TPM stats even if the test ends early due to hitting the request limit. A value of 60 seconds or more is recommended.
log-save-dir If provided, the test log will be automatically saved to the directory, making analysing and comparing different benchmarking runs simple.

Warming up PTU endpoints

When testing PTU deployments, it is important to warm up the endpoint prior to the benchmarking workload. This is because PTU endpoints offer a short period of burst capacity until their utilization reaches 100%, after which they will revert back to providing the same throughput as can be calculated with the PTU capacity calculator (accessible in the Quotas tab of Azure OpenAI Studio). To make this process easier, the benchmark.bench.batch_runner entry point will automatically detect and warm-up PTU endpoints, ensuring accurate and realistic results with minimal effort.

 

Retry Strategy

The retry parameter allows you to set the retry strategy for requests, offering options such as “none” or “exponential”, which can be crucial for handling API request failures effectively. When setting up a retry strategy for Azure OpenAI benchmarking, it's crucial to select an approach that carefully balances resource capacity to avoid skewing latency statistics.

When running a test with retry=none, throttled requests are immediately retried with a reset start time, and latency metrics only reflect the final successful attempt, which may not represent the end user's experience. Use this setting for workloads within resource limits without throttling or to assess how many requests need redirecting to a backup during peak loads that surpass the primary resource's capacity.

Conversely, with retry=exponential, failed or throttled requests are retried with exponential backoff, up to 60 seconds. This approach is only recommended when testing endpoints that have automatic rerouting to a backup resource, and can result in unrealistic latency metrics if used with a test configuration that would ordinarily result in throttling. In general, always use retry=none unless you have backup resources configured behind an endpoint.

Output Metrics

When you run the test, you will obtain average and 95th percentile metrics from the following measures:


measure description
ttft Time to First Token. Time in seconds from the beginning of the request until the first token was received.
tbt Time Between Tokens. Time in seconds between two consecutive generated tokens.
e2e End to end response time.
context_tpr Number of context tokens per request.
gen_tpr Number of generated tokens per request.
util Azure OpenAI deployment utilization percentage as reported by the service (only for PTU deployments).

Sample Scenarios

 

1. Using the benchmark.bench entrypoint
In the following example, taken from the tool's README, the benchmarking tool tests a traffic pattern that sends requests to the gpt-4 deployment in the ‘myaccount' Azure OpenAI resource at a rate of 60 requests per minute, with the retry set to none, and with all logs saved to the logs/ directory. The default traffic shape is used, where each request contains 1000 context tokens, and the maximum response size is limited to 500 tokens.
$ python -m benchmark.bench load 
    --deployment gpt-4 
    --rate 60 
    --retry none 
    --log-save-dir logs/ 
    https://myaccount.openai.azure.com

2023-10-19 18:21:06 INFO     using shape profile balanced: context tokens: 500, max tokens: 500
2023-10-19 18:21:06 INFO     warming up prompt cache
2023-10-19 18:21:06 INFO     starting load...
2023-10-19 18:21:06 rpm: 1.0   requests: 1     failures: 0    throttled: 0    ctx tpm: 501.0  gen tpm: 103.0  ttft avg: 0.736  ttft 95th: n/a    tbt avg: 0.088  tbt 95th: n/a    e2e avg: 1.845  e2e 95th: n/a    util avg: 0.0%   util 95th: n/a   
2023-10-19 18:21:07 rpm: 5.0   requests: 5     failures: 0    throttled: 0    ctx tpm: 2505.0 gen tpm: 515.0  ttft avg: 0.937  ttft 95th: 1.321  tbt avg: 0.042  tbt 95th: 0.043  e2e avg: 1.223 e2e 95th: 1.658 util avg: 0.8%   util 95th: 1.6%  
2023-10-19 18:21:08 rpm: 8.0   requests: 8     failures: 0    throttled: 0    ctx tpm: 4008.0 gen tpm: 824.0  ttft avg: 0.913  ttft 95th: 1.304  tbt avg: 0.042  tbt 95th: 0.043  e2e avg: 1.241 e2e 95th: 1.663 util avg: 1.3%   util 95th: 2.6% 
2. Using the benchmark.contrib.batch_runner entrypoint
In the following example, taken from the tool's README, the batch_runner executes the following two traffic patterns for 120 seconds each, making sure to automatically warm up the endpoint prior to each run, and also saving all request input and output content from each run:
  • context_tokens=500, max_tokens=100, rate=20
  • context_tokens=3500, max_tokens=300, rate=7.5

With the num-batches and batch-start-interval parameters, it will also run the same batch of tests every hour over the next 4 hours:

$ python -m benchmark.contrib.batch_runner https://myaccount.openai.azure.com/ 
    --deployment gpt-4-1106-ptu --context-generation-method generate 
    --token-rate-workload-list 500-100-20,3500-300-7.5 --duration 130 
    --aggregation-window 120 --log-save-dir logs/ 
    --start-ptum-runs-at-full-utilization true --log-request-content true 
    --num-batches 5 --batch-start-interval 3600

 

For more detailed examples, refer to the README within the repository.

Processing and Analyzing the Log Files

After running the tests, the separate logs can be automatically processed and combined into a single output CSV. This CSV will contain all configuration parameters, aggregate performance metrics, and the timestamps, call status and content of every individual request.

$ python -m benchmark.contrib.combine_logs logs/ combined_logs.csv --load-recursive

With the combined CSV file, the runs can now easily be compared to each other, and with the individual request data, more detailed graphs that plot all request activity over time can be generated.

 

Monitoring AOAI Resource

Configuring diagnostic settings for Azure OpenAI Service is a good practice for monitoring the availability, performance, and operation of your Azure resources during performance tests. These settings allow the collection and analysis of metrics and log data from your Azure OpenAI resource.

After configuring the diagnostic settings, you can start querying the generated logs. Simply access your Azure OpenAI resource in the portal and then select Logs in the Monitoring section. Next, click on the Log Analytics Workspace that you selected during the diagnostic settings configuration and select the workspace's Logs option.

Below is a query example that retrieves logs from AzureDiagnostics for “ChatCompletions_Create” operations, conducted between 3:30 PM and 4:30 PM on April 26, 2024. It selects logs with details such as timestamp, resource, operation, duration, response code, and additional properties, enabling a detailed analysis of the operation's performance and outcomes during that hour.
AzureDiagnostics
| where TimeGenerated between(datetime(2024-04-26T15:30:00) .. datetime(2024-04-26T16:30:00))
| where OperationName == "ChatCompletions_Create"
| project TimeGenerated, _ResourceId, Category, OperationName, DurationMs, ResultSignature, properties_s​
perftest-azure-diagnostics.png
Analyzing Azure OpenAI Metrics with .

How-To Guides

Now that you understand the concepts for conducting performance tests, you can refer to the following sections where we provide a detailed guide on how to use the tools mentioned in the text to test your LLM App or your Azure OpenAI model deployment.

Wrapping Up

In conclusion, performance evaluation is crucial in optimizing LLM applications. By understanding your application's specifics, creating an efficient strategy, and utilizing appropriate tools, you can tackle performance issues effectively. This boosts user experience and ensures that your application can handle real-world demands. Regular performance evaluations using methods such as load testing, benchmarking, and can lead to your LLM application's ultimate success.

 

This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.