The LLM Latency Guidebook: Optimizing Response Times for GenAI Applications

Co-authors: Priya Kedia, Julian Lee, Manoranjan Rajguru, Shikha Agrawal, Michael Tremeer

Contributors: Ranjani Mani, Sumit Pokhariyal, Sydnee Mayers

Generative applications are transforming how we do business today, creating new, engaging ways for customers to interact with applications. However, LLMs require massive amounts of compute to run, and unoptimized applications can run quite slowly, leaving users frustrated. Creating a positive user experience is critical to the adoption of these tools, so minimizing the response time of your LLM API calls is a must. The techniques shared in this article demonstrate how applications can be sped up by up to 100x* through clever prompt engineering and a small amount of code!

Previous work has identified the core principles for reducing LLM response times. This article expands upon these by providing practical examples coupled with working code, to help you accelerate your own applications and delight customers. It is primarily intended for software developers, data scientists and application developers, though any business stakeholder managing GenAI applications should read on to learn new ideas for improving their customer experience.

Understanding the drivers of long response times

The response time of an LLM can vary based on four primary factors:

  • the model used.
  • the number of tokens in the prompt.
  • the number of tokens generated.
  • the overall load on the deployment & system.

You can imagine the model as a person typing on a keyboard, where each token is generated one after another. The speed of the person (the model used) and the amount they need to type (the number of generated tokens) tend to be the largest contributors to long response times.

Figure 1 – The response generation step typically dominates the overall response time. Not to scale.
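
To make the typing analogy concrete, the sketch below estimates a response time from the two token counts. The function name and the tokens-per-second rates are illustrative assumptions, not measured values for any particular model, and the load on the deployment is reduced to a single fixed overhead term.

```python
def estimate_response_seconds(
    prompt_tokens: int,
    generated_tokens: int,
    prompt_tokens_per_second: float,
    generated_tokens_per_second: float,
    overhead_seconds: float = 0.0,
) -> float:
    """Rough latency model: read the prompt, then 'type' the answer token by token."""
    prompt_time = prompt_tokens / prompt_tokens_per_second
    generation_time = generated_tokens / generated_tokens_per_second
    return overhead_seconds + prompt_time + generation_time


# Hypothetical rates, chosen only to show the shape of the trade-off:
# the prompt is processed quickly, while generation is slow and sequential.
long_answer = estimate_response_seconds(2000, 800, 2000.0, 20.0)
short_answer = estimate_response_seconds(2000, 200, 2000.0, 20.0)

print(f"long answer:  {long_answer:.1f}s")
print(f"short answer: {short_answer:.1f}s")
```

Under these assumed rates the prompt contributes one second in both cases, so nearly all of the difference comes from the number of generated tokens.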

Techniques for improving LLM response times

The table below contains a range of recommendations that can be implemented to improve the response times of your generative application. Where applicable, sample code is included so that you can see these benefits for yourself and copy the relevant code or prompts into your application.

| Technique | Intuition | GitHub | Potential speed-up of application |
|---|---|---|---|
| 1. Generation token compression | Prompt the LLM to return the shortest response possible. A few simple phrases in your prompt can speed up your application. Few-shot prompting can also be used to ensure the response includes all the key information. | Link | Up to 2-3x or more (20s -> 8s) |
| 2. Avoid using LLMs to output large amounts of predetermined text | Rather than rewriting documents, use the LLM to identify which parts of the text need to be edited, and use code to make the edits. For RAG, use code to simply append documents to the LLM response. | Link | Up to 16x or more (310s -> 20s) |
| 3. Implement semantic caching | By caching responses, LLM responses can be reused rather than calling Azure OpenAI again, saving cost and time. The input does not need to be an exact match: for example, “How can I sign up for Azure” and “I want to sign up for Azure” will return the same cached result. | Link | Up to 14x or more (19s -> 1.3s) |
| 4. Parallelize requests | Many use cases (such as document processing, classification etc.) can be parallelized. | Link | Up to 72x or more (180s -> 2.5s) |
| 5. Use GPT-3.5 over GPT-4 where possible | GPT-3.5 has a much faster token generation speed. Certain use cases require the more advanced reasoning capabilities of GPT-4; however, few-shot prompting or fine-tuning may sometimes enable GPT-3.5 to perform the same tasks. Generally only recommended for advanced users, after attempting other optimizations first. | Link | Up to 4x (17s -> 5s) |
| 6. Leverage translation services for certain languages | Certain languages have not been optimized, leading to long response times. Generate the output in English and leverage another model or API for the translation step. | Link | Up to 3x (53s -> 16s) |
| 7. Co-locate cloud resources | Ensure the model is deployed close to your users. Ensure Azure Search and Azure OpenAI are located as close to each other as possible (in the same region, vNet etc.). | NA | 1-2x |
| 8. Use an additional endpoint for overflow capacity | Having an additional endpoint for handling overflow capacity (for example, a PTU overflowing to a Pay-as-you-Go endpoint) can save latency by avoiding queuing when retrying requests. | Link | Up to 2x (58s -> 31s) |
| 9. Enable streaming | Streaming improves the perceived latency of the application by returning the response in chunks as soon as they are available. | Coming soon | Coming soon |
| 10. Separation of workloads | Mixing different workloads on the same endpoint can negatively impact latency, for two reasons: (1) short completions batched with longer ones have to wait before being sent back, and (2) mixing the calls can reduce your cache hit rate, as both workloads compete for the same space. | Coming soon | Coming soon |
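
As a minimal sketch of technique 2, the example below applies a list of corrections with plain string replacement instead of asking the model to retype the whole document. The JSON shape with "find" and "replace" keys, the `apply_edits` helper and the hard-coded model output are all assumptions made for illustration; they are not taken from the linked repository.

```python
import json


def apply_edits(document: str, edits: list[dict]) -> str:
    """Apply {"find": ..., "replace": ...} edits using plain string replacement."""
    for edit in edits:
        if edit["find"] not in document:
            # The model may misquote the source; skip rather than corrupt the text.
            continue
        # Replace only the first occurrence of each quoted fragment.
        document = document.replace(edit["find"], edit["replace"], 1)
    return document


document = "Teh quick brown fox jumps over the lazy dog. It were a sunny day."

# Hypothetical model output: in a real application this JSON would come from an
# LLM prompted to list only the corrections, not to rewrite the whole document.
llm_response = '[{"find": "Teh", "replace": "The"}, {"find": "It were", "replace": "It was"}]'

corrected = apply_edits(document, json.loads(llm_response))
print(corrected)
```

The model only generates the short list of edits, so the number of generated tokens scales with the number of errors rather than with the length of the document.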

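Technique 4 can be sketched with a thread pool from the Python standard library. Here `classify_document` is a hypothetical stand-in for a single LLM API call and simply sleeps to simulate network latency; the worker count is likewise an arbitrary choice, and a real deployment would also need to respect its rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def classify_document(text: str) -> str:
    """Hypothetical stand-in for one LLM API call; sleeps to simulate latency."""
    time.sleep(0.1)
    return "long" if len(text.split()) > 3 else "short"


def classify_sequential(documents: list[str]) -> list[str]:
    return [classify_document(doc) for doc in documents]


def classify_parallel(documents: list[str], max_workers: int = 8) -> list[str]:
    # LLM calls are I/O-bound, so threads can overlap the time spent waiting.
    # executor.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(classify_document, documents))


documents = [f"document number {i} with some extra words" for i in range(8)]

start = time.perf_counter()
sequential_results = classify_sequential(documents)
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
parallel_results = classify_parallel(documents)
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Because each call spends most of its time waiting on the network, the parallel version finishes in roughly the time of the slowest single call rather than the sum of all calls.
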
Putting it into practice through case studies

This section includes an overview of two case studies, which represent typical GenAI applications. Perhaps one is similar to yours! The linked code repositories show the original speed of the application, and then walk you through the process of implementing different combinations of the techniques in this document. Implementing these recommendations improved response times by between 6.8x and more than 100x!

| Case study | Techniques applied | Cumulative speed improvement | GitHub |
|---|---|---|---|
| Document processing: Rewrite a document to correct spelling errors and grammar. This example can be extended with custom logic to adapt to more specific document processing use cases. | 1. Base case | 1x (315s) | Link |
| | 2. Avoid rewriting documents | 8.3x (38s) | |
| | 3. Generation token compression | 15.8x (20s) | |
| | 4. Parallelization | 105x (3s) | |
| Retrieval Augmented Generation (RAG): Help a user troubleshoot a product which is not working. | 1. Base case | 1x (23s) | Link |
| | 2. Generation token compression | 2.3x (9.8s) | |
| | 3. Avoid rewriting documents | 6.8x (3.4s) | |
| Retrieval Augmented Generation (RAG): Provide general product information. | 1. Base case | 1x (17s) | Link |
| | 2. Semantic caching | 17x (1s) | |
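
The semantic caching step in the last case study can be illustrated with a toy cache. This is a sketch under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the 0.7 similarity threshold is arbitrary, the cached answer is placeholder text, and the linear scan would be replaced by a vector index in practice. None of these names come from the linked repository.

```python
import math
import re
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold  # illustrative value; tune on real queries
        self.entries: list[tuple[Counter, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the response of the most similar cached query, or None."""
        query_vec = embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


cache = SemanticCache()
# Placeholder answer; a real application would store the actual LLM response.
cache.put("How can I sign up for Azure", "Visit the Azure sign-up page.")

hit = cache.get("I want to sign up for Azure")   # similar wording: served from cache
miss = cache.get("What is the weather today")    # unrelated: falls through to the LLM
print(hit, miss)
```

On a miss the application would call the model as usual and then `put` the new response, so later questions with similar wording skip the LLM call entirely.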

Conclusion

With Generative AI transforming how people interact with applications, minimizing response times is essential. If you're interested in improving your GenAI application's performance, select a few of these recommendations, clone the repository, and implement them in your application's next release!

*Disclaimer: The results depicted are merely illustrative, emphasizing the potential benefits of these techniques. They are not all-encompassing and are based on a single test. Response times may differ with each run, thus the main goal is to demonstrate relative improvement. The tests are performed using the powerful, but slower, GPT-4 32k model, with a focus on improving response times. The effectiveness of techniques like error correction through document rewriting varies depending on the input; a document with many errors might take longer to correct than to rewrite entirely. Therefore, these techniques should be tailored to your application.

This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.