Best Practice Guidance for PTU

Co-authors: Luca Stamatescu, Alex Shen, Ming Gu, Michael Tremeer

Contributors: Ranjani Mani

Proficiency with Azure OpenAI is increasingly important for teams building AI applications. Drawing on the announcements made at Build 2024, this guide from our Global Black Belt team focuses on managing Provisioned Throughput Units (PTUs) to optimize both the performance and the cost-efficiency of deployments. It covers strategies for PTU management, alignment with the Azure Well-Architected Framework, and the latest Azure API Management capabilities, so that you can make full use of Azure OpenAI.

Understanding and Leveraging PTUs for Azure OpenAI

Understanding the Value of PTUs

Provisioned Throughput Units (PTUs) are designed to deliver consistent and predictable performance for AI models on Azure OpenAI, making them an essential consideration for efficient deployments. By understanding the advantages of PTUs, such as cost-effectiveness through appropriate sizing and handling peak loads with overflow patterns, one can appreciate how PTUs strike a balance between performance and cost-efficiency. These units maintain minimal latency and provide a high-quality user experience, even during peak usage times. By integrating PTUs with Pay-As-You-Go options for additional traffic, optimal cost management is achievable while maintaining high service levels.

Call To Action: For a comprehensive understanding of how PTUs can improve the performance and cost-effectiveness of an Azure OpenAI deployment, please refer to the following resources.

Primary Resource: Azure OpenAI Service provisioned throughput – Azure AI services | Microsoft Learn

Additional Resources:

Right-size your PTU deployment and save big

 

Estimating PTU Requirements

Accurately estimating PTUs for a deployment requires analysis of specific scenarios, whether the traffic is consistent or subject to peaks. Implementing spillover or bursting strategies can efficiently manage unexpected traffic surges without the need for excessive provisioning. Using historical data alongside the Azure OpenAI Capacity Calculator allows for the simulation of various traffic patterns and precise PTU estimations. It's important to align with the customer's performance expectations and their tolerance for potential performance dips during peak periods.
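As a rough illustration of how historical data can inform the spillover decision, the sketch below takes a per-minute token history and reports what share of tokens would exceed a given provisioned capacity. The function name, the sample history, and the capacity values are hypothetical. Converting tokens-per-minute into a PTU count is model-specific and is left to the Azure OpenAI Capacity Calculator.

```python
from typing import Sequence


def spillover_share(tokens_per_minute: Sequence[int], provisioned_tpm: int) -> float:
    """Return the fraction of all tokens that exceed the provisioned capacity.

    Each element of tokens_per_minute is the total token demand observed in
    one minute. Demand above provisioned_tpm in a minute counts as spillover;
    unused capacity in quiet minutes is not carried forward.
    """
    total = sum(tokens_per_minute)
    if total == 0:
        return 0.0
    overflow = sum(max(0, demand - provisioned_tpm) for demand in tokens_per_minute)
    return overflow / total


# Hypothetical history: mostly steady traffic with two peak minutes.
history = [40_000, 42_000, 38_000, 90_000, 41_000, 120_000, 39_000, 40_000]

for capacity in (50_000, 100_000, 120_000):
    share = spillover_share(history, capacity)
    print(f"provisioned {capacity:>7} TPM -> {share:.1%} of tokens spill over")
```

Comparing a few candidate capacities in this way makes the trade-off explicit: sizing for the peak removes spillover entirely, while sizing nearer the steady-state load leaves a measurable share of traffic for a Pay-As-You-Go endpoint.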

Call To Action: Utilize the Azure OpenAI Capacity Calculator to prepare accurate PTU estimates. Engage with the GBB team for benchmarking support to ensure optimal solution sizing.

Primary Resource: Azure Open AI Solution Sizing Tool

Additional Resources:

Azure/azure-openai-benchmark: Azure OpenAI benchmarking tool (github.com)

Securing and Optimizing Azure OpenAI Deployments

Security, Compliance, and PII Management Guidance

It is crucial to guide new customers in securing their Azure OpenAI deployments. This involves using Microsoft Entra ID for authentication and implementing role-based access control (RBAC). Educating customers about the security strategies and potential vulnerabilities outlined in the Azure Well-Architected Framework is essential for strengthening their security posture. Discuss the importance of a strategy that includes plans for fine-tuned models and training data, and of implementing Private Endpoints to restrict network access to trusted sources only.

Handling PII through GenAI Gateway: The GenAI Gateway offers options for managing Personally Identifiable Information (PII), such as centralized detection and masking, and using the Azure Text Analytics API for automated PII detection. Workflow automation tools like Azure Functions or Logic Apps can automate the detection and masking process, balancing latency against privacy needs.
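To make the idea of centralized masking concrete, here is a minimal sketch of a masking step that a gateway could run before forwarding a prompt. It uses two simple regular expressions as a stand-in for a real detection service; it does not call the Azure Text Analytics API, and the patterns and the `mask_pii` name are illustrative assumptions that would miss many real-world PII formats.

```python
import re

# Deliberately simple patterns for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(mask_pii("Contact jane.doe@example.com or +1 (555) 010-9999 today"))
```

Running the masking inline adds latency to every request, which is the trade-off against privacy that the paragraph above describes.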

Call To Action: Ensure your Azure OpenAI deployment aligns with the security and compliance guidance in the Azure Well-Architected Framework.

Primary Resource: Azure Well-Architected Framework perspective on Azure OpenAI Service

Additional Resources:

Security and Data Integrity | Microsoft Learn

 

Handling Peak Loads with Advanced Load Distribution Patterns

When managing peak loads with Azure OpenAI resources, it is critical to implement efficient patterns such as load balancing, circuit breakers, and the Token Limit policy, which align with Azure's architectural guidance. These strategies help manage traffic without over-provisioning PTUs, so that service quality is maintained even during high-traffic periods. The recent GenAI Gateway announcements have further simplified implementing APIM policies while adding new functionality.

Azure OpenAI Token Limit policy allows you to manage and enforce limits per API consumer based on the usage of Azure OpenAI tokens. With this policy you can set limits, expressed in tokens-per-minute (TPM). 

This policy provides the flexibility to assign token-based limits on any counter key, such as Subscription Key, IP Address, or any other arbitrary key defined through a policy expression. The Azure OpenAI Token Limit policy also enables pre-calculation of prompt tokens on the Azure API Management side, minimizing unnecessary requests to the Azure OpenAI backend if the prompt already exceeds the limit.
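The behavior described above can be sketched as a small per-key limiter. This is a simplified fixed-window model written for illustration, not the APIM policy's actual implementation: the `TokenLimiter` class, its parameters, and the fixed 60-second window are assumptions, and it only counts the token estimate the caller passes in.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple


class TokenLimiter:
    """Fixed-window tokens-per-minute limiter keyed by an arbitrary counter key."""

    def __init__(self, tokens_per_minute: int, clock: Callable[[], float] = time.monotonic):
        self.tokens_per_minute = tokens_per_minute
        self.clock = clock
        # counter key -> (window index, tokens used in that window)
        self._usage: Dict[str, Tuple[int, int]] = defaultdict(lambda: (0, 0))

    def allow(self, counter_key: str, estimated_prompt_tokens: int) -> bool:
        """Return True and record usage if the request fits in this minute's budget."""
        window = int(self.clock() // 60)
        used_window, used = self._usage[counter_key]
        if used_window != window:
            used = 0  # a new minute has started, so the budget resets
        if used + estimated_prompt_tokens > self.tokens_per_minute:
            # Reject before the request ever reaches the backend.
            self._usage[counter_key] = (window, used)
            return False
        self._usage[counter_key] = (window, used + estimated_prompt_tokens)
        return True


limiter = TokenLimiter(tokens_per_minute=1_000)
print(limiter.allow("subscription-a", 600))  # fits within the budget
print(limiter.allow("subscription-a", 500))  # would exceed 1,000 tokens this minute
print(limiter.allow("subscription-b", 500))  # separate key, separate budget
```

The counter key is just a string here, which mirrors the idea that a subscription key, an IP address, or any computed value can serve as the unit of limiting.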

Load Balancer and Circuit Breaker features allow you to spread the load across multiple Azure OpenAI endpoints. With support for round-robin, weighted (new), and priority-based (new) load balancing, you can now define your own load distribution strategy according to your specific requirements.  

Implementation Considerations: Define priorities within the load balancer configuration to ensure optimal utilization of specific Azure OpenAI endpoints, particularly those purchased as PTUs. In the event of any disruption, a circuit breaker mechanism kicks in, transitioning to lower-priority instances based on predefined rules. The updated circuit breaker now features dynamic trip duration, using the value of the retry-after header provided by the backend. This allows precise and timely recovery of the backends, maximizing utilization of your priority backends.
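A minimal sketch of priority-based routing with a retry-after-driven circuit breaker is shown below. It models the pattern in plain Python rather than APIM configuration; the `Backend` and `PriorityRouter` names, the backend labels, and the injectable clock are assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Backend:
    name: str
    priority: int  # lower number = preferred (for example, 1 for the PTU deployment)
    tripped_until: float = 0.0  # clock time until which the circuit stays open


class PriorityRouter:
    """Pick the highest-priority backend whose circuit is currently closed."""

    def __init__(self, backends: List[Backend], clock: Callable[[], float] = time.monotonic):
        self.backends = sorted(backends, key=lambda b: b.priority)
        self.clock = clock

    def pick(self) -> Optional[Backend]:
        now = self.clock()
        for backend in self.backends:
            if backend.tripped_until <= now:
                return backend
        return None  # every backend is tripped

    def report_throttled(self, backend: Backend, retry_after_seconds: float) -> None:
        """Open the circuit for exactly as long as the backend asked us to wait."""
        backend.tripped_until = self.clock() + retry_after_seconds


ptu = Backend("ptu-deployment", priority=1)
payg = Backend("payg-deployment", priority=2)
router = PriorityRouter([payg, ptu])

print(router.pick().name)         # the PTU backend is preferred
router.report_throttled(ptu, 30)  # the PTU backend returned 429 with retry-after: 30
print(router.pick().name)         # traffic moves to the PAYG backend
```

Because the trip duration comes from the backend's own retry-after value, traffic returns to the PTU deployment as soon as it reports capacity again instead of waiting out a fixed timeout.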

Call to Action: Evaluate your application's peak load behaviors to determine the most suitable strategy for ensuring continuous critical operation with consistent performance.

Primary Resources: Implementing a Gen AI Gateway

Additional Resources:

GenAI Gateway Accelerator

AI-hub solution accelerator

GenAI Gateway Capabilities in APIM

OpenAI PTU – Handling High Utilization

 

Monitoring and Optimization

 

Setting Up Azure Monitoring Workbooks

Effective monitoring of your Azure OpenAI PTU deployments can be significantly enhanced by utilizing Azure's comprehensive suite of monitoring tools. Here's a streamlined approach to ensure you have complete visibility and control over your deployment's performance:

Azure OpenAI Metrics Dashboards: Start with the out-of-box dashboards provided by Azure OpenAI in the Azure portal. These dashboards display key metrics such as HTTP requests, tokens-based usage, PTU utilization, and fine-tuning activities, offering a quick snapshot of your deployment's health and performance.

Analyze Metrics: Utilize Azure Monitor metrics explorer to delve into essential metrics captured by default:

  • Azure OpenAI Requests: Tracks the total number of API calls split by Status Code.
  • Generated Completion Tokens and Processed Inference Tokens: Monitors token usage, which is crucial for managing capacity and operational costs.
  • Provisioned-managed Utilization V2: Provides insights into utilization percentages, helping prevent overuse and ensuring efficient resource allocation.
  • Time to Response: Time taken for the first response to appear after a user sends a prompt.

Monitor both maximum and average values at several granularities, from per-minute to per-day, because averages over long windows can hide short spikes.
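The effect of granularity on average and maximum values can be seen in a small sketch. The `rollup` helper and the sample utilization figures are hypothetical; in practice the aggregation would be done by Azure Monitor rather than in application code.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def rollup(samples: Sequence[float], bucket_size: int) -> List[Tuple[float, float]]:
    """Group consecutive per-minute samples into buckets and return (avg, max) per bucket."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    buckets = [samples[i:i + bucket_size] for i in range(0, len(samples), bucket_size)]
    return [(mean(bucket), max(bucket)) for bucket in buckets]


# Hypothetical per-minute utilization percentages with one spike to 100%.
utilization = [40.0, 45.0, 50.0, 100.0, 42.0, 41.0]

print(rollup(utilization, 3))  # two 3-minute buckets
print(rollup(utilization, 6))  # one 6-minute bucket: the average hides the spike, the max does not
```

Over the full six minutes the average stays near 53%, while the maximum still shows that the deployment reached 100% in one of those minutes.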

Diagnostic Settings and Log Analytics: Configure diagnostic settings to collect detailed logs and metrics, routing them to Azure Monitor Logs. Use Azure Log Analytics for an in-depth examination of these logs, allowing for complex analyses and the creation of customized visualizations. This integration is key for diagnosing specific issues and gaining deeper operational insights.

Alerts Setup: Set up alerts in Azure Monitor to receive proactive notifications about conditions that could impact your deployment. These alerts can be instrumental in addressing potential issues swiftly, maintaining system performance and availability.

Azure Monitor Workbooks: Employ Azure Monitor Workbooks to create custom interactive reports and visualizations. Tailor these reports to meet your specific monitoring needs, and share them with your team for collaborative analysis and decision-making.

Call To Action: Enhance your monitoring capabilities for Azure OpenAI PTU deployments by implementing these integrated Azure monitoring tools. Start with viewing metrics on the dashboards, set up diagnostic settings to collect detailed data, and use Azure Monitor Workbooks and Log Analytics for comprehensive analysis and proactive management. This approach ensures that your deployments are monitored effectively, supporting optimal performance and reliability.

Primary Resources: Monitoring Azure OpenAI Service – Azure AI services

Additional Resources:

Monitoring your Azure OpenAI usage (part 2) – Stefano Demiliani 

 

Calculating Usage-Based Chargebacks

To calculate usage-based chargebacks for Provisioned Throughput Units (PTUs) when sharing an Azure OpenAI instance across multiple business units, it is essential to monitor and log token consumption accurately. Incorporate the “azure-openai-emit-token-metric” policy in Azure API Management to emit token consumption metrics directly into Application Insights. This policy facilitates tracking various token metrics such as Total Tokens, Prompt Tokens, and Completion Tokens, allowing for a thorough analysis of service utilization. Configure the policy with specific dimensions such as User ID, Client IP, and API ID to enhance granularity in reporting and insights. By implementing these strategies, organizations can ensure transparent and fair chargebacks based on actual usage, fostering accountability and optimized resource allocation across different business units.
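Once token counts per business unit are available, the chargeback itself is a proportional split. The sketch below assumes the totals have already been exported from Application Insights; the `allocate_chargeback` function, the unit names, the cost figure, and the even split when no usage is recorded are all illustrative assumptions.

```python
from typing import Dict


def allocate_chargeback(total_cost: float, tokens_by_unit: Dict[str, int]) -> Dict[str, float]:
    """Split a fixed PTU cost across business units in proportion to tokens consumed."""
    if not tokens_by_unit:
        return {}
    total_tokens = sum(tokens_by_unit.values())
    if total_tokens == 0:
        # Assumption: with no recorded usage, the fixed cost is shared evenly.
        even_share = total_cost / len(tokens_by_unit)
        return {unit: even_share for unit in tokens_by_unit}
    return {
        unit: total_cost * tokens / total_tokens
        for unit, tokens in tokens_by_unit.items()
    }


# Hypothetical monthly totals per business unit (prompt + completion tokens).
usage = {"retail": 750_000, "finance": 250_000}
print(allocate_chargeback(1_000.0, usage))
```

Whether to weight prompt and completion tokens differently, or to charge a base fee for reserved capacity, is a policy decision that this sketch leaves out.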

Call To Action: Implement robust chargeback calculations using Azure API Management policies, and use out-of-the-box or custom Application Insights dashboards for clear visibility into PTU usage across your organization.

Primary Resource: Azure API Management policy reference – azure-openai-emit-token-metric

Additional Resources:

Azure OpenAI Emit Token Metric policy in Azure API Management

Implementing Cost-Efficient Solutions

To optimize cost efficiency in managing Provisioned Throughput Units (PTUs), organizations should consider a multi-faceted approach:

Semantic Caching: Utilize the Azure OpenAI Semantic Caching policy to reduce token usage by storing and reusing responses for semantically similar prompts through Azure Redis Enterprise or compatible caches. This method enhances response times and reduces costs.

Batch Workloads in Lean Times: Schedule batch processing tasks during periods of low PTU utilization to maximize the use of pre-purchased PTUs, ensuring efficient resource usage throughout the day.

Spillover Pattern: Adopt a spillover strategy to manage excess traffic by routing it to Pay-As-You-Go (PAYG) endpoints during peak times. This allows for maintaining service quality with minimal PTU over-provisioning and significant cost savings.
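The semantic caching approach listed above can be illustrated with a small in-memory sketch. It assumes prompt embeddings are computed elsewhere and passed in as plain vectors; the `SemanticCache` class, the 0.9 similarity threshold, and the toy two-dimensional vectors are assumptions, and a real deployment would use the APIM policy with a Redis-compatible cache instead.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored response when a new prompt embedding is close enough to a cached one."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def store(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def lookup(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score >= self.threshold and score > best_score:
                best_score, best_response = score, response
        return best_response


cache = SemanticCache(threshold=0.9)
cache.store([1.0, 0.0], "cached answer")
print(cache.lookup([0.99, 0.05]))  # similar prompt: served from the cache, no tokens consumed
print(cache.lookup([0.0, 1.0]))    # unrelated prompt: cache miss, the model would be called
```

The threshold controls the trade-off: a lower value saves more tokens but raises the chance of returning an answer to a question that was only superficially similar.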

Call to Action: Assess your current PTU management strategies and consider integrating these approaches to enhance efficiency and reduce costs. Utilize tools like the Azure OpenAI Sizing Tool to evaluate and optimize your PTU allocations based on real usage patterns, ensuring a balanced and cost-effective deployment.

Primary Resources:

Support and Escalation

Managing Service Parameters with AOAI Deployments                                                                        

Effective management of Service Parameters with Azure OpenAI PTU deployments hinges on meticulous capacity monitoring and handling overcapacity intelligently. Azure uses a variation of the leaky bucket algorithm to manage bursts and prevent utilization from exceeding 100%. This strategy helps balance high utilization against the risk of over-utilization. After processing a request, Azure adjusts the utilization measure based on the actual compute cost, ensuring accurate capacity monitoring.
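For readers unfamiliar with the leaky bucket idea, here is a generic sketch of the technique, including the post-request adjustment to actual cost. It is not Azure's implementation: the `LeakyBucket` class, the cost units, the drain rate, and the rule of admitting requests while the level is below capacity are all assumptions chosen to mirror the description above.

```python
class LeakyBucket:
    """Generic leaky bucket: requests add cost to the bucket, which drains at a steady rate."""

    def __init__(self, capacity: float, leak_per_second: float):
        self.capacity = capacity
        self.leak_per_second = leak_per_second
        self.level = 0.0
        self.last_update = 0.0

    def _leak(self, now: float) -> None:
        elapsed = max(0.0, now - self.last_update)
        self.level = max(0.0, self.level - elapsed * self.leak_per_second)
        self.last_update = now

    @property
    def utilization(self) -> float:
        return self.level / self.capacity

    def try_admit(self, now: float, estimated_cost: float) -> bool:
        """Admit the request while utilization is below 100%; otherwise signal a 429."""
        self._leak(now)
        if self.level >= self.capacity:
            return False
        self.level += estimated_cost
        return True

    def settle(self, estimated_cost: float, actual_cost: float) -> None:
        """After the request completes, replace the estimate with the actual cost."""
        self.level = max(0.0, self.level + (actual_cost - estimated_cost))


bucket = LeakyBucket(capacity=100.0, leak_per_second=10.0)
print(bucket.try_admit(now=0.0, estimated_cost=60.0))  # admitted
print(bucket.try_admit(now=0.0, estimated_cost=60.0))  # admitted: the burst pushes the level past capacity
print(bucket.try_admit(now=0.0, estimated_cost=10.0))  # rejected until the bucket drains
```

The `settle` step is what keeps the measure honest: a request that reserved a large estimate but generated few tokens gives capacity back as soon as it finishes.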

Here's a concise guide to maintaining your service parameters efficiently:

Key Metrics and Monitoring

  • Provisioned-Managed Utilization V2: This Azure Monitor metric is crucial for real-time monitoring, measuring your deployment's utilization in 1-minute increments. Keeping utilization below 100% helps ensure that accepted calls are processed with consistent timing.

Handling Capacity Limits

  • 429 Responses: Not an error but an indicator of full capacity usage at a given time. Managing these involves two strategic approaches:
      • Traffic Redirection: Redirect excess traffic to alternative models or deployments to maintain service availability.
      • Retry Logic: Implement client-side retry mechanisms where longer latencies are acceptable, thereby maximizing throughput.

Managing Concurrent Calls

  • Understand the impact of prompt size and max_token settings on the number of manageable concurrent calls. The system continues to accept requests until reaching 100% utilization, guided by the settings you choose.

Call To Action: Regularly monitor the Provisioned-Managed Utilization V2 metric to optimize service parameters. Respond promptly to 429 responses by redirecting traffic or employing retry strategies. Adjust max_token values and other settings to balance concurrency against utilization, so that your deployments stay within their service parameters.

Primary Resource: Understanding Rate Limits

 

Handling Throttling Errors          

When encountering throttling errors with Provisioned Throughput Units (PTUs), it is important to understand and follow best practices to avoid rate limiting issues. Throttling typically occurs when the request rate exceeds the quota or the allocated PTUs. To handle throttling effectively:

  1. Gradually increase workloads to avoid sudden spikes in demand.
  2. Batch requests where appropriate, combining multiple tasks into a single call, especially if there's headroom on tokens per minute but limits are being hit on requests per minute.
  3. Implement retry logic with exponential backoff strategies for handling 429 (Too Many Requests) errors, allowing subsequent attempts at spaced-out intervals.
  4. Monitor your application's usage patterns and adjust PTU allocation accordingly using Azure AI Studio's quota management features.
  5. Utilize Azure API Management (APIM) to implement policies for queuing, rate-throttling, and error handling.

By adhering to these guidelines and proactively managing PTU allocations, you can minimize the impact of throttling on your applications' performance.
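The retry guidance in step 3 can be sketched as a small client-side helper. The example below is transport-agnostic: `send` is a hypothetical callable that returns a status code, an optional retry-after value in seconds, and a body, so no particular SDK or HTTP library is assumed. The delay values are illustrative defaults.

```python
import random
import time
from typing import Callable, Optional, Tuple

# (status_code, retry_after_seconds or None, body)
Response = Tuple[int, Optional[float], str]


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry on HTTP 429 with exponential backoff, honoring retry-after when it is longer."""
    for attempt in range(max_attempts):
        response = send()
        status, retry_after, _body = response
        if status != 429 or attempt == max_attempts - 1:
            return response  # success, a non-throttling error, or attempts exhausted
        backoff = min(max_delay, base_delay * 2 ** attempt)
        jittered = backoff + random.uniform(0.0, base_delay)  # spread out concurrent clients
        sleep(max(jittered, retry_after or 0.0))
    raise ValueError("max_attempts must be at least 1")


# Simulated backend: throttled twice, then succeeds.
replies = iter([(429, None, ""), (429, 5.0, ""), (200, None, "ok")])
waits = []
print(call_with_backoff(lambda: next(replies), sleep=waits.append))
print(len(waits), "waits before success")
```

Retries of this kind suit workloads that tolerate added latency; latency-sensitive traffic is usually better served by redirecting to another deployment.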

Call To Action: Review and apply the outlined strategies to mitigate throttling errors in your deployment, ensuring optimal performance of your Azure OpenAI services. If persistent issues occur, consider consulting with Azure technical support for further assistance.

Primary Resource: Optimizing Azure Open AI – A guide to limits & quotas

Additional Links for PTU Deep Dive:

  1. Azure OpenAI Landing Zone reference architecture
  2. Baseline architecture for Enterprise Chat Applications
  3. Azure Workbooks – Monitoring OpenAI Workloads with Confidence
  4. Azure OpenAI Using PTUs/PayGos with APIM – Using the Scaling Special Sauce
  5. Smart load balancing for OpenAI endpoints and Azure API Management
  6. Use APIM to configure retry and fall back to another instance based on HTTP status

 

This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.