Azure OpenAI Architecture Patterns and Implementation Steps


A comprehensive overview of the most frequently used and discussed architecture patterns among our customers in various domains.

1) AOAI with Azure Front Door for load balancing

  • Use Azure Front Door for cross-region, global load balancing of requests across multiple Azure OpenAI endpoints.
  • In the architecture below, Azure Front Door routes requests to multiple instances of Azure OpenAI hosted in multiple regions.
    AFD uses a health check on the path /status-0123456789abcdef to determine the health and proximity of each Azure OpenAI endpoint.
  • The deployment name should be the same across endpoints when load balancing, since the deployment name is part of the URL path.

Architecture diagram:


Key Highlights:

  • Global load balancing across multiple Azure OpenAI endpoints in multiple regions with intelligent health probe monitoring.
  • AFD provides scale-out and improved performance for your AOAI endpoints using Microsoft's global cloud CDN and WAN.
  • Unified static and dynamic delivery offered in a single tier of AFD to accelerate and scale through caching, offload, and layer 3-4 DDoS protection.
  • Protection against OWASP Top 10 attacks, Common Vulnerabilities and Exposures (CVEs), and malicious bot attacks through the AFD WAF.
  • Define your own custom domain with AFD; AFD provides automatic rotation of managed certificates.
  • As of today, AFD cannot connect to your AOAI origin using Private Link.

If you set equal weights for all origins and a high latency sensitivity in Azure Front Door, it considers every origin whose latency falls within the specified range of the fastest origin as eligible for routing traffic. As a result, all origins should receive approximately equal amounts of traffic, provided their latencies are within the specified range.

However, it's important to note that this doesn't guarantee a perfect round-robin distribution. The actual distribution can vary based on factors such as network conditions and changes in latency. If you need strict round-robin load balancing, you might need to consider other services or features that specifically support this method.
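The eligibility rule described above can be sketched in a few lines of Python. This is an illustrative simplification, not AFD's actual algorithm, and the origin names and latencies are made up:

```python
# Sketch: latency-sensitivity eligibility in Azure Front Door (simplified model).
# Origins whose measured latency is within sensitivity_ms of the fastest origin
# are eligible; traffic is then spread across the eligible set according to weights.

def eligible_origins(latencies_ms, sensitivity_ms):
    """Return the set of origins within sensitivity_ms of the fastest origin."""
    fastest = min(latencies_ms.values())
    return {name for name, lat in latencies_ms.items() if lat - fastest <= sensitivity_ms}

latencies = {"eastus": 20, "westeurope": 35, "australiaeast": 140}
print(sorted(eligible_origins(latencies, 50)))   # only the two close origins qualify
print(sorted(eligible_origins(latencies, 500)))  # a loose sensitivity admits all three
```

With equal weights, traffic is then split roughly evenly among whatever this eligible set contains, which is why the distribution approximates, but does not guarantee, round robin.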


Use Postman for testing:

Request 1:


Request 2:
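The Postman requests can also be reproduced from code. The sketch below uses a hypothetical Front Door hostname and deployment name; it only builds the request URL, which illustrates the earlier point that the deployment name is part of the path and must therefore match across load-balanced endpoints:

```python
# Sketch of the request the Postman screenshots exercise (hypothetical endpoint,
# deployment name, and api-version). AFD rewrites only the host, not the path,
# so every origin must expose the same deployment name.

def chat_completions_url(endpoint, deployment, api_version="2024-02-01"):
    """Build the Azure OpenAI chat completions URL for a given deployment."""
    return (f"{endpoint}/openai/deployments/{deployment}"
            f"/chat/completions?api-version={api_version}")

# The client calls the Front Door hostname rather than a regional AOAI endpoint.
print(chat_completions_url("https://contoso-afd.azurefd.net", "gpt-4o"))
```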


For a strict round-robin distribution, you can use Azure Application Gateway with the same health check endpoints.

2) AOAI with APIM

Architecture diagram:


Key highlights:

  • You can use APIM to manage the access, usage, and billing of your Azure OpenAI APIs, and apply policies such as caching, rate limiting, and transformation.
  • You can monitor and analyze the performance and health of your Azure OpenAI APIs, and diagnose any issues, using APIM's built-in tools and integrations with Azure Monitor and Application Insights.
  • You can publish your Azure OpenAI APIs to a developer portal, where you can provide documentation, samples, and interactive testing for your consumers.
  • You can use APIM to create composite APIs that can orchestrate multiple Azure OpenAI models or integrate with other Azure services and external APIs.

a) Round Robin load balancing with Retry logic

        <retry condition="@(context.Response.StatusCode >= 500 || context.Response.StatusCode >= 400)" count="6" interval="10" first-fast-retry="true">
            <choose>
                <when condition="@((context.Response.StatusCode >= 500 || context.Response.StatusCode >= 400) && (int.Parse((string)context.Variables["backend-counter"])) == 0)">
                    <!-- backend 0 failed: switch to the second backend and update backend-counter (details elided in the source) -->
                </when>
                <when condition="@((context.Response.StatusCode >= 500 || context.Response.StatusCode >= 400) && (int.Parse((string)context.Variables["backend-counter"])) == 1)">
                    <!-- backend 1 failed: switch back to the first backend and update backend-counter (details elided in the source) -->
                </when>
            </choose>
        </retry>

Testing round-robin load balancing using APIM:
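Before testing against live endpoints, the failover behavior the policy implements can be simulated locally. This is an illustrative Python model with hypothetical backend hostnames; in APIM the switch is performed by policies such as set-backend-service, not by application code:

```python
# Minimal simulation of the backend-counter failover logic in the retry policy:
# on a 4xx/5xx response, advance the counter and route to the other backend;
# on success, keep using the current backend. Hostnames are hypothetical.

BACKENDS = [
    "https://aoai-eastus.openai.azure.com",
    "https://aoai-westeurope.openai.azure.com",
]

def next_backend(counter, status_code):
    """Return the updated counter and the backend to use for the next attempt."""
    if status_code >= 400:
        counter = (counter + 1) % len(BACKENDS)
    return counter, BACKENDS[counter]

counter = 0
counter, backend = next_backend(counter, 503)  # failure: fail over to backend 1
print(backend)
counter, backend = next_backend(counter, 200)  # success: stay on backend 1
print(backend)
```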


b) Managed Identity from APIM to Azure OpenAI

Step 1 – Enable Managed Identity in APIM


Step 2 – Provide necessary RBAC:

In the IAM of the Azure OpenAI service, add the Cognitive Services OpenAI User role for the APIM Managed Identity (the Managed Identity will have the same name as the APIM instance).


Step 3 – Add the Managed Identity policy in APIM:

Testing for Managed Identity Policy:


c) Policy to extract the callerID (Subject) in APIM

For extracting other details from the JWT, refer to:

Azure API Management policy expressions | Microsoft Learn
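As an illustration of what the caller-ID extraction does, the Python sketch below decodes a JWT payload and reads its sub (Subject) claim. In APIM this is done with policy expressions rather than application code, and the token built here is a throwaway example with no real signature:

```python
# Decode the (unverified) payload segment of a JWT and read its "sub" claim,
# which identifies the caller. Signature validation is deliberately omitted;
# this only demonstrates the claim extraction step.
import base64
import json

def jwt_subject(token):
    """Return the 'sub' claim from a JWT's payload segment."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["sub"]

# Build a throwaway token so the example is self-contained (header/signature faked).
claims = base64.urlsafe_b64encode(json.dumps({"sub": "caller-123"}).encode()).decode().rstrip("=")
token = f"header.{claims}.signature"
print(jwt_subject(token))  # caller-123
```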



d) Logging and Monitoring using APIM:

Use Azure Monitor and APIM to enable enhanced logging and monitoring of the published AOAI APIs. Learn more – Tutorial – Monitor published APIs in Azure API Management | Microsoft Learn


Sample log queries for prompt completion:

 | extend model = tostring(parse_json(BackendResponseBody)['model'])
 | extend prompttokens = parse_json(parse_json(BackendResponseBody)['usage'])['prompt_tokens']
 | extend completiontokens = parse_json(parse_json(BackendResponseBody)['usage'])['completion_tokens']
 | extend responsetext = (parse_json(parse_json(BackendResponseBody)['choices'])[0]['message'])
 | extend prompttext = (parse_json(RequestBody)['messages'])
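The same fields the queries above pull out of BackendResponseBody can be checked locally against a sample chat completions response. The sample body below is made up for illustration:

```python
# Extract the same fields as the KQL above (model, prompt/completion tokens,
# response text) from a sample logged response body.
import json

sample_body = json.dumps({
    "model": "gpt-4o",
    "usage": {"prompt_tokens": 12, "completion_tokens": 34},
    "choices": [{"message": {"role": "assistant", "content": "Hello!"}}],
})

body = json.loads(sample_body)
print(body["model"])                             # model
print(body["usage"]["prompt_tokens"])            # prompttokens
print(body["usage"]["completion_tokens"])        # completiontokens
print(body["choices"][0]["message"]["content"])  # responsetext
```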

For more queries refer to documentation here: Implement logging and monitoring for Azure OpenAI large language models – Azure Architecture Center …

e) For advanced logging, more than 8192 bytes refer to the documentation here: openai-python-enterprise-logging/advanced-logging at main · Azure-Samples/openai-python-enterprise-l…

f) For budgets and cost management using APIM, refer to this blog: Azure Budgets and Azure OpenAI Cost Management – Microsoft Community Hub

3) AOAI with Front Door and APIM multi-region deployment for full-fledged multi-region availability

Refer to the DR documentation – Deploy Azure API Management instance to multiple Azure regions – Azure API Management | Microsoft Le…


a. In Front Door, add both APIM regional URLs as backend origins.

b. Configure the API Management regional status endpoints for health probes.

c. Use a policy to make each regional gateway route to its respective backend.
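The routing idea for the regional gateways can be sketched as a simple region-to-backend mapping. The hostnames below are hypothetical, and in APIM the mapping would live in policy (using the gateway's own region), not in application code:

```python
# Sketch: each APIM regional gateway routes to the AOAI backend in its own
# region, so failover at the Front Door layer keeps traffic region-local.
# Region names and backend hostnames are illustrative assumptions.

REGIONAL_BACKENDS = {
    "East US": "https://aoai-eastus.openai.azure.com",
    "West Europe": "https://aoai-westeurope.openai.azure.com",
}

def backend_for_region(gateway_region):
    """Map the gateway's deployed region to its regional AOAI backend."""
    return REGIONAL_BACKENDS[gateway_region]

print(backend_for_region("East US"))
```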


In conclusion, this article is a starting point for implementing scalable architecture patterns using Azure OpenAI models with other Azure services. As we continue to explore the potential of Azure OpenAI, we'll keep updating our patterns and documents, guiding us toward smarter and more efficient systems.


This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.