Smarter Azure OpenAI Usage

Many organizations are building intelligent applications on Azure OpenAI (AOAI) services. On the path to production, the same set of questions often comes up:

  • How many AOAI services should I have?
  • How do I monitor and log streaming quota usage?
  • How do I prioritize PTU-based AOAI and fall back to PAYG?
  • How do I round-robin between multiple AOAI servers?
  • How do I handle OpenAI rate-limiting errors?
  • How do I enforce local rate limiting to a cluster of services?
  • How do I enforce rate limiting to a backend AI service?
  • How do I present a group of AOAI services as a single endpoint, for a seamless shift to PTU?
  • How do I reduce risk by leveraging OpenAI and Azure OpenAI services but present a single endpoint to consumers?
  • How do I put a circuit breaker over an AI service that I've over-used, to fall back to others?

To help with some of these issues we can turn to services like API Management, Application Gateways, and Reverse Proxies. Each can provide a solution to a subset of the problems.

[Image: graemefoster_0-1705127934935.png]

However, there are complexities hidden within these boxes that become difficult to solve:

  • Prioritization and failover of groups of AOAI servers rely on custom code running in a Layer 7 load balancer.
  • Layer 7 load balancers lack real-time retry functionality and instead use asynchronous downstream health monitors.
  • Server-Sent Events support makes it difficult to log quota whilst maintaining a streaming endpoint.
  • Switching between Azure OpenAI, OpenAI, or other open-source LLMs requires manipulation of HTTP requests.
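To make the streaming point concrete, here is a minimal sketch of relaying a Server-Sent Events stream while collecting the generated text for quota logging. It is not AI Central's code: the names `StreamingUsageSketch`, `RelayAndCollect` and `EstimateTokens` are hypothetical, the input is assumed to be the `data: {...}` / `data: [DONE]` lines of a chat completions stream, and the four-characters-per-token estimate is only a crude stand-in for a real tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Json;

public static class StreamingUsageSketch
{
    // Passes every SSE line to `forward` unchanged and returns the text
    // collected from the "choices[].delta.content" fields along the way.
    public static string RelayAndCollect(IEnumerable<string> sseLines, Action<string> forward)
    {
        var completion = new StringBuilder();
        foreach (var line in sseLines)
        {
            forward(line); // the consumer still receives the stream as it arrives

            if (!line.StartsWith("data: ", StringComparison.Ordinal)) continue;
            var payload = line.Substring("data: ".Length);
            if (payload == "[DONE]") break; // end-of-stream marker

            using var doc = JsonDocument.Parse(payload);
            if (!doc.RootElement.TryGetProperty("choices", out var choices)) continue;
            if (choices.ValueKind != JsonValueKind.Array) continue;

            foreach (var choice in choices.EnumerateArray())
            {
                if (choice.TryGetProperty("delta", out var delta) &&
                    delta.ValueKind == JsonValueKind.Object &&
                    delta.TryGetProperty("content", out var content) &&
                    content.ValueKind == JsonValueKind.String)
                {
                    completion.Append(content.GetString());
                }
            }
        }
        return completion.ToString();
    }

    // Assumption: roughly four characters per token. A real tokenizer would replace this.
    public static int EstimateTokens(string text) => (text.Length + 3) / 4;
}
```

The design point is that nothing is buffered before forwarding: each line goes to the consumer first, and the usage figure is only worked out once the stream has finished.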

To help with these I have published a Reference Implementation of an intelligent AI Router, “AI Central”. AI Central lets you build configurable, extensible Pipelines allowing you to govern and observe access to your AI service.

AI Central is an extensible smart reverse proxy for Azure OpenAI and OpenAI services.

Out of the box it provides the following:

  • Consumer local rate limiting
  • Endpoint local rate limiting and circuit breakers
  • Randomized endpoint selection from a cluster of AI services
  • Prioritized endpoint selector, from a priority cluster to a fallback cluster
  • Bulkhead to hold and throttle load to a cluster of servers
  • Consumer Entra JWT auth (using Microsoft.Identity) with Role Authorisation
  • Consumer Entra JWT pass-thru
  • Client Key auth
  • Prompt / Token usage logging to Azure Monitor (including Streaming Endpoints)
  • OpenTelemetry metrics
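For readers unfamiliar with the circuit breaker idea in the list above, here is a minimal sketch of one that is local to a single endpoint. It illustrates the general pattern only and is not AI Central's implementation: the class name `EndpointCircuitBreaker`, the failure threshold and the open duration are all assumptions, and the class is not thread-safe.

```csharp
using System;

public sealed class EndpointCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openFor;
    private int _consecutiveFailures;
    private DateTimeOffset _openUntil = DateTimeOffset.MinValue;

    public EndpointCircuitBreaker(int failureThreshold, TimeSpan openFor)
    {
        _failureThreshold = failureThreshold;
        _openFor = openFor;
    }

    // True while the endpoint may be called; false while the breaker is open.
    public bool AllowsRequest(DateTimeOffset now) => now >= _openUntil;

    // A successful call resets the failure count.
    public void RecordSuccess() => _consecutiveFailures = 0;

    // After enough consecutive failures (for example 429 responses),
    // the endpoint is skipped until the open period has passed.
    public void RecordFailure(DateTimeOffset now)
    {
        _consecutiveFailures++;
        if (_consecutiveFailures >= _failureThreshold)
        {
            _openUntil = now + _openFor;
            _consecutiveFailures = 0;
        }
    }
}
```

An endpoint selector would call `AllowsRequest` before dispatching and skip to the next endpoint whenever it returns false.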

[Image: graemefoster_1-1705127934946.png]

Sample Scenarios

Here are some scenarios where AI Central might help you:

Scenario 1: PTU failover

  • A preferred PTU AOAI service, with a PAYG AOAI service as fallback
  • A group of applications that need to access AOAI services
  • A requirement for Prompt logging for audit and governance
  • Streaming quota logging for chargeback

AI Central can construct a pipeline to manage this for you:

[Image: graemefoster_2-1705127934952.png]

  • The pipeline listens on a host name, expecting Azure OpenAI-style requests.
  • The AAD check confirms that the client is permitted access to the pipelines.
  • The Prioritized endpoint selector is configured to prioritize a PTU server.
    • It dispatches the request with a backoff / retry policy and circuit breaker.
    • If it fails to receive a response, it falls back to the second group of PAYG servers.
    • If the response from AOAI is detected to be a streaming response, it will stream the results back to the Client, using a Tokenizer to estimate quota usage
  • Finally, the Logger asynchronously sends quota usage and prompt information to Azure Monitor.
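The prioritize-then-fall-back step above can be sketched as follows. This is an illustration under stated assumptions rather than AI Central's actual selector: `PrioritizedDispatchSketch` and the `send` delegate are hypothetical, endpoints are tried in list order, and a 429 or 5xx response (or a connection failure) is treated as the signal to move on. Backoff delays and circuit breakers are left out to keep it short.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class PrioritizedDispatchSketch
{
    // Tries every endpoint in the priority group (for example PTU),
    // then every endpoint in the fallback group (for example PAYG).
    public static async Task<HttpResponseMessage> SendAsync(
        IReadOnlyList<Uri> priorityGroup,
        IReadOnlyList<Uri> fallbackGroup,
        Func<Uri, Task<HttpResponseMessage>> send)
    {
        HttpResponseMessage? last = null;
        foreach (var group in new[] { priorityGroup, fallbackGroup })
        {
            foreach (var endpoint in group)
            {
                last?.Dispose(); // discard the previous failed response
                try
                {
                    last = await send(endpoint);
                }
                catch (HttpRequestException)
                {
                    last = null; // connection-level failure: try the next endpoint
                    continue;
                }

                if (!ShouldTryNext(last.StatusCode)) return last;
            }
        }

        // Every endpoint failed: return the last response, or 503 if none was received.
        return last ?? new HttpResponseMessage(HttpStatusCode.ServiceUnavailable);
    }

    // Assumption: rate limiting (429) and server errors (5xx) are worth retrying elsewhere.
    private static bool ShouldTryNext(HttpStatusCode status) =>
        status == HttpStatusCode.TooManyRequests || (int)status >= 500;
}
```

Passing the actual HTTP call in as a delegate keeps the selection logic independent of how each request is rewritten for its target endpoint.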

Scenario 2: Token based rate limiting of streaming consumers, to an AOAI server

  • Single PTU service with models shared across multiple consumers
  • Streaming quota logging for chargeback purposes
  • Fair-use policy by restricting token use by consumer

[Image: graemefoster_3-1705127934957.png]

The pipeline listens on a specific hostname

  • The AAD check confirms that the client is permitted access to the pipelines
  • The Token limit checks if the client (AAD identity) has reached their token limit
  • If not, the request is dispatched to an AOAI server
  • The AOAI response is re-streamed to the consumer
  • The return pathway logs the prompt, and updates the tokens consumed by the consumer

NB: Token counting does not use a distributed algorithm. It is local to an AI Central server. Consider this if running multiple AI Central endpoints behind a load balancer (for example in a PaaS like Azure Container Apps, Azure App Service, etc.).
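As a rough illustration of the check-then-update flow in this scenario, here is a minimal per-consumer token limiter. It is a sketch under assumptions, not AI Central's algorithm: the class name `ConsumerTokenLimiter` and the fixed-window approach are hypothetical choices. Its state lives in process memory, so each server instance keeps its own counts, which is the limitation the note above describes.

```csharp
using System;
using System.Collections.Generic;

public sealed class ConsumerTokenLimiter
{
    private readonly int _tokensPerWindow;
    private readonly TimeSpan _window;
    private readonly object _gate = new object();
    private readonly Dictionary<string, (DateTimeOffset WindowStart, int Used)> _usage =
        new Dictionary<string, (DateTimeOffset WindowStart, int Used)>();

    public ConsumerTokenLimiter(int tokensPerWindow, TimeSpan window)
    {
        _tokensPerWindow = tokensPerWindow;
        _window = window;
    }

    // Checked before dispatch: does this consumer have budget left in the current window?
    public bool HasBudget(string consumerId, DateTimeOffset now)
    {
        lock (_gate)
        {
            return Current(consumerId, now).Used < _tokensPerWindow;
        }
    }

    // Called on the return path, once the tokens used by the response are known.
    public void RecordUsage(string consumerId, int tokens, DateTimeOffset now)
    {
        lock (_gate)
        {
            var entry = Current(consumerId, now);
            _usage[consumerId] = (entry.WindowStart, entry.Used + tokens);
        }
    }

    // Returns the consumer's live window, or a fresh one if none exists or it has expired.
    private (DateTimeOffset WindowStart, int Used) Current(string consumerId, DateTimeOffset now)
    {
        if (_usage.TryGetValue(consumerId, out var entry) && now - entry.WindowStart < _window)
        {
            return entry;
        }
        return (now, 0);
    }
}
```

Because usage is only known after the response, a consumer can overshoot the limit by one request; the next request in the same window is then refused.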

Try it out

The easiest way to start is to install AI Central into your own .NET API from the NuGet packages.

#Create a new project and add the AICentral NuGet package
dotnet new web -o MyAICentral
cd MyAICentral
dotnet add package AICentral
#optional, for Azure Monitor logging (the Program.cs below references it): dotnet add package AICentral.Logging.AzureMonitor

#Program.cs
//Minimal API to configure AI Central
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddAICentral(
    builder.Configuration,
    additionalComponentAssemblies:
    [
        typeof(AzureMonitorLoggerFactory).Assembly //for Azure Monitor logging (needs the AICentral.Logging.AzureMonitor package)
    ]);

var app = builder.Build();

app.UseAICentral();

app.Run();

You'll need to add Configuration to define your pipelines.

The GitHub repository has some good examples: see https://github.com/microsoft/AICentral for a Quick Start, and https://github.com/microsoft/AICentral/blob/main/docs/configuration.md for some more complex examples.

Give it a go and let us know how you find it!

This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.