Leveraging NVIDIA Triton Inference Server and Azure AI for Enhanced Inference Efficiency

This blog is co-authored by NVIDIA. 

Inference is where training goes to work for enterprises, delivering the most visible return on investment. Inference is also a recurring cost for many enterprises, making it a top CIO priority. IDC predicts that by 2027, spending on accelerated servers in the cloud for inferencing will be more than three times on-premises spending.  

NVIDIA shares a long history of collaboration with Microsoft. Azure was the first cloud to host the NVIDIA AI Enterprise software platform, and today offers one of the broadest selections of NVIDIA AI-powered and -optimized virtual machines. These encompass over five generations of NVIDIA GPUs, including the latest additions, the NVIDIA H100 and H200 Tensor Core GPUs.  

The collaboration between the two companies extends beyond infrastructure into cloud MLOps. Specifically, NVIDIA Triton Inference Server is seamlessly integrated into Azure Machine Learning to help enterprises reduce the complexity of model-serving infrastructure, shorten the time needed to deploy new AI models in production, and increase AI inferencing and prediction capacity. Today, the Azure Machine Learning platform offers NVIDIA Triton as a fully managed inference endpoint, eliminating the need for enterprise customers to download the container or write any manual code to deploy the server into production.  
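As a sketch of what this no-code path can look like, the following Azure Machine Learning CLI (v2) YAML defines a managed online deployment for a model in Triton's repository format. The endpoint name, model name and path, and VM size below are illustrative placeholders, not values from this article.

```yaml
# deployment.yml -- illustrative Azure ML managed online deployment for a
# Triton-format model; all names and sizes are placeholder assumptions.
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-triton-endpoint
model:
  name: my-triton-model
  version: 1
  path: ./models        # Triton model repository layout: <model>/<version>/model file
  type: triton_model    # tells Azure ML to serve the model with Triton, no scoring script
instance_type: Standard_NC6s_v3   # GPU VM size (assumption)
instance_count: 1
```

With a configuration along these lines, a command such as `az ml online-deployment create -f deployment.yml` stands Triton up behind the managed endpoint without downloading the Triton container or writing server code by hand.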

NVIDIA Triton Inference Server Meets Copilot for Microsoft 365 

Building on this rich history of engineering collaboration, we're thrilled to share that NVIDIA GPUs and NVIDIA Triton Inference Server now help serve AI inference in Copilot for Microsoft 365. Soon available as a dedicated physical keyboard key on Windows PCs, Copilot for Microsoft 365 combines the power of large language models with proprietary enterprise data to deliver real-time contextualized intelligence, enabling users to enhance their creativity, productivity, and skills.  


Caption: Benchmark showing performance of serving Copilot semantic index models on NVIDIA Triton with ONNX runtime 1.14.1 vs. custom HTTP server using ONNX runtime 1.12.1. Source: Microsoft 

Get started 

NVIDIA Triton Inference Server is seamlessly integrated into Azure Machine Learning managed endpoints as a production release branch, which receives monthly patches and bug fixes over a nine-month lifespan and can be deployed without writing manual code. It is also available as a free, open-source project on GitHub and can be downloaded directly from NVIDIA NGC, a portal of enterprise services, software, management tools, and support for end-to-end AI and digital twin workflows. NVIDIA Triton Inference Server is also included with NVIDIA AI Enterprise, a software platform for security, API stability, and enterprise support, available on the Azure Marketplace.  
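Once a Triton model is deployed, clients call it over Triton's KServe v2 HTTP/REST protocol. The sketch below builds a v2 inference request body with only the Python standard library; the endpoint URL, model name, tensor name, and shapes are hypothetical placeholders, not values from this article.

```python
"""Minimal sketch of calling a Triton-served model over the KServe v2
HTTP/REST protocol. All endpoint details here are illustrative."""
import json
import urllib.request


def build_v2_infer_request(input_name, shape, datatype, data):
    """Build a KServe v2 inference request body in the form Triton expects."""
    return {
        "inputs": [
            {"name": input_name, "shape": shape, "datatype": datatype, "data": data}
        ]
    }


# A single 1x3 FP32 input tensor (placeholder name and values).
payload = build_v2_infer_request("input__0", [1, 3], "FP32", [[0.1, 0.2, 0.3]])
body = json.dumps(payload).encode("utf-8")

# Hypothetical scoring URI and key; real values come from the Azure ML
# endpoint's details (e.g. via the Azure CLI or studio). Uncomment to send.
# req = urllib.request.Request(
#     "https://my-triton-endpoint.westus2.inference.ml.azure.com"
#     "/v2/models/my_model/infer",
#     data=body,
#     headers={"Content-Type": "application/json",
#              "Authorization": "Bearer <endpoint-key>"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["outputs"])
```

The same request shape works against any Triton HTTP endpoint (local or managed); only the URL and authentication differ.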


This article was originally published on Microsoft's AI - Machine Learning Blog.