A Guide to Optimizing Performance and Saving Cost of your Machine Learning (ML) Service – Part 2

Return to Part 1: Introduction to ML Service

In this section, we will explore some of the options and best practices for deploying your ML model service on Azure, especially using Azure Machine Learning (AzureML).

We will cover how to choose the appropriate Azure SKU for your ML service, as well as some of the settings and limits of AzureML that you should be aware of.

Azure VM SKUs 

After optimizing the model and framework utilization, it's essential to save costs by selecting the best SKU. Selecting the correct SKU can help improve throughput and latency. However, we must be clear that the end goal is to reduce the cost of inference, not to get the best latency. For example, if SKU A can run 20% faster than SKU B but is 40% more expensive, SKU A may not be the best option.
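The SKU A versus SKU B comparison can be worked through as cost per request. This is a minimal sketch with made-up prices and throughputs, and it assumes that "20% faster" means 20% more requests per second on a fully utilized instance; the function name and all numbers are illustrative only.

```python
def cost_per_million_requests(price_per_hour: float, requests_per_second: float) -> float:
    """Cost of serving one million requests on a single, fully utilized instance."""
    requests_per_hour = requests_per_second * 3600
    return price_per_hour / requests_per_hour * 1_000_000


# Hypothetical numbers: SKU A is 40% more expensive and 20% faster than SKU B.
sku_b = cost_per_million_requests(price_per_hour=1.00, requests_per_second=100.0)
sku_a = cost_per_million_requests(price_per_hour=1.40, requests_per_second=120.0)

cost_ratio = sku_a / sku_b  # 1.4 / 1.2, roughly 1.17
print(f"SKU A costs {cost_ratio:.2f}x as much per request as SKU B")
```

Under these assumptions SKU A is about 17% more expensive per request, so it only makes sense if SKU B cannot meet the latency requirement.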

Understand VM Configurations 

First, we must understand the VM description. You must pay close attention to the following fields:

  1. CPU type; some VMs support several CPU types. Different CPUs support different instruction sets, some of which can speed up execution. For example:
    1. AVX-512, which is designed to boost vector processing performance, is available on Skylake and Cascade Lake Intel CPUs. Broadwell and Haswell CPUs, as well as AMD processors before Zen 4, do not support AVX-512.
    2. There is no hyperthreading on the HC-series, NDv2-series, HBv3-series, and so on. This can help if you have a high context-switch rate in the OS or a high L3 cache miss rate.
    3. High-frequency CPUs on the FX-series (4.0 GHz), HBv3-series (3.675 GHz), etc.
  2. Memory size; if you have a large model or several models, pay close attention to this field. 
  3. Disk: 
    1. The default OS disk size is 40GB. All models and code will be stored on this drive via AzureML. 
    2. An SSD is desirable if your code writes files regularly (which is not recommended).
  4. Network bandwidth: If the model request and/or response payload is substantial, you must pay close attention to bandwidth. Different VM SKUs may have varying bandwidth constraints. It should be noted that AzureML has its own limitations for Managed Endpoints. (Manage resources and quotas – Azure Machine Learning | Microsoft Learn)
  5. GPU: The N*-series are all GPU SKUs. They are using the most recent NVIDIA driver and CUDA version. Although all GPUs can handle inference workloads, some of them may not be cost-effective. Some GPUs, such as the T4 (NCasT4_v3-series), are tuned for inference execution using a specific framework (TensorRT). 
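To confirm which instruction sets a provisioned VM actually exposes, one option on Linux is to inspect the CPU flags. The sketch below parses text in the format of /proc/cpuinfo; the helper names are illustrative, and it assumes that the `avx512f` (AVX-512 Foundation) flag is a sufficient indicator for your workload.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the CPU feature flags listed in Linux /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx512(cpuinfo_text: str) -> bool:
    """Return True if the AVX-512 Foundation flag is present."""
    return "avx512f" in cpu_flags(cpuinfo_text)


# Shortened sample text; a real file lists many more flags per processor.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 avx512f avx512dq\n"
print(supports_avx512(sample))
```

On a running Linux VM or container, the same function can be called with the contents of /proc/cpuinfo, for example `supports_avx512(open("/proc/cpuinfo").read())`.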

Understanding VM Availability and Pricing

The price of different SKUs varies. Please refer to Virtual Machine series | Microsoft Azure for the most up-to-date VM SKU list, available regions, and pricing. Some SKUs may be available in certain regions but not others.

If the client of a model service is a web service hosted in Azure, we prefer to place the model service and the client service in the same region to minimize latency. If cross-region access can't be avoided, refer to the Azure documentation on network latency between Azure regions for more details.

In order to decide which SKU is the best fit, there are various profiling tools you could leverage. More on this in the later parts of this series.

Additionally, for the SKUs supported by Managed Online Endpoints, refer to Managed online endpoints VM SKU list – Azure Machine Learning | Microsoft Learn

AzureML Settings and Limits

The AzureML settings and limits related to model service throughput and latency fall into two categories: network stack settings and container settings.


Network Stack

Here is what the AzureML network stack request flow looks like: [Figure: inference-server-architecture.png]

Azure Machine Learning inference HTTP server – Azure Machine Learning | Microsoft Learn 

These limits are either hardcoded or related to your deployment, such as the number of cores. For the resource limits, refer to Manage resources and quotas – Azure Machine Learning | Microsoft Learn

There is one deployment configuration you need to pay attention to: max_concurrent_requests_per_instance.


If you are using AzureML container image or AzureML pre-built inference image, this number needs to be set the same as WORKER_COUNT (discussed below). If you are using an image built by yourself, then you need to set it to an appropriate number.

This setting defines the level of concurrency applied at load-balancing time. Usually, the higher this number, the higher the throughput. However, if this number is set higher than what the model and machine learning framework can handle, requests will wait in the queue, eventually leading to longer end-to-end latency.

If the number of requests per second is greater than (max_concurrent_requests_per_instance * number_of_instance), the client will receive an HTTP status code 429.
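Since max_concurrent_requests_per_instance counts concurrent requests, it can help to translate an expected request rate into an expected concurrency. One rough way to do that, assuming steady traffic, is Little's law (concurrency ≈ request rate × average latency). The sketch below is only a planning aid under that assumption, not an exact model of when AzureML returns 429; the function names and example numbers are illustrative.

```python
import math


def estimated_concurrency(requests_per_second: float, avg_latency_seconds: float) -> float:
    """Little's law: average number of requests in flight under steady traffic."""
    return requests_per_second * avg_latency_seconds


def concurrency_limit(max_concurrent_requests_per_instance: int, number_of_instance: int) -> int:
    """Total concurrency allowed across all instances of the deployment."""
    return max_concurrent_requests_per_instance * number_of_instance


def instances_needed(requests_per_second: float, avg_latency_seconds: float,
                     max_concurrent_requests_per_instance: int) -> int:
    """Smallest instance count whose limit covers the estimated concurrency."""
    in_flight = estimated_concurrency(requests_per_second, avg_latency_seconds)
    return max(1, math.ceil(in_flight / max_concurrent_requests_per_instance))


# Hypothetical load: 40 requests/s at 0.25 s average latency, about 10 requests in flight.
in_flight = estimated_concurrency(40, 0.25)
limit = concurrency_limit(max_concurrent_requests_per_instance=4, number_of_instance=2)
print(in_flight > limit)               # True: 10 in flight exceeds a limit of 8
print(instances_needed(40, 0.25, 4))   # 3
```

Real traffic is bursty, so it is worth leaving headroom above the average concurrency this estimate gives.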

For the default values of the request settings, refer to CLI (v2) managed online deployment YAML schema – Azure Machine Learning | Microsoft Learn


Bring your own container 

If you are using a Docker image built by yourself, please make sure it can accept environment variables to tune the setup. Then during deployment, make sure the environment variables are set properly. 

Here is an example: 

Assume “mymodelserver” can read an environment variable “MY_THREAD_COUNT” at runtime. Here is the relevant line of your Dockerfile:

ENTRYPOINT ["mymodelserver", "param1", "param2"]

At deployment time, you can set “MY_THREAD_COUNT” to an appropriate number to choose a different level of parallelism for each SKU.
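On the server side, reading such a variable could look like the sketch below. It assumes a Python-based server; “mymodelserver” and “MY_THREAD_COUNT” are the placeholder names from the example above, and the fallback of one thread is an arbitrary choice for illustration.

```python
import os


def read_thread_count(default: int = 1) -> int:
    """Read MY_THREAD_COUNT from the environment, falling back to a safe default."""
    raw = os.environ.get("MY_THREAD_COUNT", "")
    try:
        value = int(raw)
    except ValueError:
        # Variable is unset or not an integer: use the default.
        return default
    # Reject zero and negative values.
    return value if value >= 1 else default


thread_count = read_thread_count()
print(f"Starting model server with {thread_count} thread(s)")
```

Falling back to a default instead of failing keeps the same image usable on any SKU, at the cost of silently running with low parallelism if the variable is mistyped.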

AzureML container 

If you are using the AzureML container image or AzureML prebuilt inference image, then WORKER_COUNT is one of the most important environment variables you need to set properly.  

In AzureML-provided images, the Python HTTP server can run multiple worker processes to serve concurrent HTTP requests. Each worker process loads its own model instance and processes requests separately. WORKER_COUNT is an integer that defines how many worker processes to start; its default value is one (1). This means that if you do not set this environment variable to a proper number, the container will still process only one request at a time, even if you choose a SKU that has multiple CPU cores!

How to determine WORKER_COUNT?

This value is determined through an iterative process. You can use the following steps to determine the value of WORKER_COUNT:

  1. Determine the number of cores that can be used in the selected SKU.
    • For example, on the F32s v2 SKU, 32 cores can be utilized by the model.
  2. For each worker, determine the number of CPU cores the model execution actually needs.
    • You can get this number by profiling a single worker with different machine learning framework and library setups.
    • Note: Do not over-optimize for latency. As long as the latency meets the requirement, you can start with that result.

Then, WORKER_COUNT = floor(result_1 / result_2), where result_1 and result_2 are the numbers from steps 1 and 2. Make sure you reserve some cores and memory for system components on the same VM.
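The calculation can be sketched as follows. The function and parameter names are illustrative, and the number of cores to reserve for system components is an assumption you should tune from your own profiling.

```python
import math


def worker_count(usable_cores: int, cores_per_worker: float, reserved_cores: int = 0) -> int:
    """WORKER_COUNT = floor(available cores / cores needed per worker), at least 1."""
    if cores_per_worker <= 0:
        raise ValueError("cores_per_worker must be positive")
    available = usable_cores - reserved_cores
    return max(1, math.floor(available / cores_per_worker))


# F32s v2 has 32 cores; assume profiling shows each worker needs 4 cores.
print(worker_count(32, 4))                    # 8
# Same SKU, with 2 cores held back for system components (an assumed value).
print(worker_count(32, 4, reserved_cores=2))  # 7
```

Treat the result as a starting point: deploy with it, measure throughput and latency, and adjust.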


    This article was originally published by Microsoft's AI - Machine Learning Blog. You can find the original article here.