Cost-Optimized Hosting of Fine-tuned LLMs in Production


As organizations strive to leverage the power of LLMs on their own data, three prominent strategies have emerged: Retrieval-Augmented Generation (RAG), fine-tuning, and a combination of the two (hybrid). Although all of these approaches can tailor responses to an organization's own data, each presents distinct advantages and challenges.

This blog outlines the advantages and disadvantages of the RAG and fine-tuning methodologies for solving business use cases, and focuses on implementing fine-tuned models in a cost-optimized way across a few use-case scenarios.


Retrieval-Augmented Generation (RAG)  

Retrieval-Augmented Generation (RAG) is a methodology that combines the power of retrieval-based and generative systems to enhance the performance of Large Language Models (LLMs). This approach retrieves information from a large corpus of data that can be used to augment the knowledge and responses of an LLM.  



Advantages of RAG:

  1. Enhances LLM knowledge by providing dynamic content beyond what the model was trained on, without changing its weights and biases, leading to more informed and accurate responses.
  2. Handles large volumes of data and addresses the input-context constraints of the models through mechanisms like chunking, retrieving only the most relevant information for each query.
  3. By leveraging existing pre-trained models and external data sources such as blob storage and databases, RAG can reduce the time required to prepare a dataset and train a model from scratch.
  4. RAG can be more cost-effective than fine-tuning an LLM, since fine-tuning and hosting require powerful compute. Fine-tuning charges are based on the size of the data being trained, training hours, and/or hosting hours.


Disadvantages of RAG:

  1. Implementing a RAG system can be complex, as it involves:
    1. Orchestrating the interaction between the data source(s) from which data is retrieved and the LLM.
    2. Ensuring the relevant chunks of data are retrieved through a proper chunking mechanism, which requires thorough testing.
  2. The choice of search mechanism, or combination of search mechanisms, determines how accurately data is retrieved.
  3. Common failure points are retrieving the wrong context and mis-ranking the relevant content.
  4. The retrieval process can add latency to the response time, as the system must search through the data source(s) to bring the relevant context to the LLM.
  5. The accuracy of the RAG approach depends on the quality and relevance of the data sources. Poorly curated data can lead to inaccurate or irrelevant responses.
  6. Frequent re-indexing is required when the data is updated often. Automating this requires integrating trigger-based pipelines.
  7. The context length of the LLM also poses a challenge. While models like GPT-4 Turbo have large context windows, experience has shown that very long contexts degrade performance in many cases.
  8. Since the pricing of these models is based on token count, large contexts can be costly as well, while a smaller context risks missing relevant content.
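The retrieve-then-augment flow can be sketched in a few lines. The corpus, query, and keyword-overlap scorer below are illustrative stand-ins for a real document store and vector search:

```python
# Minimal RAG sketch: chunk a corpus, retrieve the best chunk for a query,
# and assemble an augmented prompt. A naive keyword-overlap scorer stands in
# for a real vector/semantic search; corpus and query are made up.

def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into chunks of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(chunks: list[str], query: str, top_k: int = 1) -> list[str]:
    """Rank chunks by word overlap with the query and return the top_k."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return scored[:top_k]

corpus = (
    "The refund policy allows returns within 30 days of purchase. "
    "Shipping is free for orders above 50 dollars. "
    "Support is available on weekdays from 9 to 5."
)
query = "What is the refund policy for returns?"
context = retrieve(chunk(corpus), query)
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {query}"
print(prompt)
```

In production, the scorer would be replaced by an embedding-based search, and the chunking strategy itself would need the thorough testing noted above.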

Fine-Tuning of LLMs 


Most LLM families, such as OpenAI's GPT models, Llama, Falcon, and Mistral, offer the capability to fine-tune them on specific datasets for tasks like text generation and classification. Fine-tuning is particularly beneficial when the objective requires a level of specificity and customization that general models may not readily provide, or when the guiding information (context data) is too voluminous or intricate to be encapsulated within a single prompt. There are different ways in which fine-tuning can be achieved:

  1. Full fine-tuning: During full fine-tuning, the LLM is initialized with pretrained weights and then further trained on task-specific data through methods like backpropagation and gradient descent. All model parameters, including the pretrained weights, are updated to minimize a task-specific loss that measures the discrepancy between the model's predicted outputs and the actual values. Full fine-tuning lets the model learn task-specific patterns from the labeled data, enhancing its ability to generate predictions or outputs tailored to the target task.
  2. Parameter-Efficient Fine-Tuning (PEFT): Full fine-tuning requires considerable computational resources and labeled data, because the model is trained anew for the specific target task; and since LLMs have billions of parameters, the demand for compute intensifies further. Full fine-tuning may also lead to overfitting, since the task-specific dataset is often limited. PEFT methods mitigate these hefty requirements by selectively updating or modifying a specific subset of the LLM's parameters while still delivering performance on par with full fine-tuning. PEFT techniques include additive fine-tuning, partial fine-tuning, reparametrized fine-tuning, and hybrid fine-tuning.
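To see why PEFT shrinks the compute footprint, compare trainable parameter counts for one weight matrix under full fine-tuning versus a LoRA-style low-rank adapter (toy dimensions chosen purely for illustration):

```python
# Back-of-the-envelope comparison of trainable parameters: full fine-tuning
# updates every weight of a d x k matrix, while a LoRA-style PEFT method
# trains only two low-rank factors B (d x r) and A (r x k).

d, k = 4096, 4096   # hypothetical weight-matrix shape for one layer
r = 8               # low rank of the adapter

full_params = d * k        # every entry of W is trainable
lora_params = r * (d + k)  # only B and A are trainable
reduction = full_params / lora_params

print(f"full: {full_params:,}  lora: {lora_params:,}  ~{reduction:.0f}x fewer")
```

Scaled across the billions of parameters in an LLM, this difference is what makes PEFT feasible on modest hardware.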


Advantages of fine-tuning:

  1. Fine-tuning can significantly improve the performance of pre-trained LLMs by adapting their pre-existing knowledge to the new task, resulting in a more accurate and efficient model. This yields higher-quality results than prompt engineering combined with RAG alone.
  2. A fine-tuned model becomes context-aware, so no additional systems are required to bring context data to the LLM (in most scenarios, although certain cases use a hybrid approach of RAG and fine-tuning). There is also better control over the input-context length.
  3. It helps lower response latency, particularly when using smaller models.


Disadvantages of fine-tuning:

  1. Cost implications: fine-tuning requires expensive computational resources such as GPUs or TPUs to train the model at regular intervals, and the cost increases with longer training times on large datasets. If the training frequency is low, this cost can be optimized. However, hosting the models at scale also requires expensive GPUs, so continuous hosting would be extremely expensive. (This blog outlines solutions that can help optimize this cost.)
  2. The dataset must be prepared in specific formats, which can be time-consuming and expensive depending on the size and nature of the data.
  3. Re-training pipelines must be in place to ensure that the model continues to perform well on new and unseen data.

Because of the disadvantages above, most customers choose the RAG approach over fine-tuning, even though the latter provides more accurate, higher-quality responses.


In this blog, we explore scenarios where fine-tuned models can be hosted for inferencing through on-demand deployment, which avoids continuously hosting the deployed models and thereby reduces hosting charges drastically. We will use Azure OpenAI base models to describe the solution approaches for fine-tuning and hosting.

We use the term "fine-tuning" irrespective of the type of fine-tuning applied, since the focus of this blog is to optimize the hosting charges while still using a fine-tuned model.


Let us recap the steps for fine-tuning an Azure OpenAI model and deploying it. For Azure OpenAI models, the fine-tuning operations can be performed using the REST API or the SDK.


  1. Prepare training and validation data:

The dataset must be annotated for the specific task and formatted as UTF-8-encoded JSONL, and the file must be smaller than 100 MB. Depending on the type of Azure OpenAI base model (completion or chat), the data must be prepared with "prompt" and "completion" fields in the former case, or "messages" with the corresponding roles in the latter. The minimum number of training samples can be as low as 10, but that is rarely sufficient for good-quality output, so aim for at least 50 high-quality training examples. When increasing the number of samples, ensure they are of the highest quality and representative of the data; data pruning is therefore a key step before training, as poor examples lead to worse responses. Pruning also helps optimize the size of the training data, which drives the training time and hence the cost. The training cost on Azure varies with the base model and the training time.

For chat models, the training and validation data must be prepared as messages with system, user, and assistant roles and their corresponding content. The content for the "system" role typically remains the same across samples, while the "user" and "assistant" content captures the information the Azure OpenAI chat model is being fine-tuned on. For completion models, the data must be prepared as "prompt" and "completion" pairs. To automate the fine-tuning process, add preliminary checks on the training and validation files: the number of training samples, the format of the data, whether each sample's total token count is within the corresponding model's token limit, and so on. These checks can be part of a CI/CD pipeline so that appropriate corrections can be made before fine-tuning starts.
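The chat-format data and the preliminary checks described above can be sketched as follows. The samples, the role rules, and the word-count heuristic are illustrative; Azure's own validation and token counting are more involved:

```python
import json

# Sketch: build chat-format training samples and run basic pre-flight checks
# (required keys/roles, rough per-sample size) before uploading the JSONL file.
# Sample content is made up; the word count is a crude stand-in for tokens.

samples = [
    {"messages": [
        {"role": "system", "content": "You are a support assistant for Contoso."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Open Settings > Account and choose Reset password."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a support assistant for Contoso."},
        {"role": "user", "content": "Where can I download invoices?"},
        {"role": "assistant", "content": "Invoices are under Billing > History."},
    ]},
]

def validate(samples, max_words_per_sample=4096):
    """Return a list of problems found; empty means the checks passed."""
    errors = []
    for i, s in enumerate(samples):
        msgs = s.get("messages")
        if not msgs:
            errors.append(f"sample {i}: missing 'messages'")
            continue
        roles = [m.get("role") for m in msgs]
        if roles[:1] != ["system"] or "assistant" not in roles:
            errors.append(f"sample {i}: unexpected roles {roles}")
        words = sum(len(m.get("content", "").split()) for m in msgs)
        if words > max_words_per_sample:  # crude stand-in for a token count
            errors.append(f"sample {i}: too long ({words} words)")
    return errors

jsonl = "\n".join(json.dumps(s) for s in samples)  # written to disk as UTF-8 JSONL
print(validate(samples), len(jsonl.splitlines()))
```

Wiring a check like this into the CI/CD pipeline catches malformed files before any training cost is incurred.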

2. Fine-tune the base models: The training and validation datasets can be uploaded to the Azure OpenAI service using the SDK, the REST API, or Studio.

  1. Upload the fine-tuning files:
    import os
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        api_key=os.getenv("AZURE_OPENAI_API_KEY"),
        api_version="2023-12-01-preview",  # this API version or later is required to access fine-tuning for turbo/babbage-002/davinci-002
    )

    training_file_name = "training_set.jsonl"      # your training file (JSONL)
    validation_file_name = "validation_set.jsonl"  # your validation file (JSONL)

    # Upload the training and validation dataset files to Azure OpenAI with the SDK.
    training_response = client.files.create(
        file=open(training_file_name, "rb"), purpose="fine-tune"
    )
    training_file_id = training_response.id

    validation_response = client.files.create(
        file=open(validation_file_name, "rb"), purpose="fine-tune"
    )
    validation_file_id = validation_response.id

    print("Training file ID:", training_file_id)
    print("Validation file ID:", validation_file_id)
  2. Create a fine-tuning job with the training file and the base model:
    response = client.fine_tuning.jobs.create(
        training_file=training_file_id,
        validation_file=validation_file_id,
        model="base-model-name",  # e.g. "gpt-35-turbo-0613"
        hyperparameters={
            "n_epochs": 3,      # integer
            "batch_size": 8,    # integer
            "learning_rate_multiplier": 0.1,  # recommended range: 0.02 to 0.2
        },
    )
    job_id = response.id
  3. Check the status of the fine-tuning job:
    response = client.fine_tuning.jobs.retrieve(job_id)
    print("Job ID:", response.id)
    print("Status:", response.status)

3. Test performance of the model 

Each fine-tuning job generates a result file called results.csv that contains various metrics and statistics about your customized model's performance. You can find the file ID for the result file in the list of your customized models, then use the Python SDK to download the result file for further analysis. Details like step, training loss, training token accuracy, validation loss, and validation accuracy are provided for a sanity check that training went smoothly: the loss should decrease and the accuracy should increase.
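A sanity check over a downloaded results.csv might look like the sketch below. The column names and values here are illustrative; verify them against the header of your actual result file:

```python
import csv
import io

# Sanity-check sketch over a results.csv-style file: training loss should
# trend down and token accuracy up across steps. The CSV content below is
# made up to stand in for a downloaded result file.

results_csv = """step,train_loss,train_mean_token_accuracy,valid_loss
1,2.10,0.41,2.20
50,1.30,0.62,1.45
100,0.80,0.74,0.95
"""

rows = list(csv.DictReader(io.StringIO(results_csv)))
losses = [float(r["train_loss"]) for r in rows]
accs = [float(r["train_mean_token_accuracy"]) for r in rows]

loss_decreasing = all(a > b for a, b in zip(losses, losses[1:]))
acc_increasing = all(a < b for a, b in zip(accs, accs[1:]))
print("loss decreasing:", loss_decreasing, "accuracy increasing:", acc_increasing)
```

If either trend fails, revisit the training data quality and hyperparameters before deploying the model.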

4. Deploy the model to an endpoint:

The model can be deployed using the SDK, the REST API, or Studio. Deploying a model requires authorization, an API version, and a request URL. The code snippet below shows token-based authorization, where an authorization token must first be generated. The status of the deployment can be obtained from the response to the deployment request.

import json
import os
import requests

token = os.getenv("TEMP_AUTH_TOKEN")  # Azure AD bearer token, e.g. from `az account get-access-token`
subscription = "<subscription-id>"
resource_group = "<resource-group-name>"
resource_name = "<azure-openai-resource-name>"
model_deployment_name = "custom-deployment-name"  # name you will use to reference the model when making inference calls

deploy_params = {"api-version": "2023-05-01"}
deploy_headers = {
    "Authorization": "Bearer {}".format(token),
    "Content-Type": "application/json",
}
deploy_data = {
    "sku": {"name": "standard", "capacity": 1},
    "properties": {
        "model": {
            "format": "OpenAI",
            "name": "<fine_tuned_model>",  # retrieve this value from the fine-tuning job response
            "version": "1",
        }
    },
}
deploy_data = json.dumps(deploy_data)

request_url = (
    "https://management.azure.com/subscriptions/"
    f"{subscription}/resourceGroups/{resource_group}"
    "/providers/Microsoft.CognitiveServices"
    f"/accounts/{resource_name}/deployments/{model_deployment_name}"
)

print("Creating a new deployment...")
r = requests.put(request_url, params=deploy_params, headers=deploy_headers, data=deploy_data)

Once the deployment is completed, it is ready for inferencing. Whether the requests are ad hoc, continuous, or batch, the model remains hosted 24x7, which incurs hosting charges on Azure. Hosting charges vary between models and accrue for the entire time the model is hosted.


Scenarios and proposed solutions:


Scenario 1:


Batch processing scenarios:

Business use cases like summarizing reviews, post-call analytics, extracting relevant data or aspects from reviews, or analyzing the data to gain insights (such as identifying trends, patterns, or areas for improvement) are usually handled through batch processing. Executing these use cases with out-of-the-box models might require complex prompt engineering; they can instead be done efficiently with fine-tuned models tailored to the specific requirements, resulting in improved accuracy.

In batch processing scenarios, the information is processed at a specific, pre-determined time. Hence, once a fine-tuned model has been created following steps 1-3 above, an external time-based trigger can start the deployment of the fine-tuned model.

The deployment of the model can be done using the SDK (step 4) or the REST API. The external time-based trigger for deployment can be implemented in different ways:

  1. Timer trigger for Azure Functions: the code that deploys the model runs in an Azure Function, and a timer trigger runs that function on a schedule.
  2. Running the code on a regular cron schedule on Unix-like systems.
  3. Deployment as part of a pipeline.
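For option 2, a pair of crontab entries could deploy the model before a nightly batch window and remove it afterwards. The script paths and times below are hypothetical placeholders:

```shell
# Deploy the fine-tuned model at 01:00 and delete the deployment at 05:00, daily
0 1 * * * /usr/bin/python3 /opt/batch/deploy_finetuned_model.py >> /var/log/ft_deploy.log 2>&1
0 5 * * * /usr/bin/python3 /opt/batch/delete_deployment.py >> /var/log/ft_delete.log 2>&1
```

The two scripts would wrap the deployment (step 4) and deletion calls shown elsewhere in this post.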

The deployment status can be obtained from the deployment provisioning state returned in the response to the REST API call. When "provisioningState" = "Succeeded", the service is ready for inferencing.
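Polling for that state can be sketched as follows. A stub stands in for the real GET on the deployment URL (which returns JSON containing properties.provisioningState); swap in an actual HTTP call in practice:

```python
import time

# Poll the deployment's provisioning state until it is ready (or fails).
# The iterator below simulates a deployment that becomes ready on the
# third poll; a real implementation would issue a GET per poll.

_states = iter(["Creating", "Creating", "Succeeded"])

def get_provisioning_state() -> str:
    """Stub: replace with a GET request against the deployment's request_url."""
    return next(_states)

def wait_until_ready(poll_seconds: float = 0.0, max_polls: int = 10) -> bool:
    for _ in range(max_polls):
        state = get_provisioning_state()
        print("provisioningState:", state)
        if state == "Succeeded":
            return True
        if state in ("Failed", "Canceled"):
            return False
        time.sleep(poll_seconds)
    return False

ready = wait_until_ready()
print("ready for inferencing:", ready)
```

A real poll interval of 30-60 seconds is usually plenty, since deployments take minutes rather than seconds.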


In the majority of batch processing scenarios, the data to process is stored in a data store (e.g., blob storage or a database). Batch processing is complete once all the data has been inferenced with the fine-tuned model and the extracted data/analysis has been persisted to a data store. Subsequently, the model deployment can be deleted using the REST API or the Azure CLI.
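Deletion uses the same ARM deployments URL as the PUT shown in step 4, issued as an HTTP DELETE. The sketch below only constructs the request with placeholder values rather than sending it:

```python
import urllib.request

# Sketch: build the DELETE request that removes the fine-tuned model
# deployment. All identifiers below are placeholders; uncomment the
# final line to actually send the request.

subscription = "00000000-0000-0000-0000-000000000000"  # placeholder
resource_group = "my-resource-group"                   # placeholder
resource_name = "my-aoai-resource"                     # placeholder
model_deployment_name = "my-finetuned-deployment"      # placeholder
token = "placeholder-bearer-token"

request_url = (
    "https://management.azure.com/subscriptions/"
    f"{subscription}/resourceGroups/{resource_group}"
    "/providers/Microsoft.CognitiveServices"
    f"/accounts/{resource_name}/deployments/{model_deployment_name}"
    "?api-version=2023-05-01"
)
req = urllib.request.Request(
    request_url,
    method="DELETE",
    headers={"Authorization": f"Bearer {token}"},
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req)  # send the request for real
```

Deleting the deployment stops the hosting charges; the fine-tuned model itself remains available for redeployment later.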




Scenario 2:

Near-real-time scenarios: In certain business use cases, the extraction, processing, or generation of data must happen in real time during the operating hours of the business units. For example: a human-agent-led customer support call where uploaded data must be processed to extract relevant information, retrieve information to address user queries, or analyze customer inquiries and route them to the appropriate support agents or departments during business hours.

In these scenarios, the fine-tuned model deployment can be started by an external trigger a few minutes before business hours begin. The fine-tuned model is then ready to perform its specific tasks during business hours, persisting the extracted/processed/generated data to a data store depending on the task. After business hours, another external time-based trigger deletes the fine-tuned model deployment.




Scenario 3: 

Certain use cases involve ad hoc requests that are not limited to business hours, but where the response or outcome is not expected in real time. The user might not wait for the outcome, instead expecting it to be sent to them or to the next workflow, or persisted in a location from which it can be downloaded later. Examples include processing large documents (using Azure OpenAI models that support large input contexts, after relevant preprocessing steps, or using RAG to find the relevant paragraphs) such as legal documents, contracts, and RFPs, to extract or analyze information against a submitted list of queries; or automated employee performance evaluation that analyzes quantitative and qualitative aspects before posting the result to the manager for final review. Since these requests are ad hoc and their timing cannot be anticipated, time-based triggers will not be useful. Instead, we should look at event-based triggering of the fine-tuned model deployment: as soon as the user clicks a submit button or uploads a document, the deployment of the corresponding fine-tuned model must be triggered.




The blocks are briefly explained below: 

Step 1: 

  1. Get the status of the deployment by its name.

Step 2: 

Case 1: Status == Succeeded/Creating/Accepted

  1. Send the request for inferencing with retry logic: inferencing can be done through the REST API or SDK.
  2. Once the inferencing is done, persist/publish the result in accordance with the use case.
  3. Delete the deployment using the REST API or SDK.

Case 2: Status != Succeeded/Creating/Accepted

  1. Deploy the fine-tuned model: the model can be deployed through the REST API, SDK, or Azure CLI.
  2. Check the provisioning status.
  3. If the status is "Succeeded", send the inference request with retry logic.
  4. Then follow Case 1, steps 2 and 3 (persist the result, then delete the deployment).
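Putting the two cases together, the event-triggered flow can be sketched with stubbed Azure calls. Every function below is a hypothetical stand-in for the REST/SDK operations shown earlier:

```python
# Sketch of the event-triggered flow: check the deployment status, deploy if
# absent, infer with retries, persist the result, then delete the deployment.
# A dict simulates the Azure deployment state for this self-contained example.

deployment = {"state": None}  # toy stand-in for the Azure deployment

def get_status():
    return deployment["state"]           # real call: GET on the deployments URL

def deploy_model():
    deployment["state"] = "Succeeded"    # real call: PUT on the deployments URL

def delete_deployment():
    deployment["state"] = None           # real call: DELETE on the same URL

def infer(payload, retries=3):
    for _ in range(retries):             # real code would wait between attempts
        if get_status() == "Succeeded":
            return f"result for {payload}"  # real call: chat/completions request
    raise RuntimeError("deployment never became ready")

def handle_request(payload):
    if get_status() not in ("Succeeded", "Creating", "Accepted"):  # Case 2
        deploy_model()
    result = infer(payload)              # Case 1, step 1
    print("persisting:", result)         # Case 1, step 2: persist/publish
    delete_deployment()                  # Case 1, step 3
    return result

out = handle_request("uploaded-contract.pdf")
```

Wired to an event source such as a blob-upload trigger, this dispatcher keeps the model deployed only for the duration of each ad hoc request.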

The approach described in this blog helps reduce the hosting charges for fine-tuned Azure OpenAI models, since they are not hosted 24x7. These deployment strategies can also be extended to other fine-tuned LLMs from the Azure model catalog when the use case matches one of the scenarios above, avoiding continuous hosting of the models.


This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.