Fine Tuning with Function Calling on Azure OpenAI Service

Following our recent update on the new features and capabilities of the Azure OpenAI (AOAI) service, this blog focuses on fine tuning with function calling. We'll take a deep dive into how fine tuning with function calling works and when you might want to use it, and provide an in-depth case study using stock price data.

In this blog we'll talk about…

  • What is function calling in Azure OpenAI Service
  • What does fine tuning have to do with function calling
  • Use case: Finetuning with function calling for Stock Prices
  • Best practices with function calling

What is function calling in Azure OpenAI?

Function calling refers to the capability to define and describe calls to external application programming interfaces (APIs). With function calling, you can instruct your Language Model to utilize these APIs when appropriate, based on the context provided by the prompt. This functionality expands the LLM's abilities by allowing it to interact with external services, access additional data sources, or perform specific tasks beyond its built-in capabilities.

On AOAI, the newest versions of OpenAI's gpt-35-turbo and gpt-4 now support function calling. When functions are provided in a request, the model evaluates the context to decide if any should be used, providing a JSON object with the function details. It also allows parallel function calls, executing tasks simultaneously and reducing the number of API requests for better performance.

Typical scenarios where function calling is applied involve:

  • Creating assistants capable of answering queries through external API calls
  • Translating natural language requests into API interactions
  • Parsing structured data from text inputs

Please note, function calling triggers the API call as required but does not execute it directly. Instead, your application handles the call and returns the response to the language model. This approach empowers you to manage external calls, ensuring control over your application's interactions.
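To make this concrete, a function is described to the model as a JSON schema. Below is a minimal sketch; the function name and parameters are our own illustration (not from this post), and the commented call assumes an already-configured Azure OpenAI client from the openai Python SDK:

```python
import json

# A sketch of a function definition in the Chat Completions schema.
# The function name and parameters are illustrative, not from this post.
get_stock_price_function = {
    "name": "get_current_stock_price",
    "description": "Get the current stock price for a company",
    "parameters": {
        "type": "object",
        "properties": {
            "symbol": {
                "type": "string",
                "description": "The stock ticker symbol, e.g. MSFT",
            }
        },
        "required": ["symbol"],
    },
}

# The definition rides alongside the messages; the model decides from context
# whether to emit a function call. (Client setup and deployment name omitted.)
#
# response = client.chat.completions.create(
#     model="gpt-35-turbo",  # your deployment name
#     messages=[{"role": "user", "content": "What is Microsoft trading at?"}],
#     functions=[get_stock_price_function],
#     function_call="auto",
# )
print(json.dumps(get_stock_price_function, indent=2))
```

When the model chooses to use the function, the response contains the function name and JSON-encoded arguments rather than a normal assistant message.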

What does fine tuning have to do with function calling?

Fine tuning with function calling teaches your model how, and when, to call external APIs. gpt-35-turbo (0613) and newer models support function calling in both training data and inferencing, so now both customized models and base models can make calls to external APIs. Fine tuning with function calling offers a multitude of benefits. Here is a list of some of the most important ones:

  • New Skills: Teach your model when to make function calls, or what to do with the results.
  • Cost savings: Shorten the descriptions and signatures for function calls to reduce prompt length. Fine tuning presents the opportunity to streamline token usage, particularly beneficial when dealing with numerous verbose functions. Achieve a more resource-efficient and optimized model by leveraging fine tuning techniques.
  • Enhanced accuracy, reliability and responsiveness: elevate the precision and reliability of model outputs. Fine tuning enables the model to produce more accurate and consistent results, especially in dynamic scenarios, fostering confidence in the system's capabilities.

Fine tuning with function calling is currently available for the gpt-35-turbo (0613) and gpt-35-turbo-16k (1106) models. With support for function calling, you can incorporate functions into your training data, and have your fine-tuned model make function calls.

Besides the dataset, the experience of fine tuning with function calling is the same as fine tuning any other model. See the documentation for more details.

Case Study: Finetuning with function calling for Stock Prices

To demonstrate the utility of function calling with fine-tuned models, let's use a real problem as a case study. We want to build a chatbot that retrieves stock prices from an external API, in response to user inquiries. With just the base model, we identified two challenges: (1) the model does a poor job at distinguishing real companies from fake, and (2) our function calling definitions were very long – and increased our tokens per prompt dramatically.

We'll explore how we can use fine tuning, with function calling, to improve the model's accuracy and performance. For each scenario, we'll build a training dataset, compare the fine-tuned model to the base model, and measure the improvement from fine tuning.

Once we've created a fine tuned model that meets our needs, we'll put it all together by developing a basic application that allows users to check stock prices for different companies. We will use the yfinance Python library for easy retrieval of current stock prices.
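As a sketch of that retrieval step, the latest closing price can be read off a yfinance history frame. The helper name below is our own, and the live call is shown commented since it needs network access:

```python
import pandas as pd

def latest_close(history: pd.DataFrame) -> float:
    # yfinance history frames expose a "Close" column indexed by date;
    # the last row holds the most recent price.
    return round(float(history["Close"].iloc[-1]), 2)

# With yfinance installed, the frame comes straight from the live API:
# import yfinance as yf
# price = latest_close(yf.Ticker("UBER").history(period="1d"))
```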

Scenario 1: Hallucination

A common problem with large language models is hallucinations – providing plausible but false responses. With function calling, hallucinations can happen when the model calls a function in the wrong context or provides incorrect information for the function call.

We evaluated whether the base model was able to correctly identify fake companies and respond appropriately, instead of trying to quote a stock price. Our test dataset consists of 10 samples, comprising 5 fake and 5 real companies. Even though we provided a clear system message not to make assumptions (asking for clarification if the exact stock ticker symbol isn't found), the base model struggled to differentiate between fake and real companies accurately. Please see the example below, where the base model generated a fake symbol for Titan Robotics and output a function call.

Inference with base model – gpt-35-turbo (0613)

{"role": "user", "content": "What was the closing price of Titan Robotics' stock last Friday"},

[Figure: base model output hallucinating a ticker symbol for Titan Robotics]

We need to teach the model when to make function calls – and when to decline. Fine tuning to the rescue!

To address hallucination and enhance accuracy, we created a training dataset with function calling capabilities. Each line of this dataset includes a set of "messages" (from user, system, and assistant roles) paired with stock functions. We included fake company examples, with appropriate responses, so we can teach our model to identify and respond to such fake requests. Our dataset consists of 96 samples.
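For reference, a training line in this style might look like the following sketch (JSONL, one example per line). The company name, function definition, and refusal wording are our own illustration rather than the post's actual data:

```python
import json

# One JSONL training line: a fake-company question where the assistant
# declines instead of emitting a function call. (Illustrative content.)
training_line = {
    "messages": [
        {"role": "system", "content": "Answer stock questions using the provided functions. If the company or ticker cannot be verified, ask for clarification instead of guessing."},
        {"role": "user", "content": "What is the current price of Titan Robotics?"},
        {"role": "assistant", "content": "I couldn't find a stock ticker for Titan Robotics. Could you double-check the company name or provide its ticker symbol?"},
    ],
    "functions": [
        {
            "name": "get_current_stock_price",
            "description": "Get the current stock price for a company",
            "parameters": {
                "type": "object",
                "properties": {"symbol": {"type": "string", "description": "Stock ticker symbol"}},
                "required": ["symbol"],
            },
        }
    ],
}

jsonl_record = json.dumps(training_line)  # one line of the training file
```

Real-company examples in the same file would instead end with an assistant message that contains the function call, so the model sees both behaviors side by side.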

[Figure: sample training line containing a fake-company example]

We trained a gpt-35-turbo (0613) model using a combination of different hyperparameters and evaluated it with the same test dataset. While our base model did a poor job of distinguishing between real and fake companies, our finetuned model intelligently identified invalid company entries. Please see the Titan Robotics example for reference.

[Figure: fine-tuned model output correctly declining the Titan Robotics request]

The table below illustrates the outcomes of the test dataset evaluation. It clearly demonstrates how fine tuning can identify hallucination and deliver more accurate and reliable results.

Test Dataset             Base Model                        Fine-Tuned Model
                         gpt-35-turbo (0613)               gpt-35-turbo (0613) finetuned

Real Companies           5/5 examples detected correctly   5/5 examples detected correctly
Fake Companies           0/5 examples detected correctly   4/5 examples detected correctly
Overall Accuracy         50%                               90%
Hallucination Accuracy   0%                                80%

While the fine tuned model is not perfect, it is significantly better than the base model. Depending on your use case, and your need for accuracy, you may choose to fine tune with more data to get even better performance.

Scenario 2: Token optimization

The inclusion of functions in the system message directly impacts token usage. As the number of functions grows, so does the number of tokens within the system message, resulting in verbose prompts and increased costs. Fine tuning lets you shorten your function calls by:

  • Omitting function and parameter descriptions by removing the description field from functions and parameters.
  • Omitting parameters entirely by removing the properties field from the parameters object (keep the properties field with an empty dictionary).
  • Excluding a function by removing the entire function object from the functions array.

Without fine tuning, the model may struggle to correctly use the function without this additional information, but with fine tuning you can show the model when to call the function without explaining as much in the prompt.

In our two stock functions, we achieved a noteworthy 55% reduction in tokens by eliminating the description field from both functions and parameters, and by emptying the properties field (keeping the field itself but with an empty dictionary) in the parameters object within each function. Below is the updated, shortened function.
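The original shortened function appears only as an image, so here is a reconstruction of the idea from the steps described above; the function name and parameter are illustrative, not the post's actual definitions:

```python
# Verbose definition: full description plus documented parameters.
verbose_function = {
    "name": "get_current_stock_price",
    "description": "Get the current stock price for a company",
    "parameters": {
        "type": "object",
        "properties": {
            "symbol": {"type": "string", "description": "The stock ticker symbol, e.g. MSFT"}
        },
        "required": ["symbol"],
    },
}

# Shortened definition: descriptions removed everywhere, and the properties
# field kept but emptied. The fine-tuned model learns the schema from its
# training data instead of from the prompt.
shortened_function = {
    "name": "get_current_stock_price",
    "parameters": {"type": "object", "properties": {}},
}
```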


To kickstart our testing process, we first establish a baseline. We'll proceed with three phases:

  • Initially, we'll assess our test dataset by inferencing through the base model gpt-35-turbo (0613) using the verbose function definitions (full functions).
  • In the second phase, we'll examine the performance of the base model gpt-35-turbo (0613) with shortened functions.
  • Finally, in the third phase, we'll evaluate the fine-tuned model using the shortened function definitions.

Let's begin by establishing the base model and full verbose functions as our baseline. The base model, gpt-35-turbo (0613), exhibited 100% accuracy with our test dataset, indicating its ability to generate the correct function when provided with complete prompts. However, when we transitioned to shortened functions while keeping the base model unchanged, it showed 0% accuracy, failing to detect any samples correctly and providing empty arguments in all 10 samples.

{"role": "user", "content": "what is the current price of Uber?"}

[Figure: base model output with shortened functions, returning empty arguments for the Uber query]

{"role": "user", "content": "What was the highest price that Walmart's stock reached last quarter?"}

[Figure: base model output with shortened functions, returning empty arguments for the Walmart query]

To investigate whether fine tuning could address this issue, we constructed a dataset of 100 samples, each containing both shortened stock functions. We experimented with various combinations of system messages and hyperparameters to enhance the accuracy of the fine-tuning process. Finally, we successfully fine-tuned a model that achieved 100% accuracy with our test dataset when using our shortened functions. Please refer to the table summary below and the output of the fine-tuned model for further details.

{"role": "user", "content": "what is the current price of Uber?"}

[Figure: fine-tuned model output with the correct function call for the Uber query]

{"role": "user", "content": "What was the highest price that Walmart's stock reached last quarter?"}

[Figure: fine-tuned model output with the correct function call for the Walmart query]

                   Base Model + Verbose   Base Model + Short   FT Model
Accuracy           100%                   0%                   100%
Number of tokens   230                    108                  108

Calculating the total cost of ownership: do shorter prompts save money?

When considering the cost trade-off between fine tuning with shortened functions and using a base model with full verbose functions, it is essential to assess factors such as the number of requests and associated costs. The base model has a higher per-prompt cost, due to length, but with fine tuning, we pay for both tokens and hosting the model. For our stock use case, the plot below compares the cost of fine tuning versus the base model: with many requests per day, fine tuning is less expensive than the base model!

[Figure: cost comparison of the fine-tuned model versus the base model by daily request volume]
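The shape of that comparison can be sketched numerically. The per-token prices and hosting fee below are assumed placeholders (check current Azure OpenAI pricing); only the per-request token counts come from our measurements above:

```python
# Break-even sketch. The prices and hosting fee are assumed placeholders,
# not official Azure OpenAI pricing; only the token counts per request
# come from our shortened-function measurements.
P_BASE = 0.0015        # assumed $ per 1K prompt tokens, base model
P_FT = 0.003           # assumed $ per 1K prompt tokens, fine-tuned model
HOSTING_PER_DAY = 3.0  # assumed flat hosting fee per day, fine-tuned model

TOKENS_VERBOSE = 230   # tokens per request, verbose functions (base model)
TOKENS_SHORT = 108     # tokens per request, shortened functions (fine-tuned)

def daily_cost_base(requests: int) -> float:
    return requests * TOKENS_VERBOSE / 1000 * P_BASE

def daily_cost_ft(requests: int) -> float:
    return HOSTING_PER_DAY + requests * TOKENS_SHORT / 1000 * P_FT

# First request volume (in steps of 1000/day) where fine tuning is cheaper:
# the flat hosting fee is amortized once per-request savings add up.
breakeven = next(r for r in range(0, 10_000_000, 1000)
                 if daily_cost_ft(r) < daily_cost_base(r))
```

The exact crossover point depends entirely on real prices and your request volume; the point of the sketch is that a fixed hosting fee plus a cheaper per-request cost eventually undercuts a purely per-request model.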

Wiring it all up: using our fine-tuned function calling models in an e2e application

Function calling only creates the call to an external API – it doesn't execute it. To actually execute the request, you'll need to extract the function name and arguments from the LLM response and proceed to call the function with those arguments. The function's output is in JSON format, which is then passed back to gpt-35-turbo to generate an appropriate result message for the user.

[Figure: end-to-end application flow for the stock price chatbot]
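The dispatch step described above can be sketched as follows; the function name and its stubbed body are placeholders for the real yfinance-backed helper:

```python
import json

def get_current_stock_price(symbol: str) -> dict:
    # Stub for illustration; the real helper would query yfinance.
    return {"symbol": symbol, "price": 42.0}

# Map the names the model may emit to actual Python callables.
AVAILABLE_FUNCTIONS = {"get_current_stock_price": get_current_stock_price}

def execute_function_call(function_call: dict) -> str:
    """Run the function the model asked for; return its JSON result."""
    func = AVAILABLE_FUNCTIONS[function_call["name"]]
    args = json.loads(function_call["arguments"])  # model sends JSON-encoded args
    return json.dumps(func(**args))

# The JSON result is appended as a "function" role message and the
# conversation is sent back to the model to phrase the final answer:
# messages.append({"role": "function", "name": call["name"], "content": result})
```

Looking the name up in an explicit dictionary, rather than calling `eval` on whatever the model returns, keeps the application in control of exactly which functions can run.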

Best Practices with Function Calling:

Although we may have made it look easy, getting quality examples that worked and were better than the base models required a lot of iteration and experimentation. We ran many trial models to identify the best performing one for each use case. A few recommendations, based on our experience:

  • Adequate training examples: While fine tuning will run with as few as ten examples, it's recommended to provide at least 100 high-quality training examples for optimal performance. Many use cases will require thousands of examples! You can also re-train previously fine tuned models with more examples.
  • Consistent function definitions: maintain consistency in function definitions between training and inference to guarantee accurate and reliable responses.
  • Explore different parameter combinations: While we provide default parameters, you should experiment with a range of parameters to improve performance. You'll likely want to adjust your learning rate multiplier and number of epochs.
  • Refine function definitions: enhance clarity and distinctiveness in function definitions. Clear and well-defined functions set the stage for improved accuracy. In our stock use case, clear and detailed instructions in system message significantly improved the performance of both the base and fine-tuned models.

When deploying your applications, consider:

  • Token management: Be mindful of token usage, and if needed, explore strategies such as omitting descriptions or entire functions to stay within token limits.
  • Real world impact assessment: when using function calling in language models, ensure responsible usage by validating function calls, using trusted data sources, and following the Principle of Least Privilege.
  • User confirmation Steps: Consider real-world impacts and implement user confirmation steps for actions with consequences to enhance control and security.


This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.