Evaluating Large and Small Language Models on Custom Data Using Azure Prompt Flow

The Evolution of AI and the Challenge of Model Selection

In recent years, the field of Artificial Intelligence (AI) has witnessed remarkable advancements, leading to an unprecedented surge in the development of small and large language models. These models are at the heart of various applications, aiding in everything from customer service chatbots to content creation and software development, and they give developers a plethora of options. However, this abundance also introduces complexity in choosing a model that not only delivers optimal performance on specific datasets but also aligns with business objectives such as cost efficiency, low latency, and content constraints.

The Imperative for a Robust Evaluation Pipeline

Given the diversity in language models, it's crucial to establish an evaluation pipeline that objectively assesses each model's efficacy on custom data. This pipeline not only aids in discerning the performance differentials between large and small language models (LLMs and SLMs) but also ensures that the selected model meets the predefined business and technical thresholds.


Implementation Details: Building an Evaluation Pipeline Using Azure Machine Learning (AML) and Prompt Flow (PF)

For this demonstration, we will use the phi-3-mini-4k-instruct model. The approach outlined here is adaptable and can be applied to other models as well, including fine-tuned models.

Deploying the Model to an Azure Managed Online Endpoint

Deploying your language model to an online endpoint makes it available for inference in a scalable and secure manner, and it is the first step in setting up a robust evaluation pipeline. Azure Managed Online Endpoints provide a secure, scalable, and efficient way to deploy and manage your AI models in production. The Model Catalog within AML Studio offers a host of LLMs and SLMs from Meta, Nvidia, Cohere, Databricks, and a wide range of other providers. These models are available in MLflow format with a one-click deployment option, as both Model as a Platform (MaaP) and Model as a Service (MaaS), and the catalog also provides seamless options for hosting your own custom models or bringing models in from Hugging Face (HF). Refer to the Azure documentation for instructions on deploying a model to a Managed Online Endpoint.

Here we will deploy the phi-3-mini-4k-instruct model from the Model Catalog on a Managed Online Endpoint.


Configure parameters like max_concurrent_requests_per_instance and request_timeout_ms correctly to avoid errors (429 Too Many Requests and 408 Request Timeout) and maintain acceptable latency levels. Refer to the detailed guidance here.

Sample code snippet:-
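A minimal sketch of such a deployment with the azure-ai-ml Python SDK (v2) might look like the following. The endpoint name, deployment name, instance type, and the two request-setting values are illustrative assumptions, not recommendations; the model ID has to be copied from the Model Catalog entry, and the chosen VM size must be one that the model supports and that your subscription has quota for.

```python
from typing import Any, Dict


def request_settings(max_concurrent: int = 4, timeout_ms: int = 90_000) -> Dict[str, Any]:
    """Build keyword arguments for OnlineRequestSettings (values are illustrative)."""
    if max_concurrent < 1 or timeout_ms < 1:
        raise ValueError("both settings must be positive")
    return {
        "max_concurrent_requests_per_instance": max_concurrent,
        "request_timeout_ms": timeout_ms,
    }


def deploy_model(
    subscription_id: str,
    resource_group: str,
    workspace: str,
    model_id: str,  # asset ID copied from the Model Catalog entry
    endpoint_name: str = "phi3-eval-endpoint",  # hypothetical name
    instance_type: str = "Standard_NC6s_v3",  # assumption: pick a SKU the model supports
) -> str:
    # Imported lazily so this module can be loaded without the Azure SDK installed.
    from azure.ai.ml import MLClient
    from azure.ai.ml.entities import (
        ManagedOnlineDeployment,
        ManagedOnlineEndpoint,
        OnlineRequestSettings,
    )
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace,
    )

    # Create the endpoint with key-based authentication.
    endpoint = ManagedOnlineEndpoint(name=endpoint_name, auth_mode="key")
    ml_client.online_endpoints.begin_create_or_update(endpoint).result()

    # Create the deployment, passing the concurrency and timeout settings.
    deployment = ManagedOnlineDeployment(
        name="default",
        endpoint_name=endpoint_name,
        model=model_id,
        instance_type=instance_type,
        instance_count=1,
        request_settings=OnlineRequestSettings(**request_settings()),
    )
    ml_client.online_deployments.begin_create_or_update(deployment).result()

    # Route all traffic to the new deployment.
    endpoint.traffic = {"default": 100}
    ml_client.online_endpoints.begin_create_or_update(endpoint).result()
    return endpoint_name
```

Keeping the request settings in a small helper makes it easy to tune concurrency and timeout per model without touching the rest of the deployment code.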


Introduction to Prompt Flow and Batch Evaluation

Prompt Flow, a development tool within Azure Machine Learning, is designed to streamline the entire development cycle of applications powered by LLMs, from prototyping and testing through evaluation and deployment. By orchestrating executable flows of LLMs, prompts, and Python tools through a visualized graph, it simplifies testing, debugging, and evaluating different prompt variants, which in turn eases the prompt engineering task. Prompt Flow (PF) lets developers and data scientists focus more on strategic tasks and less on operational complexities. It is particularly useful for teams looking to accelerate their machine learning and GenAI lifecycle and deploy scalable models efficiently.

Prompt Flow offers a suite of prebuilt evaluation metrics tailored for GenAI-based models, alongside traditional metrics. Together they provide a comprehensive framework for assessing the performance of LLMs, SLMs, and Azure OpenAI models across various dimensions. The list can also be extended with custom metrics such as BLEU, ROUGE, precision, recall, and others. This flexibility lets users incorporate metrics that cater to specific business requirements or research objectives, making model evaluations more robust and relevant. The built-in metrics available within PF include Groundedness, Relevance, Coherence, Fluency, GPT-based ranking on reference data, and the F1 score.
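To make the idea of a custom metric concrete, here is a small, dependency-free sketch of token-overlap precision, recall, and F1 between a reference answer and a model answer. It is an illustration of the kind of function that can be wrapped as a custom metric, not PF's own implementation; the function name and the lowercase whitespace tokenization are assumptions.

```python
from collections import Counter
from typing import Dict


def token_overlap_scores(groundtruth: str, prediction: str) -> Dict[str, float]:
    """Token-level precision, recall, and F1 using whitespace tokenization."""
    ref_tokens = groundtruth.lower().split()
    pred_tokens = prediction.lower().split()

    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, with the reference "the cat sat" and the prediction "the cat", both predicted tokens appear in the reference, so precision is 1.0, recall is 2/3, and F1 is 0.8.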


Setting Up the Prompt Flow Evaluation Pipeline

Step 1: Create connections for AOAI, the embedding model, and the custom model. Also create a connection to the knowledge base.

For custom and open-source models, establish a custom connection by providing details about the inference server endpoint, key, and deployment name.


For the knowledge base, Azure AI Search provides secure information retrieval at scale over user-owned content in traditional and generative AI search applications. Chunk and index your documents using AI Search, and use the built-in Azure AI Search connector within PF to establish a connection to the knowledge base.
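Chunking is usually done before documents are pushed to the index. The sketch below shows one simple approach, fixed-size word windows with overlap, purely as an illustration; the function name and the default sizes are assumptions, and Azure AI Search also offers its own built-in chunking options that may suit your data better.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows of chunk_size words that share overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps sentences that straddle a boundary retrievable from either neighboring chunk, at the cost of a slightly larger index.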

Step 2: Utilize Multi-Turn Q&A Flow from PF gallery

The PF gallery provides a range of pre-built flows that can be cloned and customized, or you can build a flow from scratch. We will clone the multi-turn Q&A flow available within the gallery for a streamlined setup.


Step 3: Runtime: Select existing runtime or create a new one

Before you start authoring, you should first select a runtime. The runtime serves as the compute resource required to run the prompt flow, and it includes an image that contains all necessary dependency packages. It's a must-have for flow execution.

You can select an existing runtime from the dropdown or select the Add runtime button, which opens a runtime creation wizard. Select an existing compute instance from the dropdown or create a new one. After this, you will have to select an environment to create the runtime. We will use the default environment to get started quickly.

Step 4: Map the input and output of each node

Ensure each node points to the right indexes, AOAI, and custom connections.

Step 5: Modify the prompt variant to set the bot tone, personality, postprocessing of retrieved context and any additional formatting if required.

Sample Prompt:-

You are an AI assistant that helps users answer questions given a specific context. You will be given a context and asked a question based on that context. If the information is not present in the context or if you don't know the answer, simply respond by saying that I don't know the answer, please try asking a different question. Your answer should be as precise as possible and should only come from the context.
Context : {{context}}
Question : {{question}}
AI :
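The post-processing of retrieved context mentioned in this step can be sketched in a few lines. The example below deduplicates retrieved chunks, caps their total length, and fills the two placeholders of the sample prompt. PF itself renders prompt templates with Jinja; the plain string replacement here is a dependency-free stand-in, the prompt text is abridged, and the function names and the character cap are assumptions.

```python
from typing import List

# Abridged copy of the sample prompt, with the same two placeholders.
PROMPT_TEMPLATE = (
    "You are an AI assistant that helps users answer questions given a specific context.\n"
    "Context : {{context}}\n"
    "Question : {{question}}\n"
    "AI :"
)


def format_context(chunks: List[str], max_chars: int = 2000) -> str:
    """Normalize whitespace, drop duplicate chunks, and cap the total length."""
    seen = set()
    kept: List[str] = []
    total = 0
    for chunk in chunks:
        text = " ".join(chunk.split())
        if not text or text in seen:
            continue
        if total + len(text) > max_chars:
            break
        seen.add(text)
        kept.append(text)
        total += len(text)
    return "\n".join(kept)


def render_prompt(context: str, question: str) -> str:
    """Fill the placeholders (stand-in for the Jinja rendering PF performs)."""
    return PROMPT_TEMPLATE.replace("{{context}}", context).replace("{{question}}", question)
```

Capping the context length is one way to stay within a small model's context window, such as the 4k tokens of the model used here, although a character cap is only a rough proxy for a token budget.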

Step 6: Add a Python tool as a new node and replace the existing code with the code below:
The tool makes an API call to the managed endpoint using the custom connection created in Step 1 and parses the response. Validate and parse the input, map all the connections, and save.

import json
import urllib.error
import urllib.request

from promptflow import tool
from promptflow.connections import CustomConnection


@tool
def my_python_tool(message: str, myconn: CustomConnection) -> str:
    # Get the endpoint URL and key from the custom connection
    url = myconn.api_base
    api_key = myconn.api_key

    # Request payload: one user message plus the generation parameters
    data = {
        "input_data": {
            "input_string": [
                {
                    "role": "user",
                    "content": message,
                }
            ],
            "parameters": {
                "temperature": 0.3,
                "top_p": 0.1,
                "max_new_tokens": 200,
            },
        }
    }
    body = str.encode(json.dumps(data))

    # 'azureml-model-deployment' routes the request to a specific deployment
    headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key), 'azureml-model-deployment': myconn.deployment }

    req = urllib.request.Request(url, body, headers)

    try:
        response = urllib.request.urlopen(req)
        result = response.read()
        # The generated text is returned under the "output" key
        result = json.loads(result.decode('utf-8'))["output"]
        return result
    except urllib.error.HTTPError as error:
        return "The request failed with status code: " + str(error.code)
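The deployment section noted that a busy endpoint can return 429 Too Many Requests or 408 Request Timeout. During a batch run, one way to cope is to retry those two status codes with exponential backoff. The wrapper below is a sketch of that idea, not part of the flow above: the function name, attempt count, and delays are assumptions, and it only works if the wrapped call lets urllib.error.HTTPError propagate instead of returning an error string.

```python
import time
import urllib.error
from typing import Callable

RETRYABLE_STATUS_CODES = (408, 429)


def call_with_retry(
    send: Callable[[], str],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send(), retrying on 408/429 with delays of base_delay * 1, 2, 4, ..."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except urllib.error.HTTPError as error:
            # Give up on non-retryable errors or when attempts are exhausted.
            if error.code not in RETRYABLE_STATUS_CODES or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

    raise RuntimeError("unreachable")  # the loop always returns or raises
```

In the tool, the request could then be issued as call_with_retry(lambda: urllib.request.urlopen(req).read().decode("utf-8")), with the HTTPError handler moved outside the wrapper.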

Step 7: Execute the Flow

Run the flow to ensure all connections are operational and error-free.


Step 8: Evaluate the Model

Click on 'Evaluate', select the desired metrics, and map the data accordingly. If using GPT-based metrics like Similarity-Score, ensure that the AOAI model is deployed and specify the deployment details in the evaluation section.

Step 9: Submit and Review Results

After you finish the input mapping, select Next to review your settings, then select Submit to start the batch run with evaluation. After submission, you can find the batch run details in the Runs tab on the PF page. Click View output to check the responses and the associated metrics generated by the flow. To download the results, click the "Export" tab and download the outputs in .csv format.
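Once the results are exported, per-row metrics can be summarized offline. The sketch below averages selected metric columns from the CSV text using only the standard library. The column names in the usage note are hypothetical; use whatever column headers appear in your exported file.

```python
import csv
import io
from statistics import mean
from typing import Dict, Iterable, List


def average_metrics(csv_text: str, metric_columns: Iterable[str]) -> Dict[str, float]:
    """Average each requested metric column, skipping blank or non-numeric cells."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    averages: Dict[str, float] = {}
    for column in metric_columns:
        values: List[float] = []
        for row in rows:
            raw = (row.get(column) or "").strip()
            try:
                values.append(float(raw))
            except ValueError:
                continue  # empty cell or an error message instead of a score
        if values:
            averages[column] = mean(values)
    return averages
```

For a file on disk, read it first, for example average_metrics(open("export.csv", encoding="utf-8").read(), ["gpt_groundedness", "f1_score"]), where both the file name and the column names are placeholders.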

Step 10: Adding custom metrics to the list of built-in evaluation metrics

Create a new flow from the gallery, but this time select the "Evaluation Flow" card within the gallery section and choose one of the evaluation flows. Here, we will select "QnA Groundedness Evaluation" from the gallery and add a custom BLEU score to measure the similarity of machine-generated text to reference text. Add a new Python tool to the end of the flow, replace its code with the snippet below, update the mapping, and save the flow.

Note:- If you use any third-party dependency, make sure to add the required library to requirements.txt within the Files section. The example below uses nltk, so add nltk to requirements.txt.

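from promptflow import tool
from nltk.translate.bleu_score import sentence_bleu


@tool
def get_bleu_score(groundtruth: str, prediction: str) -> float:
    # Tokenize both texts on whitespace
    ref_tokens = groundtruth.split()
    pred_tokens = prediction.split()
    # sentence_bleu expects a list of reference token lists, then the hypothesis tokens
    return sentence_bleu([ref_tokens], pred_tokens)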

Step 11: Submit a new batch evaluation

This time the newly added custom metric should show up in the customized evaluation section.

Select the required metrics, rerun the evaluation and compare the results.
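When comparing results across models, it can help to encode the business and technical thresholds explicitly and shortlist only the models that meet them. The sketch below does this for a dictionary of aggregated metrics per model. The function name, metric names, and numbers are made up for illustration and are not measured results.

```python
from typing import Dict, List


def shortlist_models(
    results: Dict[str, Dict[str, float]],
    minimums: Dict[str, float],
    maximums: Dict[str, float],
    rank_by: str,
) -> List[str]:
    """Return models meeting every threshold, best rank_by score first."""
    passing: List[str] = []
    for name, metrics in results.items():
        # A missing metric counts as failing the corresponding threshold.
        meets_min = all(metrics.get(k, float("-inf")) >= v for k, v in minimums.items())
        meets_max = all(metrics.get(k, float("inf")) <= v for k, v in maximums.items())
        if meets_min and meets_max:
            passing.append(name)
    return sorted(
        passing,
        key=lambda n: results[n].get(rank_by, float("-inf")),
        reverse=True,
    )


# Illustrative numbers only; replace with the averages from your own runs.
example_results = {
    "model_a": {"groundedness": 4.2, "bleu": 0.31, "latency_ms": 900.0},
    "model_b": {"groundedness": 4.6, "bleu": 0.35, "latency_ms": 2400.0},
    "model_c": {"groundedness": 3.1, "bleu": 0.40, "latency_ms": 300.0},
}
print(shortlist_models(example_results, {"groundedness": 4.0}, {"latency_ms": 1000.0}, "groundedness"))
```

With these made-up values, only model_a clears both the groundedness floor and the latency ceiling; relaxing the latency ceiling would bring the larger, slower model_b back into consideration.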

Through Azure's robust infrastructure and Prompt Flow, developers can efficiently evaluate different language models on custom datasets. This structured approach not only helps in making informed decisions but also optimizes model deployment in alignment with specific business and performance criteria.


What is Azure Machine Learning prompt flow – Azure Machine Learning | Microsoft Learn

Deploy machine learning models to online endpoints for inference – Azure Machine Learning | Microsof…


This article was originally published by Microsoft's AI - Machine Learning Blog. You can find the original article here.