Evaluating Small Language Models for RAG using Azure Prompt Flow (LLama3 vs Phi3)


Recently, small language models have made significant progress in terms of quality and context size. These advancements have enabled new possibilities, making it increasingly viable to leverage these models for retrieval-augmented generation (RAG) use cases. Particularly in scenarios where cost sensitivity is a key consideration, small language models offer an attractive alternative.  

This post demonstrates use Azure ML Prompt Flow to create a Q&A solution using Small Language Models (SLMs) with your own data. It also compares the performance of two small language models: Phi3-mini-128K and Llama3-8B. 

What is RAG? RAG stands for Retrieval Augmented Generation. It means adding context to the LLM input prompt by retrieving information from a corpus of data, hence grounding the model with your information.  RAG helps to avoid the problems of the LLM making up facts or using outdated knowledge. With the new capabilities of SLM's, like a context window of up to 128K tokens, using SLM's for RAG solution is becoming more feasible. 


Small language models (SLM's) are more streamlined versions of LLMs, with fewer parameters and simpler architectures. SLM's can be designed to process data locally and can be deployed on mobile devices.  They can be trained with relatively small datasets and are more explainable which means RAG techniques can greatly enhance the performance and experience of an SLM. 

– The Benefits of Small Language Model:

Less complex and more computationally efficient models like SLMs can excel at simple tasks such as:

  • Offline settings, on-device or on-prem, where local inference may be required.
  • Scenarios with latency constraints where quick response times are essential.
  • Tasks/use cases with cost limitations, especially those with simpler tasks.
  • Environments with limited resources.
  • Specific tasks can be achieved better through fine-tuning SLMs (vs. large model out-of-box)


– Benchmarking and evaluation:

Before deploying any solution to production, it's recommended to assess and compare all the possible alternatives to address the use case, including small language models. This way, you can decide based on the balance between accuracy and cost.


  1. Prepare the evaluation dataset:

We will use subset of Mini Wiki dataset as an evaluation dataset which includes:

–  Source docs: Each Wiki article can be saved in a text file to make the indexing and vectorization easy.

–  Q&A: In this blog we will use a subset of 50 questions and answers and save them as jsonl file. Here is the format:

{"question":"The Celsius crater on the Moon is what?","answer":"named after him","id":148}

{"question":"Is the Celsius crater on the Moon named after him ?","answer":"Yes","id":149}

  1. Azure Search – Build the index:

You can use the integrated vectorization feature in Azure Search to index the source docs prepared in the first step: Announcing the Public Preview of Integrated Vectorization in Azure AI Search – Microsoft Community H…


  1. Model Catalog in Azure ML Studio – Deploy SLM Inferencing Endpoints:

From the model catalog in Azure ML Studio, deploy two inferencing endpoints for both “Phi3-mini-128k-instruct” and “Meta-Llama-3-8B-Instruct”. It may take up to 20 min to get the endpoint up and running.







  1.  Prompt flow – “Custom” connections:  create 2 Custom connections to the end points created in the previous step:


When creating Custom connections, the required keys to set are:

– endpoint_url

This value can be found at the previously created Inferencing endpoint.

– endpoint_api_key

Ensure to set it as a secret value.

This value can be found at the previously created Inferencing endpoint.


Supported values: LLAMA, DOLLY, GPT2, or FALCON

This value is dependent on the type of deployment you're targeting. In our case it's LLAMA for both models.

5. Prompt flow – Create Q&A on your data flow: clone the prompt flow “Q&A on your own data” template and start the runtime. you need to start the runtime before completing the next steps.


6. Prompt flow – Update “Lookup”: Connect “Lookup” which retrieves the source docs from the index created in step 2.


7. Prompt flow – Add “Open Model LLM” connector: you can find it from “More tools” button:

By default, the Prompt flow Q&A on your own data template uses “LLM” component to call Azure-Open- inferencing endpoint. Since we are using SLMs (MaaP) endpoint, we need to replace the template's default “LLM” with “Open Model LLM” component to connect to the language models.
In the new “Open Model LLM” component, use the “Custom connections” created in an earlier step to connect to Llama3 then add another variant to connect to Phi3. This is how the final flow should look like.



8. Prompt flow – Tune the prompt: Here is the prompt we used to make the answer as short as possible which aligns with the ground_truth answers we are using.

You are an AI assistant that helps users answer questions based on a specific context. 
Your answer should be a one short sentence, as precise as possible and should only come from the context. 




9. Prompt flow – Evaluation and benchmark:

We have prepared the Q&A evaluation data from step 1 and two SLMs. To compare how well they perform, we can use the evaluation feature in Prompt flow, which makes it simple and convenient. We will use 2 built-in metrics:

– GPT Similarity: GPT4 judges how good the output is compared to the Ground truth.

– Ada similarity: Cosine similarity between the embeddings of output and GT.

So, we need Azure OpenAI GPT-4 endpoint and embedding endpoint as prerequisites to complete the evaluation setup.

To setup the evaluation, there are 4 steps:

– Evaluation Step 1- Basic settings:


– Evaluation Step 2- Upload eval data:


– Evaluation Step 3- Evaluation metrics:


– Evaluation Step 4- Configure the evaluation: important note – the Ground truth needs to be set as the answers from the evaluation data we uploaded.  



With Prompt flow evaluation tool, you can inspect questions, responses and scores of different variations (prompts or models) at sample level as well as over the entire dataset. For this use case, using this evaluation dataset which consists of 50 questions, “Phi3-mini-128k-instruct” and “Meta-Llama-3-8B-Instruct” models achieved comparable ada-similarity and gpt-similarity score.






gpt-similarity (max is 5)



Screenshot of Comparing the batch runs of the 2 variants (Llama3 and Phi3) in Azure AI studio:



This blog post discusses the benefits of using Small Language Models (SLMs) in certain scenarios. SLM's are more computationally efficient than larger models. They are particularly useful in scenarios with latency constraints where quick response times are essential, in tasks/use cases with cost limitations, and in environments with limited resources. The blog post also discusses the importance of assessing and comparing all possible solutions, including SLM's, before deploying to production.  This should be an effort to balance quality and cost. Additionally, it provides a guide of use Azure Prompt Flow to build a Q&A on your own data solution with SLM's and compare the performance of different options of small language models, Meta-Llama-3-8B-Instruct and Phi3-mini-128k-instruct. 


This article was originally published by Microsoft's AI - Machine Learning Blog. You can find the original article here.