How To Control Azure OpenAI Models

image

The first step is to choose the correct model and configure its parameters appropriately. Depending on your use case, you need to determine which model can deliver the best quality. This requires knowledge of the available models and their capabilities. Once you have selected the right model, focus on the parameters. For instance, setting the temperature to 0 will prevent the model from generating creative responses. Another important parameter is the max_token parameter which is crucial for optimizing latency performance.  For retrieval-augmented generation use cases, it can be set a value between 500 to 800 tokens as a . This parameter also affects the number of API calls you can make per minute, so finding the optimal value is essential. Additionally, the stop_sequence parameter allows you to define when the model should stop. You need to review and adjust other parameters as needed to ensure they meet the requirements of your use case.

After selecting the correct model and configuring its parameters appropriately, the next step is prompt engineering. What is prompt engineering? Prompt engineering is the process of improving the quality of prompts through various techniques. It is essential to understand prompt engineering techniques thoroughly and refine your prompts iteratively to achieve the best results.

The quality of the input prompts we send to Azure OpenAI models directly influences the quality of the responses we receive. Prompts are the text inputs that define our expectations, and the output generated by the model depends on the prompt. Outputs can include completions, conversations, or embeddings, depending on the Azure OpenAI model used. Azure OpenAI models use natural language instructions and examples in the prompt to identify the task. The model then completes the task by predicting the most probable next text. This technique is known as “in-context” learning, which operates without altering the actual weights of the model.

Let's explore the :

  1. There are zero-shot, one-shot, and few-shot learning techniques:
  • Zero-shot: Predicting with no sample provided.
  • One-shot: Predicting with one sample provided.
  • Few-shot: Predicting with a few samples provided.

When using the few-shot learning technique, the models are not retrained in the traditional sense. Instead, they calculate predictions based on the context included in the prompt. There is no change in the model's weights; the examples are learned on the spot.

2. Place instructions at the beginning of the prompt, and use ### or “””  (any special characters) to separate the instructions from the context.

3. Be specific, descriptive, and detailed about the desired context, outcome, length, format, style, etc. For instance, you have to use responsible instructions in your meta prompt.

4. Break complex tasks into simpler subtasks.

5. Instead of merely stating what not to do, clearly specify what to do.

6. Prompt the model to explain its reasoning before providing an answer (chain of thoughts).

7. Expand the model's knowledge by integrating other tools.

After executing these two steps, evaluate whether you are satisfied with the outputs and accuracy. If not, consider whether your use case involves Retrieval-Augmented Generation (RAG). Specifically, ask yourself if you need to use your data to answer questions or generate content. If the answer is yes, then you need to implement the RAG pattern. RAG is a feature that enables you to harness the power of Large Language Models (LLMs) with your enterprise data. With RAG, you can use Azure OpenAI (AOAI) to generate text, summarize information, and chat in the context of your customer's knowledge base, effectively grounding the data. This technique allows the same LLM model to function as a reasoning engine over new data, enabling in-context learning. By providing the context in your prompt and instructing AOAI to answer based solely on the given context, you ensure more accurate and relevant responses.

Lets assume that you've tried all three steps: selected the correct model, set the appropriate parameters, and applied proper prompt engineering techniques and also checked if your use case is suitable for RAG. However, if you still struggle with accuracy, it may be necessary to modify the behavior of the LLM through supervised fine-tuning (SFT). In these situations, you will need an SFT dataset, which is a collection of prompts and their corresponding responses. These datasets can be manually curated by users or generated by other LLMs. Your fine-tuned Azure OpenAI models will be available exclusively for your use, ensuring tailored performance to meet your specific requirements.

 

Furthermore, you may also consider using RAG with your fine-tuned model. Combining RAG with a fine-tuned model can help you achieve higher accuracy and better performance tailored to your specific needs.

 

In light of this, when presented with a GenAI utilization scenario, it becomes imperative to methodically contemplate each sequential step as part of a structured cognitive process. This approach aims to identify and implement the most effective solution with a focus on achieving optimal performance metrics.

 

This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.