SuperRAG – How to achieve higher accuracy with Retrieval Augmented Generation

One of the most common use cases for generative AI is Retrieval Augmented Generation (RAG).  RAG lets you ground the LLM in your business data without retraining it.  It happens in three basic steps:

  1. Retrieve relevant documents based on a query or chat message from your user.  This is usually done over a search-enabled vector store such as Azure AI Search by creating an embedding of the query and performing a vector or hybrid search.
  2. Augment the LLM prompt with the retrieved documents to provide the required context and grounding data.
  3. Generate a response to the user's question from the LLM based on the augmented prompt.
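
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the article: `search` and `llm` are hypothetical callables standing in for your vector store query and your chat completion call, and the prompt wording is an assumption.

```python
from typing import Callable, List


def build_augmented_prompt(question: str, documents: List[str]) -> str:
    """Step 2: place the retrieved documents in the prompt as grounding data."""
    sources = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer the question using only the sources below.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


def answer_with_rag(
    question: str,
    search: Callable[[str, int], List[str]],  # hypothetical: (query, top_k) -> documents
    llm: Callable[[str], str],                # hypothetical: prompt -> completion text
    top_k: int = 5,
) -> str:
    documents = search(question, top_k)                   # Step 1: retrieve
    prompt = build_augmented_prompt(question, documents)  # Step 2: augment
    return llm(prompt)                                    # Step 3: generate
```

Passing the search and LLM calls in as functions keeps the sketch independent of any particular SDK, so you can swap in whichever client you already use.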

Research shows that if the answer to the user's question is not in the first 5 documents in the prompt, the likelihood of generating a correct answer drops significantly.  For this reason, most RAG applications return only the top 5 search results and use them to augment the prompt.  This works well in most use cases but relies entirely on the retrieval step to return the correct document. What if the answer to the user's question is not in the first 5 documents?  How can you increase the number of retrieved documents without diluting the ability of the LLM to answer the question?

Introducing SuperRAG – More powerful than a vector store!

SuperRAG involves retrieving 50 (or some other large number of) documents in the retrieval step and then iterating through them, asking an LLM whether each one answers the user's question.  Each document is scored on its relevance, and the relevant parts are extracted. The extracts are then sorted by score, and the top five are used to augment the prompt as in traditional RAG.
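
As a rough sketch of that flow, assuming a hypothetical `search` function for the wide retrieval and a hypothetical `evaluate` function that wraps the LLM call and returns a confidence score plus the relevant extract (neither name comes from the original article):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ScoredExtract:
    doc_id: int          # 1-based position in the retrieved list
    confidence: float    # LLM's confidence that the document answers the question
    relevant_text: str   # the part of the document worth keeping


def super_rag_select(
    question: str,
    search: Callable[[str, int], List[str]],           # hypothetical retriever
    evaluate: Callable[[str, str], Tuple[float, str]],  # hypothetical LLM scorer
    retrieve_n: int = 50,
    keep_k: int = 5,
) -> List[ScoredExtract]:
    # Retrieve far more documents than would normally go into the prompt.
    documents = search(question, retrieve_n)

    # Ask the LLM about each document and record its score and extract.
    scored = []
    for doc_id, text in enumerate(documents, start=1):
        confidence, extract = evaluate(question, text)
        scored.append(ScoredExtract(doc_id, confidence, extract))

    # Keep only the highest-scoring extracts for the final prompt.
    scored.sort(key=lambda s: s.confidence, reverse=True)
    return scored[:keep_k]
```

The extracts returned here, rather than the full documents, are what would be used to augment the final prompt.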

The benefit of this approach is that it can dramatically increase the amount of information retrieved and so increase the chances of finding the correct answer.  A vector search, which is commonly used in RAG applications, excels at semantic matching, such as recognizing synonyms and tolerating misspellings, but it doesn't really understand intent the way a human or an LLM does.  So, by retrieving many more documents and letting an LLM like GPT-3.5 decide whether each document answers the question, we can achieve higher accuracy in our generated answers.

One drawback to this approach is that it can be slower and more expensive than traditional RAG.  Because we must send each document to the LLM, we incur a latency penalty and increased token cost.  However, the latency can be mitigated to some degree by evaluating the documents in parallel.  Provisioned Throughput Units (PTUs) can also help lower the latency and, if fully used around the clock, lower the token costs.
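
One way to evaluate the documents in parallel is a thread pool, which suits this workload because each evaluation mostly waits on a network call. The sketch below assumes a hypothetical `evaluate(question, document)` function that wraps your LLM request; the worker count is an arbitrary illustration and should be tuned to your rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def evaluate_in_parallel(
    question: str,
    documents: List[str],
    evaluate: Callable[[str, str], Tuple[float, str]],  # hypothetical LLM scorer
    max_workers: int = 8,
) -> List[Tuple[float, str]]:
    """Score every document concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the documents were submitted,
        # regardless of which call finishes first.
        return list(pool.map(lambda doc: evaluate(question, doc), documents))
```

With this pattern, total latency is closer to that of the slowest batch of calls than to the sum of all calls, although the token cost is unchanged.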

Let's see it in action

In this example we will try to answer this question:

'Does the applicant have any significant illnesses in his medical history?'

With these two sample documents:

'Please use application form 354-01 to enter applicants' medical history, significant illnesses and other symptoms.'

'Mr. John Doe, a 35-year-old non-smoker, is applying for a life insurance policy. He works as an accountant and leads a low-risk lifestyle. He exercises regularly and maintains a healthy diet. His medical history reveals no significant illnesses, and his family history is also clear of any hereditary diseases. He is interested in a policy with a coverage amount of $500,000'

If we do a cosine similarity comparison of the vector representations for this text (like we would for a traditional vector search), we would get the following results:

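# Determine the cosine similarity of the query and answers (to understand semantics vs. intent)
# Assumption: cosine is SciPy's cosine *distance*, so similarity = 1 - distance.
# generate_embedding is a helper defined elsewhere that returns the embedding vector for a string.
from scipy.spatial.distance import cosine
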
question_emb = generate_embedding('Does the applicant have any significant illnesses in his medical history?')

answer_1_emb = generate_embedding('Please use application form 354-01 to enter applicants medical history, significant illnesses and other symptoms.')
answer_2_emb = generate_embedding('Mr. John Doe, a 35-year-old non-smoker, is applying for a life insurance policy. He works as an accountant and leads a low-risk lifestyle. He exercises regularly and maintains a healthy diet. His medical history reveals no significant illnesses, and his family history is also clear of any hereditary diseases. He is interested in a policy with a coverage amount of $500,000')

print("Cosine Similarity of Question to Answer 1:", 1 - cosine(question_emb, answer_1_emb))
print("Cosine Similarity of Question to Answer 2:", 1 - cosine(question_emb, answer_2_emb))

Output:

Cosine Similarity of Question to Answer 1: 0.5595185612023936
Cosine Similarity of Question to Answer 2: 0.39874486454438407

So, a traditional vector search would rank document 1 higher, indicating it is more relevant to the question. This is clearly wrong.  Document 1 shares vocabulary with the question but does not address its intent, while document 2 actually answers it.
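
For reference, the similarity score used in this comparison is the dot product of the two vectors divided by the product of their lengths. A dependency-free sketch of that calculation follows; the function name is ours, and the short vectors in the usage note are toy values, not real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return cos(theta) between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way, such as `[1, 2, 3]` and `[2, 4, 6]`, score approximately 1, while orthogonal vectors score 0. The measure captures only how closely two embeddings align, which is why text that merely reuses the question's vocabulary can outrank text that answers it.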

Instead of just using cosine similarity, let's now use our LLM to evaluate the documents as well.  Here is the prompt we'll use:

I am going to supply you with a set of potential answers and your goal is to determine which of them is best able to answer the question:
Does the applicant have any significant illnesses in his medical history?
Please respond in JSON format with a "confidence" score for each example indicating your confidence the text answers the question as well as the "id" of the text.
Please also include a field called "relevant_text" which includes the text that is relevant to being able to answer the question.  
Each example will include an answer id as well as the text for the potential answer, separated by a colon.  
 
1: Please use application form 354-02 to enter applicants medical history, significant illnesses and other symptoms.
2: Mr. John Doe, a 35-year-old non-smoker, is applying for a life insurance policy. He works as an accountant and leads a low-risk lifestyle. He exercises regularly and maintains a healthy diet. His medical history reveals no significant illnesses, and his family history is also clear of any hereditary diseases. He is interested in a policy with a coverage amount of $500,000
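
A prompt of this shape can be assembled programmatically from the question and a batch of documents. The helper below is a sketch: the function name is hypothetical, and the instruction text simply mirrors the prompt shown above.

```python
from typing import List


def build_evaluation_prompt(question: str, documents: List[str]) -> str:
    """Build the scoring prompt: instructions, then one 'id: text' line per document."""
    instructions = (
        "I am going to supply you with a set of potential answers and your goal is "
        "to determine which of them is best able to answer the question:\n"
        f"{question}\n"
        'Please respond in JSON format with a "confidence" score for each example '
        "indicating your confidence the text answers the question as well as the "
        '"id" of the text.\n'
        'Please also include a field called "relevant_text" which includes the text '
        "that is relevant to being able to answer the question.\n"
        "Each example will include an answer id as well as the text for the "
        "potential answer, separated by a colon."
    )
    # Number the candidates from 1 so the ids in the reply map back to the documents.
    numbered = "\n".join(f"{i}: {doc}" for i, doc in enumerate(documents, start=1))
    return f"{instructions}\n\n{numbered}"
```

Sending several documents per prompt, as in this example, reduces the number of LLM calls, at the cost of a longer prompt for each call.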

By using this prompt and GPT-3.5 to evaluate the documents, we can see that document 2 is much more relevant to answering the user's question:

{
    "answers": [
        {
            "id": 1,
            "confidence": 0.1,
            "relevant_text": "Please use application form 354-02 to enter applicants medical history, significant illnesses and other symptoms."
        },
        {
            "id": 2,
            "confidence": 0.9,
            "relevant_text": "His medical history reveals no significant illnesses, and his family history is also clear of any hereditary diseases."
        }
    ]
}

Now, if we were to scale this from 2 documents to 50, 100, or 1,000 documents depending on our business needs, we could dramatically improve the accuracy of our RAG application.  Since each document is given a confidence score, we can easily re-sort the results and pass on the most relevant documents to our LLM to generate the answer.
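
Re-sorting comes down to parsing the model's JSON reply and ordering the entries by confidence. The sketch below assumes the reply uses an `answers` array with the `confidence` and `relevant_text` fields requested in the prompt. The function name is hypothetical, and entries that are missing a field are skipped rather than treated as errors.

```python
import json
from typing import List, Tuple


def rank_extracts(response_text: str, top_k: int = 5) -> List[Tuple[float, str]]:
    """Parse the LLM's JSON reply and return the top_k (confidence, extract) pairs."""
    answers = json.loads(response_text).get("answers", [])

    usable = []
    for answer in answers:
        try:
            usable.append((float(answer["confidence"]), str(answer["relevant_text"])))
        except (KeyError, TypeError, ValueError):
            continue  # skip malformed entries instead of failing the whole batch

    usable.sort(key=lambda pair: pair[0], reverse=True)
    return usable[:top_k]
```

When the documents are evaluated in several batches, the pairs from each batch can be concatenated and sorted once more before taking the final top five.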

The big benefit of using SuperRAG is that you can not only drastically increase the amount of data you retrieve but also extract the parts of each document that are relevant to answering the question.  This makes your final prompt much more focused, giving your generated answer much higher precision.

If you'd like to learn more about SuperRAG or see a complete example, check out this GitHub repo.

 

This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.