The limits of LLMs and how RAG remedies them

Terms like Retrieval Augmented Generation (RAG) and Large Language Models (LLMs) have attracted a lot of attention recently, and not surprisingly. Communicating with machines seems to have become completely normal. However, merely "speaking" with language models such as

  • GPT-3
  • GPT-4
  • Llama 2
  • Mistral-7B

is not what is extraordinary about this situation. What is extraordinary is that these machines - LLMs in this case - understand you. Or do they?

Try it out:

Ask a language model of your choice to explain the results of the latest polls for the federal election, and request that the answer cover both positive and negative effects.

Did it work?

In this blog post, we will show you ways to get a reliable answer to questions like these.

The limits of LLMs in answering questions

Understanding the context of a user query is not a simple trick. It is a technically complex approach that combines external retrieval systems (systems for recovering specific information from stored data) with Large Language Models (LLMs).

What dimension of information can LLMs understand and process, and what dimension can they not?

At what point LLMs fail to answer questions is something we will explore in more detail in this blog post. We will also show you how real-time information can be added to Large Language Models.

Looking for more information on using LLMs to develop Konfuzio's DocumentGPT? Read the blog post "DocumentGPT - Unleash the power of LLMs" to learn more.

Concrete limits of LLMs

Language models offer productivity gains and help us with various tasks. But as mentioned earlier, be aware that even AI-powered LLMs have their limitations. These become particularly apparent when a request involves

  • timely or current information,
  • real-time information,
  • private information,
  • domain-specific knowledge,
  • knowledge that is underrepresented in the training corpus,
  • legal aspects, or
  • linguistic aspects.

For example, ask ChatGPT about the current inflation rate in Germany. Similar to the test above, you will get an answer like this:

"I apologize for the confusion, but as an AI language model, I do not have real-time data or browsing capabilities. My answers are based on information available through September 2021. Therefore, I cannot tell you the current inflation rate in Germany."

This limitation poses a major problem. ChatGPT, like many other LLMs, is unable to provide timely and contextual information that may be critical for making informed decisions.

The cause behind the LLM limits

LLMs are "stuck in time" and unable to keep up with a rapidly evolving world for the following reason:

The training data of ChatGPT has a so-called "cutoff point". For this language model, the cutoff date is September 2021. So if you ask ChatGPT about events or developments that occurred after this date, you will get either

  • convincing-sounding but completely false information, known as "hallucination", or
  • evasive responses with implied recommendations, such as:

"My data only extends to September 2021, and I do not have access to information about events that occurred after that date. If you need information on events after September 2021, I recommend accessing current news sources or search engines to track the latest developments."

RAG as a solution to the LLM limit problem

This is exactly where Retrieval Augmented Generation (RAG) comes in. This approach closes the knowledge gap of LLMs and enables them to provide contextually accurate and up-to-date information by integrating external retrieval mechanisms.

In the following sections, we explain the concept of RAG in more detail and explore how RAG extends the boundaries of LLMs.

What is Retrieval Augmented Generation?

RAG Retrieval Augmented Generation Definition

Retrieval Augmented Generation, or RAG for short, is a method in artificial intelligence (AI) and natural language processing that aims to improve the performance of LLMs by integrating external retrieval systems. The technique retrieves data from external sources, e.g., organizational corpora or document databases, and uses it to enrich the prompts that condition the language model (LLM).

How does Retrieval Augmented Generation work?

RAG leverages the power of transformer-based models such as GPT-3 or GPT-4 in conjunction with external retrieval or search mechanisms. Instead of relying only on the internal knowledge of the model, RAG queries an external data source, typically a corpus of documents, to retrieve relevant information. This retrieved data is then passed to the model along with the user query and used to generate a contextual response.
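The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: the corpus, the document IDs, and the helper names (`tokenize`, `retrieve`, `build_prompt`) are hypothetical, and relevance is scored by simple word overlap rather than by a real search engine or embedding model. The call to the LLM itself is left out; the sketch stops at the enriched prompt.

```python
import re

# Hypothetical mini-corpus standing in for an external document store.
CORPUS = {
    "doc-1": "The cafeteria is open from 8 am to 3 pm on weekdays.",
    "doc-2": "Invoices are processed within five business days.",
    "doc-3": "The office is closed on public holidays.",
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, corpus, top_k=1):
    """Rank documents by the number of words they share with the query (toy score)."""
    query_tokens = tokenize(query)
    ranked = sorted(
        corpus.items(),
        key=lambda item: len(query_tokens & tokenize(item[1])),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, retrieved):
    """Place the retrieved passages in front of the user question."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieved)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


query = "When are invoices processed?"
prompt = build_prompt(query, retrieve(query, CORPUS))
print(prompt)  # this enriched prompt would then be sent to the LLM of your choice
```

The point of the sketch is the order of operations: the external source is queried first, and the model only sees the question together with what was retrieved.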

RAG vs. finetuning

RAG enables models to retrieve information from external sources to better understand the context of user queries and generate more accurate responses. It extends the capabilities of LLMs by connecting to knowledge bases or other information sources.

Finetuning is a process in which an already pre-trained base model, such as a Large Language Model, is adapted to specific tasks or domains. This is done by further training the model on a limited set of task-specific training data. During the fine-tuning process, the model learns how best to focus on a specific task or domain and optimizes its capabilities for that particular application.

The main difference between RAG and fine-tuning lies in how they work and what they are for

RAG focuses on improving natural language processing by integrating external information, enabling the model to better understand the context of queries and generate more accurate responses. Finetuning, on the other hand, aims to specifically adapt a pre-trained base model for a particular task or domain by drawing on a limited set of training data.

Both methods are useful, but they have different application areas and goals. RAG extends the capabilities of LLMs by integrating external information, while fine-tuning aims at customization for specific tasks or domains.

RAG comparison - advantages, disadvantages and alternatives

RAG provides a cost-effective and efficient alternative to traditional methods such as pre-training or fine-tuning base models. RAG essentially empowers large language models to directly access specific data when responding to specific prompts. To show the differences between RAG and alternatives, consider the following figure.

Specifically, the radar chart compares three different methods:

  • Pretrained LLM,
  • Pretrained + finetuned LLM and
  • Pretrained + RAG LLM.

RAG LLM comparison

This radar chart is a graphical representation of multidimensional data in which each method is evaluated against several criteria, shown as axes on the chart. The criteria include

  • Cost,
  • Complexity,
  • Domain-specific knowledge,
  • Timeliness,
  • Explainability and
  • Avoidance of hallucinations.

Each method is represented as a polygon in the diagram, with the vertices of the polygon corresponding to the values of these criteria for that method.

For example:

The "Pretrained LLM" method has relatively low values for "Cost", "Complexity", "Domain-specific knowledge", and "Avoidance of hallucinations", but higher values for "Timeliness" and "Explainability".

The "Pretrained + finetuned LLM" method, on the other hand, has higher values for "Cost", "Complexity", "Domain-specific knowledge", and "Avoidance of hallucinations", but lower values for "Timeliness" and "Explainability".

Finally, the "Pretrained + RAG LLM" method has a distinctive pattern with high values for "Timeliness", "Explainability", and "Domain-specific knowledge".

The Pretrained + RAG LLM method is characterized by domain-specific knowledge, up-to-date information, explainability, and avoidance of hallucinations. This is probably due to the fact that the RAG approach allows the model to explain information using graph structures, which can improve its understanding, prevent hallucinations, and provide more transparent and accurate answers in specific domains.

How to generate contextual, up-to-date responses with RAG

The Retrieval Augmented Generation (RAG) process starts with the following 3 steps:

  1. Create a vector database from domain-specific data:
    The first step in implementing RAG is to create a vector database from your domain-specific proprietary data. This database serves as the source of knowledge that RAG draws from to provide contextually relevant answers. Creating it involves the two steps below.
  2. Conversion to vectors (embeddings):
    To make your domain-specific data usable by RAG, you need to convert it into mathematical vectors. You do this by running your data through an embedding model, which is a special kind of Large Language Model (LLM). Embedding models can convert various types of data, including text, images, video, or audio, into arrays of numeric values. Importantly, these numeric values reflect the meaning of the input, much like a person grasps the essence of a text when hearing it read aloud.
  3. Creation of the vector database:
    Once you have obtained the vectors that represent your domain-specific data, you store them in a vector database. This database serves as a repository for semantically rich information encoded in the form of vectors. RAG searches this database for semantically similar elements based on the numerical representations of the stored data.

The following diagram illustrates how to create a vector database from your domain-specific proprietary data. To create your vector database, you convert your data to vectors by running it through an embedding model. In the following example, we convert Konfuzio Documents that contain the latest information about Konfuzio. The data can consist of text, images, video, or audio:

How to create a vector database from your domain-specific proprietary data (Vector Database and the Konfuzio Documents)
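To make the embedding and vector database steps more concrete, here is a minimal sketch in plain Python. It rests on loud assumptions: the `embed` function is a toy word-count embedding, not a real embedding model, and `VectorStore` is a hypothetical in-memory stand-in for a vector database. The sample documents are invented for illustration. What carries over to real systems is the principle: texts become vectors, and search means finding the stored vectors closest to the query vector, here by cosine similarity.

```python
import math
import re
from collections import Counter


def words(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text, vocabulary):
    """Toy embedding: count how often each vocabulary word occurs in the text."""
    counts = Counter(words(text))
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class VectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
        self.entries = []  # list of (text, vector) pairs

    def add(self, text):
        self.entries.append((text, embed(text, self.vocabulary)))

    def search(self, query, top_k=1):
        query_vector = embed(query, self.vocabulary)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:top_k]]


# Invented sample documents, used only to exercise the sketch.
documents = [
    "Invoices are extracted automatically from scanned documents.",
    "The support team answers tickets within one day.",
    "Contracts are classified by document type.",
]
vocabulary = sorted({word for doc in documents for word in words(doc)})

store = VectorStore(vocabulary)
for doc in documents:
    store.add(doc)

results = store.search("How are invoices extracted?")
print(results[0])
```

A real embedding model would also place paraphrases close together even when they share no words, which this word-count version cannot do.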

Integration of retrieved expertise (context) into LLMs

Now that you have built a vector database with domain-specific knowledge, the next step is to integrate this knowledge into LLMs. This integration is done through a so-called "context window".

Think of the context window as the LLM's field of view at a given time:

When RAG is in action, it's like holding up a map of critical points from the domain-specific database to the LLM.

This context window allows the LLM to access and integrate important data. This ensures that its responses are not only coherent, but also contextually correct.

By embedding domain-specific knowledge into the context window of the LLM, RAG increases the quality of the generated answers. RAG enables the LLM to draw on the extensive data stored in the vector database. This makes its responses more informed and relevant to the user's queries.
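Because the context window is limited, only part of the retrieved material fits into a single prompt. The following sketch shows one simple way to handle this, assuming the chunks are already ranked by relevance. The names `estimate_tokens` and `pack_context` are hypothetical, the sample chunks are invented, and counting whitespace-separated words is only a rough stand-in for a real tokenizer.

```python
def estimate_tokens(text):
    """Very rough token estimate: one token per whitespace-separated word."""
    return len(text.split())


def pack_context(ranked_chunks, max_tokens):
    """Add chunks in ranked order, skipping any chunk that would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            continue  # too large for the remaining budget, try the next chunk
        selected.append(chunk)
        used += cost
    return selected, used


# Invented chunks, assumed to be sorted from most to least relevant.
ranked_chunks = [
    "Invoices are due within thirty days.",
    "A very long chunk " + "with filler " * 20,
    "Refunds are issued to the original payment method.",
]

selected, used = pack_context(ranked_chunks, max_tokens=15)
context = "\n".join(selected)
print(f"{len(selected)} chunks, about {used} tokens")
```

Skipping an oversized chunk instead of stopping at it is a design choice: it lets smaller, lower-ranked chunks still make it into the prompt.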

In the diagram below, we illustrate how RAG works using "Konfuzio Documents" as an example:

LLM's RAG workflow with Konfuzio documents

With the help of our RAG workflows we can force our Large Language Model (generator) to stick to the content of our knowledge base (Konfuzio Documents) that is most relevant to answering the user query.

Et voilà, the result: Retrieval Augmented Generation! ✅

Update - Good to know

Konfuzio uses Azure for OCR, and Azure's API now also allows documents to be converted into Markdown. This means that Konfuzio can use this function to convert your documents into Markdown and then feed them into the generation part of RAG, which is based on an LLM.

This can improve the accuracy and performance of your RAG pipeline!

The reason is that the Markdown representation provides more information and context about the documents than before, in the form of tables, images, checkboxes, etc.

Konfuzio Azure OCR Markdown
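One way a pipeline can benefit from Markdown input is structure-aware chunking: splitting a document at its headings so that a table or checkbox list stays together with the heading it belongs to. The sketch below is an assumption about how this could be done, not a description of Konfuzio's or Azure's implementation. The function name `split_markdown_by_heading` is hypothetical, and the sample text is invented rather than real OCR output.

```python
def split_markdown_by_heading(markdown_text):
    """Group lines into chunks, starting a new chunk at every Markdown heading."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        # A line starting with "#" is treated as a heading. This simple check
        # would misfire on "#" lines inside fenced code blocks.
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


# Invented sample document for illustration only.
sample = """# Invoice 1001

| Item | Amount |
|------|--------|
| Widget | 10.00 |

## Approval

- [x] Checked by accounting
"""

chunks = split_markdown_by_heading(sample)
for chunk in chunks:
    print(chunk)
    print("---")
```

Each resulting chunk can then be embedded and stored as one unit, so the table rows are retrieved together with the section title that gives them meaning.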


The increasing integration of Large Language Models (LLMs) into our daily lives has undoubtedly brought many benefits, but it also has its limitations. The challenge is that LLMs, such as GPT-3, GPT-4, Llama 2, and Mistral-7B, have difficulty in providing timely, contextual information as well as domain-specific knowledge. This presents a significant obstacle, especially when accurate and relevant responses are required.

Retrieval Augmented Generation (RAG) is proving to be a promising solution in this regard. RAG enables the integration of external retrieval systems with large language models, allowing these models to access extensive knowledge bases and up-to-date information. This allows them to better understand user queries and provide more precise, contextual answers.

So why would you use RAG and not rely on alternative approaches?

  1. RAG enables the provision of real-time information and up-to-date knowledge, which is especially critical in fast-moving fields and for making informed decisions.
  2. RAG allows the integration of domain-specific knowledge into the answer generation. This is essential when specialized knowledge is required.
  3. Unlike some alternative approaches, RAG provides a more transparent and traceable method for answering questions because it is based on existing data and facts.
  4. RAG minimizes the likelihood of false or fabricated information by accessing external, reliable sources.

In summary, Retrieval Augmented Generation fills the gaps in the capabilities of LLMs and enables reliable answers to complex questions. This makes it a promising method for the future of human-machine communication and for support in a wide range of applications.

Do you have any questions or are you interested in a demonstration of the Konfuzio Infrastructure?

Write us a message. Our team of experts will be happy to advise you.

    Mohamed Dhiab