With the explosion of LLMs and Chat Assistants, researchers and users of these models quickly bump into a limiting factor:
What happens if my model has not been trained on the specific topic or dataset I'm interested in prompting it with?
If we do nothing, then we're likely to get unhelpful responses at best, or factually incorrect hallucinations at worst. However, there are solutions to this problem.
Retrieval Augmented Generation (RAG) is a technique which automates the retrieval of relevant information from datastores connected to a language model, aiming to optimize the output of the model. Ideally, the RAG technique eliminates the unhelpful responses and factually incorrect hallucinations described above.
RAG has multiple stages, many of which are partially or fully implemented in libraries and products surrounding emerging LLM solutions, though a naive RAG approach can be quite easy to implement yourself. Because of its simplicity and fast path to reasonable results, RAG is a fundamental technique in most LLM-based solutions emerging today.
In theory, RAG allows us to quickly pull in relevant context and produce more reliable answers. It opens up a wide range of enterprise data stores to immediate, interactive query.
It is imaginable that almost any internal data store could become part of an interactive knowledge base using RAG as a foundation. Indeed, this is what hyperscalers are beginning to present to their users and customers as the fundamentals of an AI-based enterprise. RAG also reliably sidesteps the issue of training data being out of date by allowing the model to access up-to-date sources with much greater ease.
(Related reading: LLM security with the OWASP Top 10 threats to LLMs.)
RAG approaches the problem of adding necessary additional context in much the same way a human does when asked a question they don't already know the answer to: find a relevant source, pull out the passages that matter, and use them to compose an answer.
In its simplest form, a RAG system has three components which map to those steps: indexing, retrieval, and generation.
Indexing is the process of taking a set of documents or a datastore which you would like your model to be able to access, cleaning it, splitting it into appropriately sized chunks, and then embedding those chunks to form an index. This makes finding the most relevant parts of your datastore easier for your model.
Indexes are often implemented as a vector database, because such a database paired with an appropriate embedding model makes finding similar text chunks very efficient.
(Related reading: data normalization.)
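To make this concrete, here is a minimal sketch of the indexing step, assuming the sentence-transformers library and a simple in-memory index. The model name, chunk size, and overlap values are illustrative choices rather than recommendations.

```python
# Minimal indexing sketch: documents are split into overlapping chunks, each
# chunk is embedded, and the vectors are kept alongside the text so they can
# be searched later. Model name and chunk sizes are illustrative.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping, character-based chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def build_index(documents: list[str]) -> tuple[list[str], np.ndarray]:
    """Embed every chunk of every document and return (chunks, vectors)."""
    chunks = [c for doc in documents for c in chunk_text(doc)]
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    return chunks, np.asarray(vectors)
```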
Retrieval is the process of taking a user input query and using the index to find chunks of text in your datastore which are relevant to creating an answer.
This is achieved by transforming your input query using an embedding model, which produces a vector. This vector can then be used to find similar chunks of text stored in the indexed vector database. For example, you might gather the top k most relevant chunks from your index.
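A minimal sketch of retrieval might look like the following, using the chunks and vectors produced by the indexing sketch above. The embedding model and the value of k are illustrative.

```python
# Minimal retrieval sketch: embed the query with the same model used at
# indexing time, then return the top-k chunks by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # same model as used for indexing

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    # With normalized embeddings, a dot product equals cosine similarity.
    scores = vectors @ query_vec
    top_k = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_k]
```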
Lastly, generation takes the initially proposed query or prompt and combines it with the relevant chunks of information gathered via retrieval to produce a final prompt for the LLM or Assistant.
In the best case this prompt contains all of the information needed for the LLM to produce an appropriate response.
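As a sketch of the generation step, the snippet below folds the retrieved chunks into a single augmented prompt and sends it to a chat model via the openai Python client. The prompt template and model name are illustrative assumptions, not a prescribed implementation.

```python
# Minimal generation sketch: place the retrieved chunks into the prompt so the
# model can answer from the supplied context rather than its training data alone.
from openai import OpenAI

client = OpenAI()

def generate(query: str, retrieved_chunks: list[str]) -> str:
    """Build an augmented prompt from the retrieved chunks and ask the model."""
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```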
RAG is beneficial in a number of LLM applications as it automates retrieval of relevant information for answering queries that otherwise might not be available to the pre-trained model. It can be used to pull in proprietary or context-specific information without the need for expensive and slow model fine-tuning or re-training.
This allows organizations to use third-party models to answer questions on relevant data without the need to create their own from scratch. (This is important in enabling more people, teams, and organizations to experiment with LLMs.)
It also potentially reduces the rate of hallucinations or unhelpful responses.
Whilst the theory presented above is relatively simple to understand, the devil, as usual, is in the detail. There are a number of areas where RAG can be difficult to implement or can struggle to produce the best answers.
For example, depending upon the methods used to index and embed a datastore, the retrieval step may struggle to find either the most relevant chunks or all appropriate context needed to find an answer.
The retrieval and generation steps are also quite sensitive to the size of the chunks used. For instance, chunks that are too small may lack the context needed to answer a question, while chunks that are too large can dilute relevance and consume more of the model's context window.
RAG also does not prevent hallucinations in responses, so it can remain difficult to fully trust the outputs of models.
(Related reading: principles for trustworthy AI.)
OpenAI's CEO Sam Altman has noted in interviews that he was surprised by how quickly ChatGPT was adopted and grew. OpenAI's expectation was that many enterprises would want to fine-tune models, and that creating fine-tuned models would therefore be a limiting factor on adoption.
Part of this unexpected explosion was that so many users realized answers could be gathered by augmenting queries with contextual information directly within prompts: the concept of prompt engineering.
RAG automates this process. Therefore, RAG is likely to be a fundamental component of almost all LLM systems used in enterprise settings, except where significant effort is expended to fine-tune or train models for specific tasks.
It has become very obvious to those using RAG and LLM systems that there are many challenges in producing reliable and consistent answers, even when using RAG. Today there is a plethora of modified RAG approaches which attempt to improve on the naive approach in a number of ways.
Advanced RAG includes pre- and post-processing steps for data and prompts, adjusting them to better fit with data structures and models in specific cases, thus improving answer accuracy and value.
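As an illustration of what these pre- and post-processing steps can look like, the sketch below rewrites the query before retrieval and re-ranks the retrieved chunks afterwards. Both the rewrite prompt and the simple term-overlap heuristic are illustrative assumptions, not a prescribed recipe.

```python
# Illustrative advanced-RAG additions around the naive pipeline:
#  - pre-retrieval: rewrite the user's question into a keyword-rich search query
#  - post-retrieval: re-rank chunks before they are placed into the prompt
from openai import OpenAI

client = OpenAI()

def rewrite_query(query: str) -> str:
    """Pre-processing step: ask a model to rephrase the question for search."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rewrite this question as a short, keyword-rich search query: {query}",
        }],
    )
    return response.choices[0].message.content

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    """Post-processing step: favour chunks sharing the most terms with the query."""
    terms = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(terms & set(c.lower().split())), reverse=True)[:top_n]
```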
Modular RAG uses additional modules to manipulate inputs, outputs, and responses in various ways. This may mean adding additional contextual interfaces, such as search, memory, or routing modules.
Modular RAG may also involve creating new iterative pipelines which allow responses to be iteratively refined or fused together using multiple prompts or models. It may further allow the integration of feedback mechanisms to tune prompts and retrieval mechanisms over time.
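One way to picture such an iterative pipeline is a loop in which each answer informs the next retrieval, as in the sketch below. The retrieval and generation functions are passed in (for example, the sketches shown earlier), and the number of rounds and feedback rule are illustrative assumptions.

```python
# Illustrative iterative loop: each round retrieves with the question plus the
# current answer, so later rounds can pull in supporting or corrective context.
from typing import Callable

def iterative_answer(
    query: str,
    retrieve: Callable[[str], list[str]],       # e.g. the retrieval sketch above
    generate: Callable[[str, list[str]], str],  # e.g. the generation sketch above
    rounds: int = 2,
) -> str:
    search_text, answer = query, ""
    for _ in range(rounds):
        context = retrieve(search_text)
        answer = generate(query, context)
        # Feed the current answer back in so the next retrieval can refine it.
        search_text = f"{query}\n{answer}"
    return answer
```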
There are tens or hundreds of RAG variations that have been, or could be, created using these and other approaches, each with its own specific advantages and disadvantages. As the technology behind LLMs and assistants progresses, we will begin to get a sense of which approaches work for different applications.
In the meantime, many of these naive, advanced, and modular approaches are implemented in libraries such as LangChain and LlamaIndex, making it easier to start experimenting with LLM-based systems that integrate other datastores.
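For example, a quickstart-style LlamaIndex snippet can stand up a naive RAG pipeline in a few lines. Exact import paths and defaults vary between library versions, and the data folder and question below are illustrative.

```python
# A quickstart-style LlamaIndex sketch: load documents from a folder, build a
# vector index, and query it. Import paths and defaults vary across versions.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What does our internal documentation say about this topic?"))
```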
Within Splunk, keep an eye on the work of Huaibo Zhao and Philipp Drieger, as well as future developments in the Splunk App for Data Science and Deep Learning, as we further expand its capabilities to include LLM integrations and ways to implement RAG on your own data in Splunk.
Splunk's own AI assistants rely on RAG as a core component, and you can read more about that in this Technical Review of Splunk AI Assistant for SPL.
At the time of writing (August 2024), this review paper has a great summary of the state of the art with respect to RAG, so please take a look to dive into the details.
See an error or have a suggestion? Please let us know by emailing ssg-blogs@splunk.com.
This posting does not necessarily represent Splunk's position, strategies or opinion.
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company with over 7,500 employees, more than 1,020 patents to date, and availability in 21 regions around the world. Splunk offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.