Many of the problems developers face with Retrieval-Augmented Generation (RAG) boil down to this: individual chunks don’t contain enough context to be used properly by the retrieval system or the Large Language Model (LLM). As a result, the system fails to answer seemingly simple questions and, more worryingly, hallucinates.
Examples of the Problem
- Chunks often refer to their subject via implicit references and pronouns. This causes them to not be retrieved when they should be, or to not be properly understood by the LLM.
- Individual chunks often don’t contain the complete answer to a question. The answer may be scattered across a few adjacent chunks.
- Adjacent chunks presented to the LLM out of order cause confusion and can lead to hallucinations.
- Naive chunking can split text “mid-thought”, leaving neither chunk with useful context.
- Individual chunks often only make sense in the context of the entire section or document, and can be misleading when read on their own.
Solution Approach
We’ve found that there are two methods that together solve the bulk of these problems:
1. Contextual Chunk Headers
The idea here is to add higher-level context to the chunk by prepending a chunk header. This chunk header could be as simple as just the document title, or it could use a combination of document title, a concise document summary, and the full hierarchy of section and sub-section titles.
2. Chunks to Segments
Large chunks provide better context to the LLM than small chunks, but they also make it harder to precisely retrieve specific pieces of information. We need a more dynamic system that can retrieve short chunks when that’s all that’s needed, but can also retrieve very large chunks when required.
Implementation of the Solution
Step 1: Breaking the Document into Sections
Information about the section a chunk comes from can provide important context, so our first step is to break the document into semantically cohesive sections. This can be done using a semantic sectioning approach, where an LLM is prompted to identify the starting and ending lines for each “semantically cohesive section.”
Example: When analyzing Nike’s 2023 Annual Report (Form 10-K), the first 10 semantically cohesive sections were identified. This helped structure the document for further processing.
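As a rough illustration of what happens after the LLM proposes section boundaries, here is a minimal sketch that turns those boundaries into section texts. The `Section` type, the `extract_sections` helper, and the 0-indexed inclusive line numbers are all assumptions made for this example, and the toy document stands in for a real one. The LLM call itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """Hypothetical shape of one LLM-proposed section."""
    title: str
    start: int  # index of the first line, inclusive
    end: int    # index of the last line, inclusive


def extract_sections(lines, sections):
    """Return (title, text) pairs in document order.

    Raises ValueError if the proposed sections overlap or fall
    outside the document, since LLM output cannot be trusted blindly.
    """
    result = []
    prev_end = -1
    for s in sorted(sections, key=lambda sec: sec.start):
        if s.start <= prev_end or s.end < s.start or s.end >= len(lines):
            raise ValueError(f"invalid section bounds: {s}")
        result.append((s.title, "\n".join(lines[s.start:s.end + 1])))
        prev_end = s.end
    return result


# Toy document; in practice `proposed` would be parsed from the LLM response.
lines = [
    "Overview",
    "We sell shoes.",
    "Risk Factors",
    "Markets may shift.",
    "Supply may slip.",
]
proposed = [Section("Overview", 0, 1), Section("Risk Factors", 2, 4)]
sections = extract_sections(lines, proposed)
```

Validating the bounds before slicing is a design choice for this sketch: a malformed response then fails loudly instead of silently dropping or duplicating text.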
Step 2: Adding Contextual Chunk Headers
The purpose of the chunk header is to add context to the chunk text. Rather than using the chunk text by itself when embedding and reranking the chunk, we use the concatenation of the chunk header and the chunk text. This helps the ranking models (embeddings and rerankers) retrieve the correct chunks, even when the chunk text itself has implicit references and pronouns that make it unclear what it’s about.
Example: For a chunk from the Nike report, the header might include the document title “Nike Inc. 2023 Form 10-K” and the section title “Management’s Discussion and Analysis of Financial Condition and Results of Operations”. This adds crucial context to the chunk’s content.
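The header construction can be sketched as follows. The function names, the `Document:` / `Summary:` / `Section:` labels, and the sample chunk text are assumptions for illustration only; the source prescribes the ingredients of the header, not its exact layout.

```python
def build_chunk_header(doc_title, section_titles=(), doc_summary=None):
    """Build a header from the document title, an optional summary,
    and the hierarchy of section titles (outermost first)."""
    parts = [f"Document: {doc_title}"]
    if doc_summary:
        parts.append(f"Summary: {doc_summary}")
    if section_titles:
        parts.append("Section: " + " > ".join(section_titles))
    return "\n".join(parts)


def text_for_ranking(header, chunk_text):
    """Concatenate header and chunk text; this combined string is what
    would be passed to the embedding model and the reranker."""
    return f"{header}\n\n{chunk_text}"


header = build_chunk_header(
    "Nike Inc. 2023 Form 10-K",
    ["Management's Discussion and Analysis of Financial Condition "
     "and Results of Operations"],
)
# Made-up chunk text with an implicit reference ("The Company").
chunk = "The Company expects these costs to decline next year."
ranked_input = text_for_ranking(header, chunk)
```

On its own, the sample chunk never names the company; with the header prepended, the text handed to the ranking models does.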
Step 3: From Chunks to Segments
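Once the chunks have been scored for relevance to the query, we combine adjacent chunks into segments. To find the optimal segment, we solve a modified version of the maximum subarray problem. Chunk values are calculated in such a way that the value of a segment is just the sum of the values of its constituent chunks. This allows us to dynamically construct segments of relevant text, and it makes our retrieval system less sensitive to mistakes made by the ranking model: a chunk that was scored too low can still end up inside a segment when the chunks around it are valuable enough.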
Example: For the query “Nike stock-based compensation expenses”, the algorithm identified chunks 397-410 as the most relevant segment of text from the document. This segment contained detailed information about the company’s stock-based compensation policies, including methods for valuing stock options.
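To make the idea concrete, here is a minimal sketch based on the classic maximum subarray algorithm (Kadane's algorithm), not the modified version described above. It assumes each chunk's value is its relevance score minus a constant penalty, so that irrelevant chunks get negative values; that formula, the `irrelevant_penalty` parameter, and the sample scores are all assumptions for illustration.

```python
def chunk_values(relevance_scores, irrelevant_penalty=0.2):
    """Assumed value function: relevance minus a constant, so that
    weakly relevant chunks count against a segment."""
    return [score - irrelevant_penalty for score in relevance_scores]


def best_segment(values):
    """Kadane's algorithm: return (start, end, total) for the contiguous
    run of chunks with the largest summed value (end is inclusive)."""
    if not values:
        raise ValueError("values must be non-empty")
    best_start = best_end = cur_start = 0
    best_sum = cur_sum = values[0]
    for i in range(1, len(values)):
        if cur_sum < 0:
            # A negative running sum can only hurt; restart the run here.
            cur_sum = values[i]
            cur_start = i
        else:
            cur_sum += values[i]
        if cur_sum > best_sum:
            best_sum = cur_sum
            best_start, best_end = cur_start, i
    return best_start, best_end, best_sum


# Made-up relevance scores for eight consecutive chunks.
scores = [0.05, 0.1, 0.7, 0.1, 0.8, 0.6, 0.05, 0.0]
start, end, total = best_segment(chunk_values(scores))
```

With these scores the best segment spans chunks 2 through 5. Chunk 3 is included despite its low score, because the relevant chunks on either side outweigh its negative value; this is the sense in which segment construction tolerates individual ranking mistakes.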
Conclusion
By using contextual chunk headers and dynamically constructing segments of relevant text, we can significantly improve the quality of information retrieval in RAG systems. This approach provides more complete context to the LLM and makes our retrieval system less sensitive to ranking mistakes, ultimately leading to more accurate and relevant responses to user queries.