2 NEW Methods to Level Up Your RAG AI Agent (n8n)
By The AI Automators
AI, Technology
Key Concepts
- Lost Context Problem: The issue where RAG agents fail to answer accurately or hallucinate due to the separation of document chunks, leading to a loss of contextual information.
- Chunking: Dividing a document into smaller segments for processing by a RAG system.
- Late Chunking: An approach where embeddings are created for the entire document first, followed by chunking, preserving contextual relationships between segments.
- Contextual Retrieval: A technique that leverages LLMs to provide context to each chunk, enhancing retrieval accuracy by adding descriptive blurbs.
- Long Context Embedding Models: Embedding models with large context windows, enabling the processing of entire documents for more accurate embeddings.
- Pooling/Aggregation: Averaging the token-level vectors that fall within a chunk to produce a single embedding that represents the chunk.
- Prompt Caching: Storing a large, reused portion of a prompt (such as a full document) on the LLM provider's servers so that repeated requests referencing it are faster and cheaper.
- Vector Store: A database that stores vector embeddings for efficient similarity search and retrieval.
- RAG (Retrieval-Augmented Generation): A framework where an AI model retrieves relevant information from a knowledge base to generate more accurate answers.
1. The Lost Context Problem in RAG Systems
- Problem Definition: Standard RAG systems often struggle with accuracy and can hallucinate answers due to the "lost context problem." This occurs when documents are split into chunks, and each chunk is processed independently, losing the contextual relationship with other parts of the document.
- Example: A Wikipedia article about Berlin is chunked into sentences. Only the first sentence mentions "Berlin." Subsequent sentences refer to "its" or "the city," requiring the context of the first sentence to be understood.
- Standard RAG Process: Each chunk is sent independently to an embedding model to create a vector, and the vectors are stored in a vector store. Because every chunk is embedded in isolation, the surrounding context is lost (a minimal sketch of this baseline indexing path follows this list).
- Consequences:
- Inaccurate Retrieval: Chunks with important information may not be retrieved because they don't explicitly mention the query term.
- Hallucinations: Unrelated chunks with higher scores may be retrieved, leading the LLM to generate responses based on irrelevant information.
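To make the problem concrete, here is a minimal sketch of the standard indexing path described above, where each chunk is embedded in isolation. The sentence-transformers library and the model name are illustrative assumptions, not the exact stack used in the video.

```python
# Hedged sketch of the standard indexing path: each chunk is embedded in
# isolation, so "Its" / "The city" lose their link back to "Berlin".
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any off-the-shelf embedding model

chunks = [
    "Berlin is the capital and largest city of Germany.",
    "Its 3.8 million inhabitants make it the most populous city in the EU.",
    "The city is also one of the states of Germany.",
]

# Each chunk is encoded independently; nothing in the second or third vector
# reflects that the document is about Berlin.
vectors = model.encode(chunks)  # shape: (3, embedding_dim)

# In a real pipeline these vectors would now be upserted to a vector store
# (e.g. Qdrant) together with the chunk text as payload.
```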
2. Late Chunking: Embedding First, Then Chunking
- Concept: Late chunking addresses the lost context problem by embedding the entire document (or as much as fits within the long context window) before chunking.
- Process:
- Embedding: Load the entire document into a long-context embedding model; every token in the document receives a vector embedding.
- Chunking: Apply a chunking strategy (sentences, paragraphs, fixed length, etc.) to segment the document.
- Vector Association: Identify the vectors associated with each chunk.
- Pooling/Aggregation: Average the vectors within each chunk to create a single embedding representing the chunk.
- Storage: Store the aggregated embeddings in a vector database.
- Advantage: Because all embeddings are created within the context of the entire document, the relationships between sentences and paragraphs are preserved. Even if a chunk doesn't explicitly mention the query term, its embedding will reflect the contextual information.
- Requirement: Long-context embedding models are essential for this technique; for example, Mistral's and Jina AI's embedding models support roughly 32,000 and 8,000 tokens of context, respectively.
- Example: In the Berlin example, even though the second and third sentences don't mention "Berlin," their tokens will reflect that "Berlin" was in context during embedding.
- n8n Implementation: The video demonstrates an n8n workflow that implements late chunking using Jina AI's embedding model. It fetches a document, splits it into large chunks (limited by the model's context window), further splits these into granular chunks, generates embeddings for all granular chunks together so they share context, and upserts the vectors to a Qdrant vector store (a minimal sketch of the embed-then-pool step follows this list).
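The sketch below illustrates the embed-then-pool idea at the heart of late chunking, separate from the n8n workflow itself. It assumes a long-context embedding model loaded through Hugging Face transformers with a fast tokenizer (the model name is illustrative) and naive sentence chunking; in practice the video calls Jina AI's hosted embedding API instead.

```python
# Late chunking sketch: embed the whole document once, then mean-pool the token
# vectors that fall inside each chunk. Model name and chunking are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"  # assumption: ~8k-token context window
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)

document = (
    "Berlin is the capital and largest city of Germany. "
    "Its 3.8 million inhabitants make it the most populous city in the EU. "
    "The city is also one of the states of Germany."
)
sentences = [s for s in document.split(". ") if s]  # naive sentence chunking

# 1) Embed the WHOLE document once, keeping one vector per token.
encoded = tokenizer(document, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True)
offsets = encoded.pop("offset_mapping")[0]  # character span of each token
inputs = {k: v for k, v in encoded.items() if k in ("input_ids", "attention_mask")}
with torch.no_grad():
    token_vectors = model(**inputs).last_hidden_state[0]  # (num_tokens, dim)

# 2) Map each chunk to its token span and 3) mean-pool those token vectors
#    into a single chunk embedding (the pooling/aggregation step).
chunk_embeddings = []
cursor = 0
for sentence in sentences:
    start, end = cursor, cursor + len(sentence)
    cursor = end + 2  # skip the ". " separator
    token_ids = [i for i, (s, e) in enumerate(offsets.tolist())
                 if s >= start and e <= end and e > s]
    if token_ids:
        chunk_embeddings.append(token_vectors[token_ids].mean(dim=0))

# chunk_embeddings now holds one context-aware vector per sentence, ready to
# upsert into the vector store with the chunk text as payload.
```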
3. Contextual Retrieval with Context Caching
- Concept: Contextual retrieval leverages the long context window of LLMs to provide context to each chunk.
- Process:
- Chunking: Split the document into chunks.
- Contextual Analysis: Send each chunk and the entire document to an LLM. Ask the LLM to analyze the chunk in the context of the document and provide a descriptive blurb.
- Contextual Augmentation: Add the descriptive blurb to the chunk.
- Embedding: Send the augmented chunk to an embedding model to create a vector.
- Storage: Store the vectors in a vector database.
- Advantage: The descriptive blurb provides additional context to the embedding, improving retrieval accuracy.
- Challenge: This approach can be time-consuming and expensive, as it requires sending the entire document to the LLM for each chunk.
- Context Caching: To mitigate the cost, prompt caching can be used. The entire document is cached on the LLM's servers, and the LLM is instructed to use the cached document when generating the descriptive blurb for each chunk.
- n8n Implementation: The video demonstrates an n8n workflow that implements contextual retrieval with context caching using Gemini 1.5 Flash. It fetches a document, estimates its token length, caches the document on Gemini's servers, splits the document into chunks, loops through the chunks, enhances each chunk with a descriptive blurb generated by Gemini, and upserts the vectors to a Qdrant vector store (a minimal sketch of this flow follows this list).
- Gemini 1.5 Flash: Used for its long context window and cost-effectiveness. Files need to be larger than 32,000 tokens to be cached.
- Rate Limits: The video highlights the challenges of rate limits when calling the LLM for each chunk. The author had to adjust the batch size and add delays to avoid exceeding the token limits per minute.
- Cost Analysis: The author estimates that processing a single PDF document using contextual retrieval with context caching would cost around $130 in Gemini tokens.
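The following is a hedged sketch of the contextual-retrieval-with-caching flow described above, using the google-generativeai Python SDK's context-caching interface. The model version, chunk size, TTL, sleep interval, and the final embed/upsert step are illustrative assumptions rather than the video's exact workflow.

```python
# Contextual retrieval with Gemini context caching: cache the document once,
# then generate a situating blurb for each chunk against the cached document.
import time
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_GEMINI_API_KEY")

document_text = open("large_report.txt").read()  # must exceed the ~32k-token caching minimum
chunks = [document_text[i:i + 2000] for i in range(0, len(document_text), 2000)]

# 1) Cache the entire document once on Gemini's servers.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",  # assumption: a cache-capable model version
    system_instruction="You situate document chunks within the full cached document.",
    contents=[document_text],
    ttl=datetime.timedelta(minutes=30),
)
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# 2) For each chunk, ask the model (with the cached document in context) for a
#    short blurb situating the chunk, then prepend the blurb to the chunk.
augmented_chunks = []
for chunk in chunks:
    prompt = (
        "Here is a chunk from the cached document:\n\n"
        f"{chunk}\n\n"
        "In one or two sentences, describe where this chunk fits within the "
        "overall document, to improve search retrieval of the chunk."
    )
    blurb = model.generate_content(prompt).text.strip()
    augmented_chunks.append(f"{blurb}\n\n{chunk}")
    time.sleep(2)  # crude pacing to stay under per-minute rate limits

# 3) Each augmented chunk would now be embedded and upserted to the vector
#    store (e.g. Qdrant), exactly as in the standard pipeline.
```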
4. Comparison and Evaluation
- Late Chunking vs. Standard RAG: The video compares the quality of answers generated by a RAG system using late chunking versus a standard RAG setup. The late chunking approach appears to retrieve a wider array of chunks, resulting in more comprehensive answers.
- Contextual Retrieval vs. Standard RAG: The video also compares the quality of answers generated by a RAG system using contextual retrieval versus a standard RAG setup. The contextual retrieval approach seems to provide more detailed and thorough answers.
- Evaluation Framework: The video emphasizes the importance of having an evaluation framework to compare different RAG techniques and determine which one performs best for a specific use case (a minimal example follows this list).
- Quantitative Benchmarks: The video references quantitative evaluations from Jina AI and Anthropic, which show that late chunking and contextual retrieval can significantly improve retrieval accuracy, especially for longer documents.
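As a starting point for such a framework, here is a minimal sketch that compares two retrieval pipelines by top-k hit rate over a small gold set. The gold questions and the retrieve_* functions are hypothetical stand-ins for your own pipelines.

```python
# Minimal evaluation harness: run the same gold questions against two retrieval
# pipelines and compare how often the relevant chunks appear in the top-k results.

gold_set = [
    {"question": "How many people live in Berlin?", "relevant_ids": {"berlin-002"}},
    {"question": "Which country is Berlin the capital of?", "relevant_ids": {"berlin-001"}},
]

def hit_rate(retrieve, k=5):
    """Fraction of questions whose top-k results contain at least one relevant chunk."""
    hits = 0
    for item in gold_set:
        retrieved_ids = {chunk["id"] for chunk in retrieve(item["question"], k)}
        if retrieved_ids & item["relevant_ids"]:
            hits += 1
    return hits / len(gold_set)

# Example usage, assuming retrieve_standard / retrieve_late_chunking exist:
# print("standard RAG :", hit_rate(retrieve_standard))
# print("late chunking:", hit_rate(retrieve_late_chunking))
```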
5. Key Takeaways and Conclusion
- The lost context problem is a significant challenge in RAG systems, leading to inaccurate retrieval and hallucinations.
- Late chunking and contextual retrieval are two techniques that can mitigate this problem by preserving contextual relationships between document segments.
- Late chunking involves embedding the entire document before chunking, while contextual retrieval involves using an LLM to provide context to each chunk.
- Both techniques require long context window models and can be more complex and expensive to implement than standard RAG.
- The choice of technique depends on the specific use case, the size and type of documents, and the available resources.
- The video provides practical examples of how to implement these techniques in n8n, highlighting the challenges and considerations involved.
- Joining the AI Automators community provides access to the workflows and support needed to test and implement these advanced RAG techniques.