Guide to Building a RAG Chatbot, Version 2.0
By Việt Nguyễn AI
Key Concepts
- RAG (Retrieval-Augmented Generation): An AI framework that retrieves data from external sources to provide context to Large Language Models (LLMs).
- Chunking: The process of breaking down large documents into smaller, manageable segments (chunks) for vectorization.
- Recursive Character Text Splitter: A basic chunking method that splits text based on character count and predefined separators.
- Semantic Chunking: An advanced technique that splits text based on the semantic meaning and context of sentences rather than fixed character lengths.
- Vector Database: A storage system for embedding vectors used for similarity searches.
- Breakpoint Threshold: A numerical value used in semantic chunking to determine when to split text based on the similarity score between consecutive sentences.
1. The RAG Workflow and Limitations
The video reviews the standard RAG pipeline:
- Document Loading: Importing raw data (e.g., PDFs).
- Chunking: Dividing text into segments.
- Embedding: Converting text segments into fixed-length vectors.
- Vector Storage: Saving vectors in a database.
- Retrieval: Matching user queries to relevant chunks via vector similarity.
- Generation: Passing the retrieved context and query to an LLM to generate a response.
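The retrieval and generation steps above can be sketched in a few lines. This is a minimal illustration, not the speaker's code: `embed` is a toy bag-of-words stand-in for a real embedding model (so its vectors are sparse rather than fixed-length), the chunk list plays the role of the vector database, and `retrieve` and `build_prompt` are hypothetical helper names. The LLM call itself is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the LLM."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


# Hypothetical sample data standing in for stored chunks.
chunks = [
    "The warranty covers battery defects for two years.",
    "Shipping takes three to five business days.",
]
query = "How long is the battery warranty?"
top = retrieve(query, chunks)
prompt = build_prompt(query, top)
```

Whatever text ends up in `top` is all the model sees, which is why the quality of each stored chunk matters so much.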
The Problem: The speaker argues that the chunking step is often overlooked. Splitting with the Recursive Character Text Splitter can produce incomplete chunks: sentences are cut off partway, sometimes mid-word, and tables are split across pages. The resulting fragmented context degrades the quality of the LLM's output.
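The failure mode is easiest to see with a deliberately naive fixed-width splitter. The sketch below is an illustration only and is not LangChain's RecursiveCharacterTextSplitter, which tries separators such as paragraph breaks and spaces before falling back to hard cuts; `split_fixed` and the sample text are made up for this example.

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Cut text every chunk_size characters, ignoring word and sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


text = "Refunds are issued within 14 days. Contact support to start a return."
chunks = split_fixed(text, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

With a chunk size of 30, the first cut lands inside the word "days", so the refund period is split between two chunks and neither one states it cleanly.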
2. Advanced Methodology: Semantic Chunking
To overcome the limitations of character-based splitting, the speaker introduces Semantic Chunking.
Step-by-Step Process:
- Sentence Segmentation: The text is broken into individual sentences using punctuation (periods, exclamation marks, etc.).
- Vectorization: Each sentence is converted into an embedding vector.
- Similarity Calculation: The system calculates the semantic similarity (cosine similarity) between consecutive sentences.
- Threshold Comparison: A "breakpoint" threshold (e.g., 0.85) is set.
  - If similarity > 0.85: The sentences are considered semantically related and kept in the same chunk.
  - If similarity < 0.85: A semantic shift is detected, and a new chunk is created.
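The four steps above can be sketched from scratch as follows. This is a minimal illustration under stated assumptions, not the speaker's or LangChain's implementation: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, `semantic_chunks` is a hypothetical name, and a similarity exactly equal to the threshold (a case the summary does not cover) is treated as "keep together".

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for a sentence-embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, threshold: float) -> list[str]:
    # Step 1: sentence segmentation on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    # Step 2: vectorize every sentence.
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, curr, sentence in zip(vectors, vectors[1:], sentences[1:]):
        # Steps 3 and 4: compare each sentence with the one before it.
        if cosine(prev, curr) >= threshold:
            chunks[-1].append(sentence)   # related: extend the current chunk
        else:
            chunks.append([sentence])     # semantic shift: start a new chunk
    return [" ".join(c) for c in chunks]


text = (
    "The battery lasts ten hours. The battery charges in one hour. "
    "Shipping is free for orders over fifty dollars."
)
# 0.3 suits the toy word-count vectors; a value like 0.85 assumes real embeddings.
chunks = semantic_chunks(text, threshold=0.3)
```

Here the two battery sentences share enough vocabulary to stay together, while the shipping sentence starts a new chunk. With real embeddings the similarity scores are on a different scale, so the threshold has to be tuned for the model in use.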
3. Implementation and Technical Details
The speaker demonstrates the transition from the basic RecursiveCharacterTextSplitter to the SemanticChunker using the LangChain framework.
- Code Modification: Replace the old splitter with SemanticChunker.
- Key Parameters:
  - embedding_model: The model used to generate sentence embeddings (e.g., OpenAI Embeddings).
  - breakpoint_threshold_type: Set to "percentile" or "standard deviation" (implied by the 0.85 threshold logic).
  - breakpoint_threshold_amount: Set to 0.85 to define the sensitivity of the split.
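A hedged sketch of the swap is shown below. It assumes the `langchain_experimental` package, where `SemanticChunker` has been published, and takes any LangChain-compatible embeddings object as its first argument; `build_semantic_chunker` and `split_texts` are hypothetical wrapper names. Two caveats are worth checking against the installed version: the embeddings object is passed positionally rather than under the name `embedding_model`, and published versions compute breakpoints on the distance between neighbouring sentences, reading a "percentile" amount on a 0 to 100 scale, so a raw value of 0.85 may not behave as the similarity rule above suggests.

```python
def build_semantic_chunker(embeddings, threshold_type="percentile", threshold_amount=None):
    """Create a SemanticChunker around an existing embeddings object.

    The import is deferred so this module still loads where LangChain is absent.
    threshold_amount=None leaves the library's default for the chosen type.
    """
    from langchain_experimental.text_splitter import SemanticChunker

    return SemanticChunker(
        embeddings,
        breakpoint_threshold_type=threshold_type,
        breakpoint_threshold_amount=threshold_amount,
    )


def split_texts(embeddings, texts):
    """Split raw strings into LangChain Document objects, one per semantic chunk."""
    chunker = build_semantic_chunker(embeddings)
    return chunker.create_documents(texts)
```

The rest of the pipeline (embedding the resulting chunks, storing them, retrieving) stays unchanged, since only the splitter is replaced.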
4. Comparative Analysis
| Feature | Recursive Character Splitter | Semantic Chunker |
| :--- | :--- | :--- |
| Basis | Character count | Semantic meaning |
| Integrity | Often cuts words/tables mid-way | Preserves sentence/table structure |
| Context | Fragmented/Incomplete | Coherent and contextually complete |
| Cost/Complexity | Low (computationally cheap) | Higher (requires embedding calls) |
5. Notable Quotes
- "If this step [chunking] is done poorly, we will produce low-quality bricks. But once the bricks are of poor quality, no matter how skilled the engineer... we will never be able to build a good, durable house."
- "Semantic chunking will allow us to cut text based on semantic changes in content, rather than on the number of characters."
6. Synthesis and Conclusion
The primary takeaway is that data preparation is the foundation of RAG performance. While character-based splitting is easy to implement, it frequently destroys the logical flow of information. Semantic chunking, despite the added cost of extra API calls for embeddings, significantly improves the quality of retrieved context by ensuring that chunks are semantically self-contained. The speaker concludes by noting that while semantic chunking is superior for most use cases, it is just one of many techniques, and developers should choose their chunking strategy based on the specific nature of their data.