Everything About Basic RAG in 20 Minutes (Tất tần tật về RAG cơ bản trong 20 phút)
By Việt Nguyễn AI
Key Concepts
- RAG (Retrieval-Augmented Generation): A technique that enhances AI models by retrieving relevant information from private data sources before generating a response.
- Knowledge Cutoff: The date at which a model's training data ends, limiting its knowledge of subsequent events.
- Hallucination: When an AI generates plausible-sounding but factually incorrect information.
- Chunking: The process of breaking large documents into smaller, manageable segments.
- Embedding Model: A model that converts text/data into numerical vectors.
- Vector Database: A specialized database for storing and querying embedding vectors.
- Cosine Similarity: A metric used to measure the similarity between two vectors by calculating the cosine of the angle between them.
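To make the last definition concrete, here is a minimal, dependency-free sketch of cosine similarity as described above (dot product divided by the product of the vector magnitudes); production systems would use an optimized library rather than this loop-based version.

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Values near 1.0 mean the two vectors point in nearly the same direction, i.e., the texts they encode are semantically similar.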
1. The Necessity of RAG
Modern AI tools (ChatGPT, Claude, etc.) are widely used for information retrieval. However, they face two major limitations:
- Knowledge Cutoff: Models cannot answer questions about events occurring after their training data was finalized (e.g., GPT-4's cutoff in 2023). While web browsing features mitigate this, they are not always effective for private or proprietary data.
- Lack of Private Context: AI models lack access to internal company documents, private medical records, or specific legal contracts. When asked about such data, models often "hallucinate"—providing confident but false answers.
Proposed Solution: RAG allows developers to feed "private data" into a chatbot, ensuring it only retrieves information from that specific, trusted source rather than relying on general internet knowledge.
2. The RAG Workflow: Step-by-Step
The process is divided into two main phases:
Phase A: Pre-processing (System Side)
- Knowledge Base Creation: Collecting diverse data formats (text, PDF, code, etc.).
- Chunking: Breaking large documents into smaller "chunks" to improve the model's ability to process and retrieve specific details.
- Techniques: Length-based (token count), structural (sentences/paragraphs), or Semantic Chunking (splitting based on meaning shifts).
- Embedding: Using an Embedding Model to convert chunks into numerical vectors.
  - API-based models: easy to use and high quality, but require sending data to a third party (e.g., OpenAI's text-embedding-3).
  - Self-hosted models: offer better data privacy and offline capability, but require infrastructure management (GPUs).
- Storage: Saving these vectors into a Vector Database for efficient management and retrieval.
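The Phase A steps above can be sketched end to end. This is a minimal illustration, not the video's implementation: `fake_embed` is a hypothetical placeholder (hash-based vectors carry no semantic meaning) standing in for a real embedding model, and a plain Python list stands in for a real vector database.

```python
import hashlib

def chunk_by_length(text, chunk_size=200, overlap=50):
    """Length-based chunking: fixed-size character windows with overlap,
    so content cut at a boundary also appears in the next chunk."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def fake_embed(chunk, dim=8):
    """PLACEHOLDER: a real system would call an embedding model
    (API-based or self-hosted). Hash bytes are illustration only."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dim]]

# "Vector database": here, just an in-memory list of (vector, chunk) pairs.
document = "RAG retrieves relevant chunks before generation. " * 20
index = [(fake_embed(c), c) for c in chunk_by_length(document)]
print(f"{len(index)} chunks indexed")
```

In practice the chunk size is measured in tokens rather than characters, and semantic chunking splits on meaning shifts instead of fixed windows, but the pipeline shape (chunk, embed, store) is the same.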
Phase B: Retrieval and Generation (User Side)
- Query Vectorization: The user's question is converted into a vector using the same embedding model used in the pre-processing phase.
- Similarity Search: The system searches the Vector Database for chunks whose vectors are most similar to the query vector, typically using Cosine Similarity.
- Contextual Generation: The retrieved chunks are provided as "context" to the Large Language Model (LLM). The LLM then synthesizes an answer based only on the provided context.
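Phase B can be sketched the same way. This is a toy, assumption-laden illustration: the 2-D vectors and chunk texts are made up, and the final LLM call is replaced by simply building the prompt, since the point is the retrieval step (rank stored chunks by cosine similarity to the query vector, then pass the top hits as context).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(query_vector, index, top_k=3):
    """Similarity search: sort stored (vector, chunk) pairs by cosine
    similarity to the query vector and keep the top_k chunks."""
    ranked = sorted(index, key=lambda pair: cosine(query_vector, pair[0]),
                    reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

# Toy index: 2-D vectors standing in for real embeddings.
index = [
    ([1.0, 0.0], "chunk about refund policy"),
    ([0.0, 1.0], "chunk about shipping times"),
    ([0.9, 0.1], "chunk about returns and refunds"),
]
# In a real system, the SAME embedding model from Phase A encodes the question.
query_vector = [1.0, 0.05]
context = retrieve(query_vector, index, top_k=2)
prompt = "Answer using ONLY the context below.\n\n" + "\n".join(context)
print(context)
```

The assembled `prompt` would then be sent to the LLM, which is instructed to answer only from the retrieved chunks rather than its general training knowledge.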
3. Key Arguments and Perspectives
- Data Privacy: RAG is essential for enterprises, banks, and hospitals where data security is paramount. By keeping data within a private vector database, organizations prevent sensitive information from being exposed to public models.
- Efficiency vs. Convenience: While attaching files to a chat is a temporary fix, it is inconvenient and limited by file count constraints. RAG provides a scalable, automated, and more accurate alternative.
- Technical Precision: The speaker emphasizes that "tokens" are not equivalent to "words," and that proper chunking is critical to prevent the loss of context during the retrieval process.
4. Synthesis and Conclusion
RAG is a transformative technique that bridges the gap between the vast general knowledge of LLMs and the specific, private requirements of real-world applications. By implementing a structured pipeline—from intelligent chunking and embedding to vector-based retrieval—developers can build AI agents that are reliable, context-aware, and secure. The future of AI development lies in these specialized architectures that prioritize data integrity and precise information retrieval over simple, ungrounded generation.