Memory in LLMs: Weights and Activations - Jack Morris, Cornell
By AI Engineer
Key Concepts
- Limitations of LLMs: Current Large Language Models (LLMs) like ChatGPT struggle with specialized knowledge, recent events, and private data due to a “knowledge cut-off” and limitations in context window size.
- Knowledge Injection Approaches: Three primary methods for providing LLMs with specific information: full context, Retrieval Augmented Generation (RAG), and training the information directly into the model's weights.
- RAG Limitations: While currently dominant, RAG faces challenges in security, scalability, and capturing nuanced relationships.
- Weight-Based Training as the Future: Directly incorporating knowledge into model weights through fine-tuning (SFT, GRPO) and synthetic data generation is considered the most promising long-term solution.
- Data Volume & Update Frequency Drive Strategy: The optimal approach (RAG vs. training) depends on the size of the dataset and how often it changes.
- Hybrid Approach: A future combining periodic weight-based training with RAG for real-time information retrieval is envisioned.
Understanding LLM Knowledge Limitations
Large Language Models (LLMs) like ChatGPT, despite their impressive capabilities, have inherent limitations. They lack access to current events (e.g., a Blue Jays World Series win post-training), struggle with specialized tasks (like optimizing AMD GPU kernels), and cannot access private or company-specific data. This is due to a “knowledge cut-off” – information after their training period is inaccessible without external tools. The core problem is enabling LLMs to “know” the information we want them to know.
Approaches to Knowledge Injection
Three approaches to address this knowledge gap were identified:
- Full Context: Providing all relevant data directly to the LLM as input. Effective for smaller datasets, but it quickly becomes prohibitively expensive and slow as token count grows (throughput dropping from 10,000 tokens/second to 130 tokens/second at 128k tokens).
- Retrieval Augmented Generation (RAG): Retrieving only the most relevant information from a larger dataset and feeding it to the LLM (a minimal sketch of this pipeline follows the list). This is currently the most common approach, relying on vector databases to store and retrieve embeddings (numerical representations of text). However, vector databases raise security concerns (embeddings can be reverse-engineered back into text) and struggle to capture complex relationships.
- Training into Weights: Directly modifying the model’s parameters to incorporate new knowledge. This is considered the most promising, but also the most challenging, long-term solution.
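To make the first two options concrete, here is a minimal sketch of the RAG path in Python. The embedding model name, the sample documents, and the downstream LLM call are illustrative assumptions, not details from the talk.

```python
# Minimal RAG sketch: embed documents once, retrieve top-k by cosine
# similarity, and assemble a prompt for the LLM.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding library

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

documents = [
    "Q3 revenue grew 12% year over year.",
    "The new GPU kernel reduced latency by 40%.",
    "Headcount remained flat across engineering.",
]
doc_vecs = encoder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

question = "How did revenue change?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# prompt would now be sent to the LLM. Full-context mode would instead paste
# *all* documents into the prompt, which stops scaling as token counts grow.
```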
The Transformer Architecture & Context Window
The transformer architecture, specifically the self-attention mechanism, limits the practical size of the context window because its cost grows quadratically with input length. Models like Grok 4 (2 million tokens) and Gemini 3 (1 million tokens) are pushing boundaries, but performance degrades as context grows, especially when irrelevant information is added. LLMs also have a limited capacity to store information (estimated at 3.6 bits per parameter), so relevant knowledge must be prioritized.
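The quadratic cost is easy to see in a toy single-head attention implementation; the numbers below only illustrate the scaling, not any particular model.

```python
# Toy illustration of why self-attention is quadratic: the score matrix is
# n x n, so doubling sequence length quadruples attention compute and memory.
import numpy as np

def naive_attention(Q, K, V):
    """Single-head scaled dot-product attention; Q, K, V have shape (n, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # (n, d)

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((16, 8))
_ = naive_attention(Q, K, V)  # works, but already materializes a 16x16 matrix

for n in (1_000, 128_000):
    print(f"n={n:>7,}: score matrix holds {n * n:,} entries")
# Going from 1k to 128k tokens (128x longer) costs ~16,384x more attention work.
```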
RAG vs. Fine-tuning: A Shifting Landscape
The discussion then focused on the trade-offs between RAG and fine-tuning, meaning supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), for incorporating new information. While RAG is currently ubiquitous, the consensus was that fine-tuning the model weights offers significant potential, particularly for complex reasoning and for reducing reliance on lengthy prompts. Notably, SFT requires significantly more trainable parameters than GRPO to achieve equivalent performance.
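As a hedged illustration of the weight-based path, here is a minimal parameter-efficient SFT sketch using the Hugging Face transformers and peft libraries with LoRA. The base model (gpt2), the target module, the hyperparameters, and the training example are all placeholders; the talk did not prescribe this recipe.

```python
# Minimal LoRA fine-tuning sketch: only small adapter matrices are trained,
# so new knowledge is written into (a low-rank update of) the weights.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")       # placeholder base model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["c_attn"])             # gpt2's attention projection
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a tiny fraction of the full model

# One SFT step on a single (assumed) synthetic QA example:
batch = tokenizer("Q: How did Q3 revenue change? A: It grew 12%.",
                  return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()  # gradients flow only into the LoRA adapters
```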
The Role of Data Volume & Frequency
The volume and rate of change of data are critical factors. RAG is favored for smaller, less frequently updated datasets, while fine-tuning becomes more viable with large, static or slowly evolving datasets. Identifying the “boundary” where training becomes economically feasible compared to continually optimizing RAG is a key challenge.
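A back-of-envelope break-even model makes that boundary tangible. Every constant below is an assumed placeholder, not a figure from the discussion.

```python
# RAG pays per query (extra context tokens); training pays once per data
# refresh. Training wins when enough queries amortize the run.
RAG_EXTRA_TOKENS = 4_000         # retrieved context per query (assumed)
COST_PER_1K_TOKENS = 0.002       # $ per 1k input tokens (assumed)
TRAIN_COST_PER_REFRESH = 500.0   # $ per fine-tuning run (assumed)

rag_cost_per_query = RAG_EXTRA_TOKENS / 1_000 * COST_PER_1K_TOKENS  # $0.008

break_even_queries = TRAIN_COST_PER_REFRESH / rag_cost_per_query
print(f"Training pays off after ~{break_even_queries:,.0f} queries per refresh")
# ~62,500 queries with these numbers. Frequent data changes (more refreshes)
# or cheaper retrieval push the boundary back toward RAG.
```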
Synthetic Data Generation & Personalization
The potential of generating synthetic data to train LLMs on proprietary information was explored. This involves creating question-answer pairs or other training examples that teach the model about a specific data domain; success depends on the "information density" of the underlying data. Scaling personalization to millions of users is considered feasible by training a small model (a few megabytes) per user, an analogy to the scale of per-user data YouTube already stores, though continual updates remain a hurdle.
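A sketch of the synthetic-data loop appears below; the call_llm() helper is hypothetical and stands in for any chat-completion API.

```python
# Generate QA pairs from a proprietary document so its contents can be
# trained into the weights. How many useful pairs a document yields
# depends on its information density.
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM API call."""
    raise NotImplementedError

def make_qa_pairs(document: str, n: int = 5) -> list[dict]:
    """Ask a teacher model to write Q/A pairs grounded in the document."""
    prompt = (
        f"Write {n} question-answer pairs answerable only from this text, "
        f"one per line as 'Q: ... | A: ...':\n\n{document}"
    )
    pairs = []
    for line in call_llm(prompt).splitlines():
        if "|" in line:
            q, a = line.split("|", 1)
            pairs.append({"question": q.strip().removeprefix("Q:").strip(),
                          "answer": a.strip().removeprefix("A:").strip()})
    return pairs  # each pair becomes one SFT training example
```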
Optimization & Future Directions
Three axes of optimization were identified: data compression for RAG, the RAG implementation itself, and weight-based training. The group acknowledged being "horrible" at weight-based training, motivating a focus on improving fine-tuning techniques. A hybrid approach, periodic weight-based training combined with RAG for real-time information retrieval, is envisioned.
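One rough way such a hybrid could be wired is sketched below; the cutoff date and helper functions are assumptions, not a design from the talk.

```python
# Hybrid sketch: knowledge older than the last training run lives in the
# weights; anything newer is fetched via retrieval at query time.
from datetime import datetime

LAST_TRAIN = datetime(2025, 1, 1)  # assumed date of the last weight refresh

def generate(prompt: str) -> str:
    """Hypothetical call to the periodically fine-tuned model."""
    raise NotImplementedError

def answer(query: str, new_docs: list[tuple[datetime, str]]) -> str:
    # Only documents newer than the last training run go into the context;
    # everything older is presumed already trained into the weights.
    fresh = [doc for ts, doc in new_docs if ts > LAST_TRAIN]
    context = "\n".join(fresh)
    return generate(f"Recent context:\n{context}\n\nQuestion: {query}")
```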
Examples & Case Studies
Throughout the discussion, several examples were used to illustrate the concepts:
- Personal Experiences: Using ChatGPT for presentation preparation and cooking.
- Medical Records: Demonstrating the potential of full context with smaller datasets.
- Thesis Feedback: Illustrating performance slowdown with large context windows.
- Mastercard vs. Visa Embeddings: Showing limitations of conventional embeddings.
- Quarterly Earnings Reports (10-K/10-Q): A target for training a model to answer questions without lengthy prompts.
- GitHub Repositories: Illustrating the benefits of storing information in weights at scale.
- YouTube Data: An analogy for the scale of personalization data.
Conclusion
The conversation highlights a growing recognition that relying solely on RAG is suboptimal. The future of knowledge injection into LLMs lies in a shift towards weight-based training, particularly through fine-tuning and synthetic data generation. While challenges remain, the potential for more efficient, accurate, and personalized LLM performance is significant, and further research in this area is crucial. The speakers are actively seeking collaborators to tackle these challenges and build models that can be effectively “taught” new information.