The Holy Grail of Intelligence - Explained

By Prompt Engineering


Continual Learning: Levels, Approaches, and the Future of LLMs

Key Concepts:

  • Continual Learning: The ability of a model to learn new information over time without forgetting previously learned knowledge.
  • Catastrophic Forgetting: The tendency of neural networks to overwrite previously learned information when trained on new tasks.
  • Stability-Plasticity Tradeoff: The inherent conflict between a model’s ability to retain existing knowledge (stability) and its ability to learn new information (plasticity).
  • Session Memory: Remembering information within a single conversation.
  • Cross-Session Memory: Remembering information across multiple conversations.
  • Task Adaptation: Improving performance on a specific task over time.
  • True Continual Learning: Updating model weights in real-time without forgetting or performance degradation.
  • Learning from Failures: Utilizing mistakes as data to improve model performance.
  • Skills (Anthropic’s Claude Skills): A system for progressive disclosure of information to LLMs, enabling adaptive context and a form of practical continual learning.

The Problem with Current LLMs & The Need for Continual Learning

Current Large Language Models (LLMs) like GPT-4, Claude, and Gemini are “frozen” after their initial training: their knowledge is fixed to the data they were trained on, and they cannot learn from ongoing interactions. Every conversation starts from scratch, with nothing retained beyond the context window. This is the core problem continual learning aims to solve. The speaker emphasizes that continual learning is not just about remembering past conversations, but about fundamentally improving a model’s capabilities over time. As stated, “Every single session starts completely from scratch as if you had never talked before.”

Five Levels of Continual Learning

The speaker proposes a framework categorizing continual learning into five levels, ranging from easiest to hardest:

  1. Session Memory: Already largely solved through context windows. The focus is on increasing context window size.
  2. Cross-Session Memory: Progress is being made with external memory systems and Retrieval Augmented Generation (RAG), but retrieval can be imperfect, sometimes missing crucial context.
  3. Task Adaptation: Achieved through fine-tuning, but excessive fine-tuning can lead to a loss of general capabilities.
  4. True Continual Learning: Real-time updating of model weights without forgetting – the primary focus of research aiming for breakthroughs in 2026.
  5. Learning from Failures: The “holy grail” – using mistakes as data to improve the model, essentially self-correction. This is considered the ultimate goal of continual learning.

The speaker illustrates this with a coding example: ideally, an LLM assisting with code would remember past corrections and avoid repeating the same errors, becoming specifically tuned to a user’s codebase.
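To make the lower levels of the framework concrete, here is a minimal cross-session memory sketch in the spirit of level 2. Everything here (the `MemoryStore` class, the keyword-overlap scoring) is invented for illustration, not from the video; the crude retrieval also shows why, as noted above, retrieval can miss crucial context.

```python
# Hypothetical sketch of cross-session memory: persist corrections from
# past sessions and retrieve the most relevant ones for a new prompt.
class MemoryStore:
    def __init__(self):
        self.notes = []  # (keyword set, note) pairs saved across sessions

    def remember(self, note: str):
        self.notes.append((set(note.lower().split()), note))

    def recall(self, prompt: str, k: int = 2):
        # Score each saved note by naive keyword overlap with the prompt.
        words = set(prompt.lower().split())
        scored = sorted(self.notes, key=lambda n: len(n[0] & words), reverse=True)
        return [note for _, note in scored[:k]]

store = MemoryStore()
store.remember("use pathlib instead of os.path in this codebase")
store.remember("the test runner here is pytest, not unittest")
relevant = store.recall("write a function that joins os.path components")
```

Real systems replace the keyword overlap with embedding similarity, but the failure mode is the same: a correction phrased differently from the new prompt may simply not be retrieved.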

Catastrophic Forgetting & The Stability-Plasticity Tradeoff

The speaker explains catastrophic forgetting as the core obstacle preventing neural networks from learning continually. Because the same shared weights serve every task, training on a new task tends to overwrite previously learned information. This produces the stability-plasticity tradeoff: a model needs stability to retain existing knowledge and plasticity to learn new things, but the two pull in opposite directions. “Turn it all the way towards stability: the model can’t learn new tasks, it is frozen in place. Now turn it all the way toward plasticity: the model forgets everything with each new task.” The speaker notes that while workarounds exist, the underlying reasons for this tradeoff are not fully understood.
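The forgetting dynamic is easy to reproduce even in a one-parameter model. The sketch below (illustrative only, not from the video) trains a single shared weight on one task and then a second; because both tasks share that weight, learning the second erases the first:

```python
# Toy demonstration of catastrophic forgetting: one shared weight is
# trained on task A, then task B, and sequential training on B
# overwrites what was learned for A.

def sgd_fit(w, xs, ys, lr=0.1, epochs=50):
    """Per-sample SGD on squared error for the 1-parameter model y = w * x."""
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            w -= lr * 2 * (w * x - y) * x
    return w

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0]
task_a = [2 * x for x in xs]    # task A: y = 2x
task_b = [-1 * x for x in xs]   # task B: y = -x

w = sgd_fit(0.0, xs, task_a)        # learn task A (w converges near 2)
err_a_before = mse(w, xs, task_a)   # essentially zero
w = sgd_fit(w, xs, task_b)          # learn task B with the same weight
err_a_after = mse(w, xs, task_a)    # task A performance has collapsed
```

After the second fit, `err_a_after` is orders of magnitude larger than `err_a_before`; scale this single weight up to billions of shared parameters and you have the forgetting problem the section describes.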

Two Competing Perspectives: Skeptics vs. Pragmatists

The speaker outlines two opposing viewpoints within the AI community regarding continual learning:

  • The Skeptics (e.g., Dwarkesh Patel): Argue that current LLM architectures are fundamentally incapable of learning on the job and that a new architecture is required. Dwarkesh Patel is quoted as saying, “LLMs aren’t capable of learning on the job. We will need some new architecture to enable continual learning.” They believe scaling alone won’t solve the problem.
  • The Pragmatists (e.g., Nathan Lambert): Believe continual learning is a solvable system problem, not an algorithmic one. They suggest that scaling context windows, improving memory systems, and enhancing retrieval methods can achieve results indistinguishable from true continual learning. They view it as a matter of engineering better systems around existing models.

The speaker suggests both perspectives have merit, with the skeptics focusing on the AGI vision and the pragmatists on practical applications.

Current Workarounds: Anthropic’s Claude Skills

The speaker highlights Anthropic’s “Claude Skills” as a current workaround demonstrating the pragmatic approach. Unlike the Model Context Protocol (MCP), which dumps all information upfront, Claude Skills uses progressive disclosure: the model initially sees only skill names and descriptions, loading full instructions and files only when a specific skill is needed. This keeps the context window lean and focused on relevant information, mirroring how humans learn.
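The progressive-disclosure pattern can be sketched as a two-tier lookup: lightweight metadata that is always in context, and full instructions loaded on demand. The skill name and file contents below are invented for illustration, and a dict stands in for the markdown files on disk:

```python
# Tier 2 ("disk"): full skill bodies, kept out of the prompt until needed.
SKILL_FILES = {
    "skills/fix-imports/SKILL.md": (
        "# Fix imports\n1. Group stdlib imports first.\n2. Sort alphabetically.\n"
    ),
}

# Tier 1: lightweight metadata the model always sees.
SKILLS = {
    "fix-imports": {
        "description": "Reorder and deduplicate import blocks.",
        "path": "skills/fix-imports/SKILL.md",
    },
}

def context_preamble() -> str:
    # Only names and one-line descriptions enter the context window.
    return "\n".join(f"- {name}: {meta['description']}" for name, meta in SKILLS.items())

def load_skill(name: str) -> str:
    # Full instructions are pulled in only when the skill is invoked.
    return SKILL_FILES[SKILLS[name]["path"]]
```

The design choice is cost-shaped: `context_preamble()` spends one line of context per skill, while the full body is paid for only on use.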

Claude Skills also incorporates a self-improvement mechanism:

  • Manual Mode: Users can trigger a /reflect command to analyze a session, extract corrections, and update the skill.
  • Automatic Mode: A stop hook automatically learns from each session without user intervention.

This lets the model remember corrections and avoid repeating mistakes, producing a system that feels like it is learning. The speaker emphasizes that skills are simple markdown files, reinforcing the pragmatic approach of engineering a solution rather than solving the algorithmic problem.
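The source does not detail how the /reflect step edits a skill, but since skills are plain markdown, a minimal version might simply append the session's corrections to the file. The function name and section heading below are assumptions:

```python
def reflect(skill_md: str, corrections: list[str]) -> str:
    """Append session corrections to a skill's markdown so the next
    session loads them automatically -- no weight update involved."""
    lines = [skill_md.rstrip(), "", "## Learned corrections"]
    lines += [f"- {c}" for c in corrections]
    return "\n".join(lines) + "\n"

updated = reflect(
    "# Fix imports\nGroup stdlib imports first.\n",
    ["this codebase uses isort's black profile"],
)
```

An automatic stop hook would be the same operation triggered at session end instead of by a user command; either way, the "learning" lives in the file, not in the weights.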

Economic Implications

The speaker points out the significant economic implications of continual learning. Retraining frontier models currently costs millions of dollars. Continual learning, if successful, could drastically reduce these costs by enabling incremental updates to model weights instead of full retraining. This would shift compute costs from training to inference, lowering the barrier to entry for smaller players and potentially democratizing AI development.

Conclusion

The speaker concludes that while true algorithmic continual learning (akin to human learning) may be years away and require new architectures, pragmatic approaches like Anthropic’s Claude Skills are already delivering results. For practical applications – creating tools that remember user preferences and improve at specific tasks – we are closer than many realize. The speaker believes the pragmatic approach will win for practical applications due to its immediate effectiveness and rapid improvement, even if it’s a “hack” rather than an elegant solution. “I think we’re going to engineer our way to good enough continual learning before we actually solve it algorithmically.”
