Stanford CS25: Transformers United V6 I From Next-Token Prediction to Next-Generation Intelligence

Key Concepts

Two-Phase Pre-training: A curriculum strategy prioritizing data diversity in Phase 1 and high-quality data in Phase 2.
Front-loading Reasoning: The practice of injecting reasoning-style data during pre-training rather than treating it as a post-training (SFT/RL) afterthought.
RLP (Reinforcement as a Pre-training Objective): A method where models generate explicit reasoning traces (thoughts) before predicting the next token, rewarded by information gain.
Information Gain Reward: A dense, non-binary reward computed as log(P_theta) - log(P_phi), where P_theta is the probability assigned to the next token given a reasoning trace, and P_phi is the probability a "No-Think" baseline assigns to the same token without the trace.
Data Mixture Optimization: A process that combines quality estimation (using classifiers such as the FineWeb-Edu classifier) with epoch estimation (determining the optimal repeat count for each data source).

1. The Recipe for SOTA LLMs

The speaker defines four pillars for building state-of-the-art (SOTA) Large Language Models:

Smart Data: High-quality, diverse, filtered, and deduplicated data.
Smart Architecture: Evolving from standard Transformers to hybrid architectures that incorporate Mamba 2 layers.
Smart Algorithms: Advanced training recipes (e.g., curriculum learning, RLP).
Smart Collaboration: Synergy between pre-training, post-training, research, and engineering teams.

2. Maximizing Data Potential: The Two-Phase Approach

The speaker contrasts four hypothetical learners (Pascal, Volta, Ampere, and Hopper) to illustrate the impact of training strategies.

Methodology:
- Quality Estimation: Using classifiers to weigh high-quality data (e.g., math, code, Wikipedia) more heavily than low-quality web crawls.
- Epoch Estimation: Determining the maximum number of times a data source can be repeated before yielding diminishing returns.
- Two-Phase Curriculum: Phase 1 focuses on broad diversity (web crawls); Phase 2 focuses on high-quality data (math, code).

Evidence: The two-phase approach (Volta) outperformed a random-ordering baseline (Pascal) by 17% on average.
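The methodology above can be illustrated with a minimal sketch. Nothing here comes from the talk itself: the `Source` fields, the `blend` function, the `quality_power` knob, and every number are hypothetical, chosen only to show how quality scores, epoch caps, and a two-phase schedule might interact. For simplicity the epoch cap is applied per call rather than shared across both phases.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    tokens: float      # unique tokens available (made-up units)
    quality: float     # hypothetical classifier score in (0, 1]
    max_epochs: float  # hypothetical repeat limit before returns diminish


def blend(sources, budget, quality_power):
    """Split a token budget across sources.

    Each source is weighted by tokens * quality**quality_power and capped
    at tokens * max_epochs. Budget that a capped source cannot absorb is
    redistributed among the remaining sources.
    """
    alloc = {s.name: 0.0 for s in sources}
    active = list(sources)
    remaining = float(budget)
    while active and remaining > 1e-9:
        weights = {s.name: s.tokens * s.quality ** quality_power for s in active}
        total = sum(weights.values())
        capped, spent = [], 0.0
        for s in active:
            share = remaining * weights[s.name] / total
            room = s.tokens * s.max_epochs - alloc[s.name]
            give = min(share, room)
            alloc[s.name] += give
            spent += give
            if share >= room:
                capped.append(s)  # this source has hit its epoch cap
        remaining -= spent
        if not capped:
            break
        active = [s for s in active if s not in capped]
    return alloc


sources = [
    Source("web_crawl", 1000, 0.30, 1),
    Source("math", 50, 0.90, 4),
    Source("code", 100, 0.80, 4),
    Source("wikipedia", 10, 0.95, 4),
]

# Phase 1: quality_power=0 ignores quality, so the blend follows raw size
# and is dominated by the broad web crawl (diversity).
phase1 = blend(sources, budget=600, quality_power=0.0)

# Phase 2: a high quality_power shifts the blend toward math and code.
phase2 = blend(sources, budget=400, quality_power=4.0)

for name in phase1:
    print(f"{name:10s} phase1={phase1[name]:7.1f} phase2={phase2[name]:7.1f}")
```

In this toy setup the web crawl takes most of the Phase 1 budget and a small fraction of the Phase 2 budget, which mirrors the diversity-then-quality ordering described above. The epoch cap only matters when a source's share would exceed its repeat limit.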

3. Front-loading Reasoning

The speaker argues that current pipelines, which treat reasoning as a post-hoc skill added during SFT or RL, create "unreasoning foundations."

Key Findings:
- Injecting reasoning data during pre-training yields a 16% gain measured immediately after pre-training.
- These gains are not "washed away" by SFT; in fact, they compound, resulting in a 9.3% improvement over models that did not see reasoning data during pre-training.
- Durable Advantage: Even when compute is doubled during SFT for a "no-reason" model, it cannot catch up to a model that was "reasoning-primed" during pre-training.

4. RLP: Reinforcement as a Pre-training Objective

RLP shifts the paradigm from "learning by observing" (next-token prediction) to "learning by thinking."

The Process:
1. Thought Policy: The model generates a reasoning trace before predicting the next token.
2. Information Gain Reward: A dense reward is calculated based on how much the reasoning trace improves the prediction probability compared to a "No-Think" baseline.
3. Exponential Moving Average (EMA): The "No-Think" baseline is updated with a lag to provide a stable comparison and prevent reward hacking.
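The reward and the lagged baseline in the steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names, the toy probabilities, and the `decay` values are hypothetical, and the "weights" are plain lists of floats standing in for model parameters.

```python
import math


def information_gain_reward(p_with_thought, p_no_think):
    """Reward = log P_theta(next | context, thought) - log P_phi(next | context).

    Positive when the reasoning trace made the true next token more likely
    than the No-Think baseline did, negative when it made it less likely.
    """
    return math.log(p_with_thought) - math.log(p_no_think)


def ema_update(baseline, policy, decay=0.999):
    """Move each baseline weight a small step toward the policy weight.

    The decay value is an assumption; a value close to 1 makes the
    baseline lag further behind the policy.
    """
    return [decay * b + (1.0 - decay) * p for b, p in zip(baseline, policy)]


# Toy probabilities for the true next token (made-up numbers).
helpful = information_gain_reward(p_with_thought=0.6, p_no_think=0.3)
harmful = information_gain_reward(p_with_thought=0.1, p_no_think=0.3)

# Toy "weights": the baseline moves only part of the way to the policy.
lagged = ema_update(baseline=[0.0, 1.0], policy=[1.0, 1.0], decay=0.9)

print(helpful > 0, harmful < 0, lagged)
```

Because the reward is a difference of log-probabilities, it is defined for every token position and needs no external verifier, which is the sense in which the talk calls it dense.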

Performance:
- RLP outperformed standard next-token prediction by 14%, even when the latter was trained on 35x more data (FLOP-matched).
- RLP scales effectively with model size and architecture (e.g., Mamba 2).
- Unlike RPT (Reinforcement Pre-training) or RLPT, RLP is verifier-free and uses dense rewards, allowing it to be applied to any token in a document without external filtering.

5. Notable Quotes

"The idea behind [front-loading reasoning] is that you will do well if you take those [AP] classes during school, then you'll not only do well in school, but you'll also do well in college."
"RLP produces an explicit reasoning trace before predicting the next token and this makes the 'why' of it very visible and trainable and not just the final answer."
"[RLP] suggests that even using unannotated text streams... you can still teach reasoning-like behavior while strengthening the foundation."

Synthesis and Conclusion

The presentation establishes that the future of LLM pre-training lies in algorithmic efficiency rather than just scaling data volume. By implementing a two-phase curriculum, front-loading reasoning data, and utilizing RLP to incentivize "thinking" during pre-training, models develop a more robust, durable reasoning foundation. These strategies allow models to achieve superior performance with fewer tokens, effectively bridging the gap between simple pattern matching and genuine reasoning.
