Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 13: Data (Sources, Datasets)
By Stanford Online
Key Concepts
- Data Pipeline Stages: Pre-training (raw web data), Mid-training (high-quality/synthetic data), and Post-training (instruction/chat/RLHF).
- Data Curation: The process of filtering, deduplication, and quality assessment, which is currently a major bottleneck in LLM development.
- Legal/Technical Restrictions: robots.txt, Terms of Service (ToS), rate limiting, and copyright law (Fair Use).
- Fair Use: A legal doctrine (Section 107 of the Copyright Act) that permits use of copyrighted material without a license, judged on four factors: the purpose of the use, the nature of the work, the amount used, and the effect on the market.
- Data Sources: Common Crawl, Wikipedia, GitHub, arXiv, and "Shadow Libraries" (e.g., Libgen, Books3).
- Quality Filtering: Rule-based filtering (e.g., C4) vs. model-based filtering (e.g., DCLM, Nemotron).
1. The Importance and Secrecy of Data
The instructor argues that data is the most critical component of language models. Companies like Meta (Llama 3) disclose architecture and training procedures but remain secretive about data due to:
- Competitive Advantage: Data is the "secret sauce."
- Copyright Liability: Fear of litigation regarding the use of proprietary or copyrighted content.
2. The Reality of "Training on the Internet"
Contrary to the popular belief that models are trained on the "entire internet," the process is constrained by:
- Technical Barriers: Dynamic web content (apps), authentication/paywalls, and bot-blocking mechanisms (Cloudflare, CAPTCHAs).
- Legal/Ethical Barriers: robots.txt (not legally binding, but a standard "good citizen" convention that well-behaved crawlers honor) and increasingly restrictive Terms of Service that explicitly prohibit AI training.
- The "Consent in Crisis" Trend: Research shows a sharp increase in websites blocking crawlers since mid-2023.
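As a minimal sketch of how a crawler might honor robots.txt, the example below uses Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleAIBot`, and the URLs are hypothetical and chosen only for illustration. It parses the rules from a string rather than fetching them, so it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler entirely and keeps
# /private/ off limits for every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network call."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The named AI crawler is blocked everywhere.
ai_bot_allowed = parser.can_fetch("ExampleAIBot", "https://example.com/articles/1")
# Other crawlers fall through to the wildcard rules.
other_bot_allowed = parser.can_fetch("OtherBot", "https://example.com/articles/1")
other_bot_private = parser.can_fetch("OtherBot", "https://example.com/private/x")

print(ai_bot_allowed, other_bot_allowed, other_bot_private)  # False True False
```

Nothing enforces this check: a crawler that skips it can still request the page, which is why robots.txt works only as a convention.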
3. Copyright and Legal Frameworks
- Intellectual Property: Copyright protects the expression of an idea, not the idea itself.
- Fair Use: The primary defense for model developers. A key precedent is Authors Guild v. Google (Google Books), which established that providing snippets of copyrighted works can be transformative.
- Recent Litigation:
  - Anthropic Case: The court ruled that training on books was fair use, but that acquiring the books through piracy (unauthorized downloading) was illegal.
  - Meta/Llama Case: Faces similar legal challenges over the use of copyrighted books in training sets.
4. Data Processing Methodologies
The lecture outlines the evolution of data curation:
- Rule-Based Filtering (e.g., C4): Using heuristics (e.g., "must end in punctuation," "remove bad words," "remove curly braces") to clean raw web crawls.
- Model-Based Filtering (e.g., DCLM, Nemotron): Training a classifier to identify "high-quality" data.
  - DCLM: Uses a linear classifier trained on high-quality instruction data (e.g., Hermes) to filter massive web dumps.
  - Nemotron: Uses LLMs to score documents for "educational value" and uses LLMs to rephrase low-quality content into synthetic data.
- Linearization: Converting non-linear data (like GitHub pull requests, issues, and comments) into a structured, sequential format for training.
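The rule-based heuristics listed above can be sketched as a small page filter. This is an illustration in the spirit of C4, not the actual C4 pipeline: the blocklist entry, the word threshold, and the function name `clean_page` are assumptions made for the example.

```python
from typing import Optional

# Illustrative settings; these are assumptions, not the values C4 used.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
BLOCKLIST = {"badword"}  # hypothetical placeholder for a real bad-word list
MIN_WORDS_PER_LINE = 3


def clean_page(text: str) -> Optional[str]:
    """Return the cleaned page text, or None if the whole page is dropped."""
    # Page-level rule: curly braces suggest code or templates, so drop the page.
    if "{" in text or "}" in text:
        return None

    # Page-level rule: drop the page if any word is on the blocklist.
    words = {w.strip('.,!?";:()').lower() for w in text.split()}
    if words & BLOCKLIST:
        return None

    # Line-level rules: keep only lines that look like full sentences.
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # menus, buttons, and headings rarely end in punctuation
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue  # very short lines are usually boilerplate
        kept.append(line)

    return "\n".join(kept) if kept else None


page = (
    "Home | About | Contact\n"
    "The quick brown fox jumps over the lazy dog.\n"
    "Short one.\n"
    "Read more"
)
print(clean_page(page))  # The quick brown fox jumps over the lazy dog.
```

Rules like these are cheap and transparent, but they also discard useful text: the curly-brace rule, for instance, removes most pages that contain code.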
5. Notable Sources and Case Studies
- Wikipedia: A gold standard for quality; often used as a reference for training quality classifiers.
- GitHub: Essential for reasoning capabilities. Developers must filter for permissive licenses (MIT/Apache) and remove malware or bot-generated noise.
- Shadow Libraries (e.g., Books3): Previously used for training but now largely avoided due to legal risks and copyright infringement.
- Common Pile: A project attempting to build a high-performing model using only permissively licensed data. It shows that this is possible, but that it is difficult to compete with models trained on broader, unlicensed datasets.
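The license filtering mentioned for GitHub can be sketched as a simple allowlist check over repository metadata. The record layout (a `name` and a `license` field holding an SPDX-style identifier) and the function names are assumptions for illustration. A real pipeline would also need to detect licenses from repository files and handle repositories with mixed licensing.

```python
from typing import Optional

# Allowlist limited to the two license families named above (MIT/Apache).
PERMISSIVE_LICENSES = frozenset({"mit", "apache-2.0"})


def is_permissive(license_id: Optional[str]) -> bool:
    """True only when a license is present and on the allowlist."""
    return license_id is not None and license_id.strip().lower() in PERMISSIVE_LICENSES


def keep_permissive(repos: list) -> list:
    """Keep repository records whose license is on the allowlist."""
    return [repo for repo in repos if is_permissive(repo.get("license"))]


# Hypothetical repository metadata.
repos = [
    {"name": "alpha", "license": "MIT"},
    {"name": "beta", "license": "GPL-3.0"},
    {"name": "gamma", "license": None},  # no license detected: excluded
    {"name": "delta", "license": "Apache-2.0"},
]

kept_names = [repo["name"] for repo in keep_permissive(repos)]
print(kept_names)  # ['alpha', 'delta']
```

Treating a missing license as "not permissive" is the conservative choice here, since code without a license grants no reuse rights by default.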
6. Synthesis and Conclusion
Data processing is currently "based on vibes": a mix of heuristics, classifiers, and trial-and-error. The field is shifting from "more data" to "higher quality data." The main takeaway is that while architecture is becoming commoditized, the ability to curate, filter, and legally navigate the data landscape remains the primary differentiator for state-of-the-art language models. Future research should focus on more principled, less arbitrary methods for data selection and quality assessment.