Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 13: Data 1

Key Concepts

  • Pre-training: Training on raw, often web-scraped data.
  • Mid-training: Curating a smaller, high-quality dataset for specific capabilities.
  • Post-training: Fine-tuning for instruction following, chat, or safety.
  • Base Model: Checkpoint after pre-training and mid-training.
  • Instruction Model: Model after post-training.
  • Common Crawl: Monthly web crawl, an academic approximation of the internet.
  • Data Poisoning: Injecting malicious edits into training data.
  • Quality Filtering: Using heuristics or models to select high-quality data.
  • Deduplication: Removing duplicate content from the dataset.
  • Fair Use: Using copyrighted material without a license under certain conditions.
  • Long Context Extension: Training models to handle longer sequences of text.
  • Synthetic Data: Data generated by language models for training.

Pre-training Data

Importance of Data

  • Data is arguably the most important factor in language model performance.
  • Companies are secretive about their training data due to competitive dynamics and legal concerns.
  • Data curation and cleaning are crucial, even though pre-training involves little manual annotation.
  • Data work is highly scalable and can involve large teams focusing on different aspects like multilinguality and code.

Stages of Training

  • Pre-training: Training on raw data from the web.
  • Mid-training: Curating high-quality data for specific capabilities (math, code, long context).
  • Post-training: Fine-tuning on instruction following data, chat data, or using reinforcement learning for safety and conversational ability.

Example Data Mixes

  • AI2's OLMo model:
    • Pre-training: Web pages (DCLM baseline), code, academic papers, math, Wikipedia (3.9 trillion tokens); a toy mixture-sampling sketch follows this list.
    • Mid-training: Filtered DCLM baseline, FLAN datasets, Wikipedia, synthetically generated data, GSM8K training set (10 billion tokens).
    • Post-training: Chat data from various sources, synthetically generated data.
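
To make the idea of a data mix concrete, here is a minimal sketch of sampling training documents by source according to fixed mixture weights. The source names and weights below are illustrative placeholders, not the actual OLMo proportions.

```python
import random

# Hypothetical mixture weights (illustrative only, not real OLMo numbers).
MIXTURE_WEIGHTS = {
    "web_dclm_baseline": 0.70,
    "code": 0.10,
    "academic_papers": 0.10,
    "math": 0.05,
    "wikipedia": 0.05,
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source according to the mixture weights."""
    sources = list(MIXTURE_WEIGHTS)
    weights = list(MIXTURE_WEIGHTS.values())
    return rng.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {s: 0 for s in MIXTURE_WEIGHTS}
    for _ in range(10_000):
        counts[sample_source(rng)] += 1
    print(counts)  # counts are roughly proportional to the weights
```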

Early Data Sets (2018-2019)

  • BERT: Trained on books (Smashwords corpus) and Wikipedia.
    • Books corpus: Scraped from Smashwords, consisting of self-published ebooks.
    • Wikipedia: A collaborative encyclopedia that, by policy, contains no original research and relies on citations and notability.
    • Data Poisoning: Vulnerability where malicious edits can be injected into Wikipedia dumps.
  • GPT-2: Used WebText, a dataset of pages linked from Reddit posts with at least 3 karma (a toy version of this link filter is sketched below).
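
A toy version of the WebText link-selection rule might look like the following. The `posts` input is a hypothetical stand-in for a dump of Reddit submissions; the real pipeline also fetched the linked pages and cleaned the text.

```python
def filter_webtext_links(posts):
    """Keep outbound links from posts with at least 3 karma, deduplicated.

    `posts` is an iterable of (url, karma) pairs -- a toy stand-in for a
    Reddit submissions dump, not the original WebText pipeline.
    """
    seen = set()
    for url, karma in posts:
        if karma >= 3 and url not in seen:
            seen.add(url)
            yield url

posts = [
    ("https://example.com/a", 5),
    ("https://example.com/b", 1),   # dropped: too little karma
    ("https://example.com/a", 12),  # dropped: duplicate URL
]
print(list(filter_webtext_links(posts)))  # ['https://example.com/a']
```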

Common Crawl and Filtering

  • Common Crawl: Established in 2007, monthly web crawl.
  • Crawling process: Uses a BFS approach with seed URLs, respecting robots.txt.
  • Data formats: WARC (raw HTTP response) and WET (text extracted from HTML).
  • HTML to text conversion: Extracting text from the raw WARC files with tools like Trafilatura can yield better data than the pre-extracted WET files (a sketch follows this list).
  • CCNet (Meta): Filters Common Crawl using language identification and a 5-gram language model trained on Wikipedia to identify documents that resemble high-quality text.
  • C4 (Google): Colossal Clean Crawled Corpus, filtered using heuristics (terminal punctuation, sentence length, a bad-words list, removal of pages containing curly braces).
  • Trade-off: Model-based filtering (CCNet) vs. rule-based filtering (C4). Model-based is limited by the quality of positive examples, while rule-based can include spam or exclude valuable content.
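
The sketch below combines the two steps discussed above: pulling text out of WARC records with Trafilatura and then applying rough C4-style heuristics. The filename, thresholds, and rules are illustrative approximations, not the exact C4 recipe.

```python
# pip install warcio trafilatura
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def iter_warc_text(warc_path):
    """Yield extracted plain text for each HTTP response record in a WARC file."""
    with open(warc_path, "rb") as f:
        for record in ArchiveIterator(f):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read().decode("utf-8", errors="replace")
            text = trafilatura.extract(html)  # returns None if extraction fails
            if text:
                yield text

TERMINAL_PUNCT = (".", "!", "?", '"')

def c4_style_filter(text, min_words=50):
    """Rough approximation of C4-style heuristics (illustrative thresholds):
    keep only lines ending in terminal punctuation, drop documents containing
    curly braces (a proxy for code), and drop documents that end up too short."""
    if "{" in text or "}" in text:
        return None
    kept = [ln.strip() for ln in text.splitlines() if ln.strip().endswith(TERMINAL_PUNCT)]
    cleaned = "\n".join(kept)
    if len(cleaned.split()) < min_words:
        return None
    return cleaned

# Usage (hypothetical file name):
# for doc in iter_warc_text("CC-MAIN-example.warc.gz"):
#     filtered = c4_style_filter(doc)
```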

GPT-3 Era (2020)

  • GPT-3 dataset: Filtered Common Crawl, WebText2, two books corpora, and Wikipedia (roughly 400 billion tokens).
  • Quality classification: Trained a classifier to separate high-quality sources (WebText, Wikipedia, books) from the rest of Common Crawl (a fastText-style sketch follows this list).
  • The Pile (EleutherAI): Curated 22 high-quality domains (Common Crawl, OpenWebText, Stack Exchange, Wikipedia, etc.).
  • Project Gutenberg: Collection of public domain books (75,000 books).
  • Books3: Books from a shadow library (taken down due to copyright infringement).
  • Stack Exchange: Collection of Q&A sites (Stack Overflow, etc.).
  • GitHub: Source of code for language model training.
  • The Stack: An open dataset of source code collected from GitHub (3.1 terabytes of code).
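
In the spirit of the GPT-3 quality classifier (and the later DCLM classifier), here is a hedged sketch that uses fastText to score documents against curated positives. fastText is a stand-in here, and the training file name, hyperparameters, and threshold are assumptions for illustration.

```python
# pip install fasttext
import fasttext

# Hypothetical training file with one labeled example per line, e.g.:
#   __label__hq <text drawn from WebText / Wikipedia / books>
#   __label__lq <text drawn from raw Common Crawl>
model = fasttext.train_supervised("quality_train.txt", dim=64, epoch=5)

def quality_score(document: str) -> float:
    """Return the classifier's probability that a document is high quality."""
    # fastText predicts on a single line, so strip newlines first.
    labels, probs = model.predict(document.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

# Keep documents whose score clears a tunable threshold (0.5 is arbitrary).
if quality_score("Some candidate web page text ...") > 0.5:
    print("keep")
```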

Later Data Sets (2021-Present)

  • Gopher (DeepMind): Trained on the MassiveText dataset (MassiveWeb, C4, books, news, GitHub, Wikipedia).
  • MassiveWeb: Used manual rules for quality filtering and Google Safe Search for toxicity filtering.
  • Llama (Meta): Common Crawl processed with CCNet, C4, GitHub, Wikipedia, Project Gutenberg, Books3, arXiv, Stack Exchange (1.2 trillion tokens).
  • RedPajama: Open-source reproduction of the Llama dataset.
  • RefinedWeb (Falcon): Argues that carefully filtered and deduplicated web data alone can rival curated data mixes.
  • FineWeb from Hugging Face: Improved version of RefinedWeb, using all Common Crawl dumps and manual rules.
  • OLMo (AI2): Trained on the Dolma dataset (Common Crawl, The Stack, C4, Reddit, Semantic Scholar, Project Gutenberg, Wikipedia).
  • DataComp-LM (DCLM, a multi-institution collaboration): A competition for creating language-model training datasets from Common Crawl.
    • DCLM-Pool: 240 trillion tokens of raw Common Crawl text.
    • DCLM baseline: DCLM-Pool filtered with a quality classifier whose positive examples come from OpenHermes (GPT-4-generated instruction data) and the ELI5 subreddit.
  • Nemotron-CC (NVIDIA): Aims to expand the DCLM baseline with more tokens.
    • Used jusText for HTML-to-text conversion to retain more tokens.
    • Prompted a large language model to score documents by educational value (a scoring sketch follows this list).
    • Combined the DCLM classifier with other classifiers in an ensemble.
    • Rephrased low-quality documents and generated synthetic tasks from high-quality documents.
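
As referenced above, here is a minimal sketch of LLM-based educational-value scoring in the style of Nemotron-CC (and FineWeb-Edu). The prompt wording, score range, and the `llm` callable are hypothetical placeholders, not the actual pipeline.

```python
SCORING_PROMPT = """\
Rate the educational value of the following web page extract on a scale
from 0 (no educational value) to 5 (highly educational, textbook-like).
Respond with a single integer.

Extract:
{document}
"""

def score_document(document: str, llm) -> int:
    """Ask a language model for an educational-value score.

    `llm` is a hypothetical callable that takes a prompt string and returns
    the model's text response; plug in whatever inference client you use.
    """
    reply = llm(SCORING_PROMPT.format(document=document[:4000]))
    try:
        return max(0, min(5, int(reply.strip())))
    except ValueError:
        return 0  # unparseable reply -> treat as lowest score

# Documents scoring above some threshold (say 3) would be routed to the
# high-quality bucket; the rest become candidates for rephrasing or removal.
```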

Copyright

Copyright Law

  • Copyright law incentivizes the creation of intellectual goods.
  • Applies to original works of authorship fixed in a tangible medium of expression.
  • Copyright applies to expression, not ideas.
  • Registration is not required for copyright, but it is required before suing for infringement.
  • Copyright lasts a long time (in the US, generally the life of the author plus 70 years).
  • Most things on the internet are copyrighted.

Using Copyrighted Material

  • Licensing: Obtain a license from the copyright holder.
    • Creative Commons license: Allows free distribution of copyrighted work.
  • Fair Use: Use copyrighted material without a license under certain conditions.
    • Factors: Purpose and character of the use, nature of the work, amount used, effect on the market.
  • Training ML models can be argued as transformative, but models can memorize and affect the market.
  • Terms of use can restrict data access even with a license or fair use claim.

Mid-training and Post-training Data

Long Context Extension

  • Books and math are used to create training data with long-range dependencies (a minimal sequence-packing sketch follows).
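
A minimal sketch of how long documents such as books can be packed into fixed-length sequences for long-context training. Real recipes also adjust positional-encoding scaling and training schedules, which are omitted here; the sequence length and end-of-text id are arbitrary.

```python
from typing import Iterable, Iterator

def pack_long_sequences(
    tokenized_docs: Iterable[list[int]],
    seq_len: int = 32_768,
    eot_id: int = 0,
) -> Iterator[list[int]]:
    """Concatenate tokenized documents (e.g., whole books) into fixed-length
    training sequences separated by an end-of-text token, so a single example
    contains long-range dependencies."""
    buffer: list[int] = []
    for tokens in tokenized_docs:
        buffer.extend(tokens)
        buffer.append(eot_id)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

# Usage with toy "documents" of token ids:
docs = [[1, 2, 3, 4, 5], [6, 7, 8], [9] * 10]
print(list(pack_long_sequences(docs, seq_len=8)))
```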

Instruction Following Data

  • Super-NaturalInstructions and FLAN: Convert traditional NLP benchmarks into a standard instruction format (see the template sketch after this list).
  • Synthetic data: Generated by language models using methods such as Self-Instruct, synthetic conversations, or Evol-Instruct.
  • OpenHermes: An aggregation of many different instruction datasets.
  • Llama 2 Chat: Used annotators to write high-quality instruction data.
  • Llama-Nemotron post-training data: Public datasets (e.g., WildChat) plus synthetically generated data.
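
The conversion from a traditional NLP example to an instruction-following example is essentially templating. The sketch below uses a single made-up sentiment template; FLAN and Super-NaturalInstructions use many templates across many tasks.

```python
# Hypothetical FLAN-style template for a sentiment-classification task.
TEMPLATE = (
    "Classify the sentiment of the following movie review as positive or "
    "negative.\n\nReview: {text}\n\nSentiment:"
)

def to_instruction_example(example: dict) -> dict:
    """`example` has keys 'text' and 'label' (0 = negative, 1 = positive)."""
    return {
        "prompt": TEMPLATE.format(text=example["text"]),
        "response": "positive" if example["label"] == 1 else "negative",
    }

print(to_instruction_example({"text": "A beautifully shot film.", "label": 1}))
```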

Synthetic Data Generation

  • GPT-4: Often used to generate synthetic data, but OpenAI's terms of use prohibit using its outputs to build datasets for training competing models.
  • Open-weight models: Come with more permissive licenses for distillation (a Self-Instruct-style generation sketch follows this list).
  • Annotators: Hire annotators to create high-quality instructions.
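
A simplified Self-Instruct-style generation loop, assuming access to an open-weight instruct model behind a hypothetical `llm(prompt) -> str` callable. The seed instructions, prompt, and parsing are illustrative only and much cruder than the original recipe.

```python
import random

SEED_INSTRUCTIONS = [
    "Explain the difference between a list and a tuple in Python.",
    "Summarize the plot of Romeo and Juliet in two sentences.",
    "Write a SQL query that counts orders per customer.",
]

GENERATION_PROMPT = """\
Here are some example instructions:
{examples}

Write one new, different instruction of similar style, then answer it.
Format:
Instruction: <instruction>
Response: <response>
"""

def generate_pair(llm, rng=None):
    """Generate one synthetic (instruction, response) pair, or None on failure."""
    rng = rng or random.Random(0)
    examples = "\n".join(f"- {i}" for i in rng.sample(SEED_INSTRUCTIONS, k=2))
    reply = llm(GENERATION_PROMPT.format(examples=examples))
    if "Instruction:" in reply and "Response:" in reply:
        instruction_part, response = reply.split("Response:", 1)
        instruction = instruction_part.split("Instruction:", 1)[1].strip()
        return {"instruction": instruction, "response": response.strip()}
    return None  # discard malformed generations
```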

Conclusion

  • Data curation is a crucial and labor-intensive process.
  • Data is a key differentiator for language models.
  • Legal and ethical issues (copyright) are significant.
  • The field is heuristic, with many opportunities for improvement.
