Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs

By AI Engineer

Key Concepts

  • Transformer Architecture: A deep learning model architecture based on self-attention mechanisms, used for sequence modeling.
  • Tokenization: The process of converting raw text into numerical representations (tokens) that a model can process.
  • Causal Self-Attention: A mechanism that lets the model weigh the importance of different tokens in a sequence. "Causal" means each position can attend only to itself and earlier tokens, never future ones, so the model can be trained to predict the next token.
  • Embeddings: Vector representations of tokens that capture semantic meaning.
  • Logits: The raw, unnormalized scores the model outputs for every token in the vocabulary. A softmax turns them into the probability distribution over the next token.
  • Cross-Entropy Loss: The training loss. It is the negative log of the probability the model assigns to the correct next token, so confident correct predictions score near 0 and confident wrong ones are penalized heavily.
  • Overfitting: A scenario where a model performs exceptionally well on training data but fails to generalize to unseen data (indicated by validation loss rising while training loss keeps falling).
  • Inference: The process of using a trained model to generate new text.
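  • Temperature & Top-K Sampling: Techniques used during inference to control the randomness of generated text. Temperature rescales the logits before the softmax (lower values make output more predictable), and Top-K restricts sampling to the k most likely tokens.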
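
As a rough illustration of how logits, cross-entropy, temperature, and Top-K relate to one another, the sketch below uses plain Python rather than the workshop's PyTorch code. The function names, the 4-token vocabulary, and the example scores are all made up for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Loss for one position: negative log-probability of the true next token."""
    return -math.log(softmax(logits)[target])

def top_k_filter(logits, k):
    """Keep the k largest logits, set the rest to -inf (probability 0).

    Assumes 1 <= k <= len(logits); ties at the cutoff are all kept.
    """
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else float("-inf") for x in logits]

def sample_next(logits, temperature=0.7, k=2, rng=random):
    """Draw one token id from the top-k, temperature-scaled distribution."""
    probs = softmax(top_k_filter(logits, k), temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for a 4-token vocabulary
probs = softmax(logits)
loss_if_target_is_0 = cross_entropy(logits, 0)  # model favours the right token
loss_if_target_is_3 = cross_entropy(logits, 3)  # model favours the wrong token
token = sample_next(logits, temperature=0.7, k=2, rng=random.Random(0))
```

With these scores the loss is smaller when the target is token 0 than when it is token 3, and sampling with k=2 can only ever return token 0 or 1. For reference, a model that spreads probability evenly over the 65 characters would have a loss of ln(65), about 4.17.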

1. Workshop Overview and Objectives

The workshop, led by Angelos from ElevenLabs, focuses on training a small-scale Large Language Model (LLM) from scratch using PyTorch. The goal is to provide a hands-on understanding of how research engineers design models, moving beyond pre-trained weights to build a functional system. The project uses a GPT-2-based architecture, chosen for its foundational simplicity and effectiveness.

2. Building Blocks of the Transformer

The model is constructed using four primary components:

  • Tokenizer: Converts text to integers. The workshop uses a character-level tokenizer (65 unique tokens) because it is simple and keeps the vocabulary, and with it the embedding and output layers, very small, which suits a small dataset.
  • Model Architecture: A causal decoder-only transformer. Key components include:
    • Multi-head Self-Attention: Allows the model to focus on different relationships between tokens (e.g., grammar, punctuation).
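    • MLP (Feed-Forward Network): Transforms each token's representation independently after attention has mixed in information from other tokens. The logits themselves come from a final linear layer applied after the last block.
    • Residual Connections: Each sub-layer's output is added back to its input, so every layer makes only an incremental change and training stays stable.
    • Layer Normalization: Rescales each token's activations to a consistent range so values do not explode as they pass through a deep stack of layers.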
  • Training Loop: The most critical phase, involving data loading, forward passes, loss calculation, and backpropagation.
  • Inference: The generation phase, where the model predicts the next token based on the learned probability distribution.
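
To make the causal part of self-attention concrete, here is a minimal single-head sketch in plain Python. It is not the workshop's code: a real model applies learned query, key, and value projections, uses several heads, and runs as batched PyTorch tensor operations. The function name and the tiny 3-token input are hypothetical.

```python
import math

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (one per token), each of length d.
    Position t may only attend to positions 0..t.
    """
    T, d = len(q), len(q[0])
    outputs, all_weights = [], []
    for t in range(T):
        # Score against past and current tokens only. Skipping future tokens
        # is equivalent to masking them with -inf before the softmax.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps] + [0.0] * (T - t - 1)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(weights[s] * v[s][j] for s in range(T))
               for j in range(len(v[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 3-token sequence with 2-dimensional vectors; q = k = v here
# only to keep the example short.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(x, x, x)
```

The first token can only see itself, so its attention weights are [1.0, 0.0, 0.0] and its output equals its own value vector. Every row of `weights` sums to 1 and has zeros in all future positions.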

3. Step-by-Step Methodology

  1. Environment Setup: Use Python 3.12, PyTorch, and uv for dependency management. Google Colab is recommended for access to free T4 GPUs.
  2. Data Preparation: The model is trained on a Shakespeare dataset (approx. 1 million characters). Data is split into training and validation sets.
  3. Model Configuration:
    • Block Size (Context Window): 256 tokens.
    • Embedding Dimension: 384.
    • Layers: 6.
    • Attention Heads: 6.
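    • Total Parameters: ~1.8 million per transformer block, which works out to roughly 10.7 million for the full 6-layer model, assuming standard GPT-2 blocks with a 4x MLP.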
  4. Training Process:
    • Learning Rate: Starts low (warm-up), rises to a peak, and then follows a cosine decay. The optimizer is AdamW.
    • Monitoring: Track training loss (should decrease) and validation loss (to detect overfitting).
  5. Inference: Use temperature (e.g., 0.7) and Top-K sampling to balance creativity and coherence, avoiding the "boring" nature of pure greedy decoding.
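
The warm-up and cosine-decay schedule from the training step can be sketched as a small function. The summary does not give the workshop's actual values, so `max_lr`, `min_lr`, `warmup_steps`, and `total_steps` below are placeholder assumptions, as is the function name.

```python
import math

def learning_rate(step, max_lr=1e-3, min_lr=1e-4,
                  warmup_steps=100, total_steps=5000):
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + cosine * (max_lr - min_lr)

schedule = [learning_rate(s) for s in range(5000)]
```

In a PyTorch training loop, one common way to apply such a schedule is to write the value into each entry of `optimizer.param_groups` under the `"lr"` key before calling `optimizer.step()`.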

4. Key Arguments and Insights

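  • Scaling: Increasing context length or model size is not just a matter of changing a config value. Memory use grows with both (attention scores grow quadratically with context length), so larger setups need architectural changes to avoid out-of-memory crashes.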
  • The Importance of Data: High-quality data is the primary driver of performance. For reasoning models, "Chain of Thought" data is essential and must be curated by experts (e.g., via Scale AI).
  • Training vs. Fine-tuning: Most modern LLM advancements (e.g., Gemini 3 to 3.1) are driven by smarter training methodologies and higher-quality post-training data rather than fundamental changes to the base transformer architecture.
  • Multimodality: Multimodal models (audio/video) often use specialized encoders to convert inputs into vector embeddings that the transformer can process, effectively treating different modalities as sequences of vectors.
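
The point about scaling can be checked with some back-of-the-envelope arithmetic on the configuration from section 3. The sketch below assumes a standard GPT-2-style block (4x MLP expansion, biases on every linear layer) and naive attention that materializes the full score matrix. Neither assumption is confirmed by the summary, and the helper names are hypothetical.

```python
def block_params(n_embd):
    """Weights and biases in one GPT-2-style transformer block (4x MLP assumed)."""
    attn = n_embd * 3 * n_embd + 3 * n_embd   # query, key, value projections
    attn += n_embd * n_embd + n_embd           # attention output projection
    mlp = n_embd * 4 * n_embd + 4 * n_embd     # expand to 4 * n_embd
    mlp += 4 * n_embd * n_embd + n_embd        # project back to n_embd
    layer_norms = 2 * (2 * n_embd)             # two LayerNorms, scale + shift
    return attn + mlp + layer_norms

def attention_score_entries(block_size, n_head, n_layer, batch_size=1):
    """Entries in the T x T attention score matrices over all heads and layers."""
    return batch_size * n_layer * n_head * block_size * block_size

n_embd, n_layer, n_head = 384, 6, 6            # values from section 3
per_block = block_params(n_embd)               # 1,774,464
all_blocks = n_layer * per_block               # 10,646,784, before embeddings
scores_256 = attention_score_entries(256, n_head, n_layer)
scores_512 = attention_score_entries(512, n_head, n_layer)
```

Under these assumptions, doubling the context window from 256 to 512 quadruples the number of attention scores per sequence, while the parameter count grows with the square of the embedding dimension. That is why a larger model or longer context cannot be reached by editing one number alone.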

5. Notable Quotes

  • "The pre-training is very similar... it's the fine-tuning and post-training... that actually makes the big difference in performances."
  • "If your validation loss is increasing while training loss decreases, that means you overfit."
  • "Reasoning is essentially just adding to the context of the model... that then the model can... attend to those reasoning tokens and get a better response."

6. Synthesis and Conclusion

The workshop demonstrates that building an LLM is fundamentally about managing matrix calculations and probability distributions. While large-scale models involve complex optimizations for scale and context, the core architecture remains consistent. The key takeaways are:

  • Start small: Character-level tokenization and small models are perfect for learning the mechanics.
  • Monitor metrics: Use validation loss to identify the "sweet spot" before overfitting occurs.
  • Iterate: Once the base model works, improvements come from better data, optimized training loops, and refined inference techniques (temperature/sampling).
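
One simple way to find the "sweet spot" is to remember the step with the lowest validation loss and stop once it has not improved for a few evaluations. The sketch below illustrates that idea with made-up loss values. The function name and the `patience` parameter are assumptions, not something the workshop is known to use.

```python
def best_checkpoint(val_losses, patience=3):
    """Return (best_step, stop_step) using simple early stopping.

    Training is treated as finished once validation loss has failed to
    improve for `patience` consecutive evaluations.
    """
    best_step, best_loss, bad_evals = 0, float("inf"), 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_step, best_loss, bad_evals = step, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step
    return best_step, len(val_losses) - 1

# Made-up curves: training loss keeps falling, validation loss turns upward.
train = [4.2, 3.1, 2.4, 1.9, 1.6, 1.4, 1.2, 1.0]
val = [4.2, 3.2, 2.6, 2.2, 2.1, 2.15, 2.3, 2.5]
best_step, stop_step = best_checkpoint(val)
```

Here validation loss bottoms out at evaluation 4 while training loss is still dropping, so the weights saved at that point are the ones to keep.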
