Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs
By AI Engineer
Key Concepts
- Transformer Architecture: A deep learning model architecture based on self-attention mechanisms, used for sequence modeling.
- Tokenization: The process of converting raw text into numerical representations (tokens) that a model can process.
- Causal Self-Attention: A mechanism allowing the model to weigh the importance of different tokens in a sequence, specifically looking at past tokens to predict the next one.
- Embeddings: Vector representations of tokens that capture semantic meaning.
- Logits: The raw, unnormalized output scores from the model; applying softmax converts them into a probability distribution over the next token.
- Cross-Entropy Loss: A loss function that compares the model's predicted probability distribution with the correct answer; for language modeling, it is the negative log-probability the model assigns to the true next token.
- Overfitting: A scenario where a model performs exceptionally well on training data but fails to generalize to unseen data (indicated by rising validation loss).
- Inference: The process of using a trained model to generate new text.
- Temperature & Top-K Sampling: Techniques used during inference to control the randomness and creativity of generated text.
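The causal self-attention entry above can be illustrated with a small sketch. It is written in plain Python rather than PyTorch so that it runs without dependencies, it handles a single head only, and for simplicity it feeds the same vectors in as queries, keys, and values (in a real model these come from learned projections). The function names and input values are illustrative assumptions, not code from the workshop.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats).
    Position t may only attend to positions 0..t.
    """
    seq_len, dim = len(q), len(q[0])
    outputs, weights = [], []
    for t in range(seq_len):
        # Score the current query against past and current keys only.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        # Future positions are masked out, so their weight is exactly 0.
        weights.append(w + [0.0] * (seq_len - t - 1))
        # Output is the attention-weighted average of the visible values.
        outputs.append(
            [sum(w[s] * v[s][j] for s in range(t + 1)) for j in range(len(v[0]))]
        )
    return outputs, weights


# Hypothetical toy input: three tokens with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_self_attention(x, x, x)
for row in attn:
    print([round(w, 3) for w in row])
```

Each printed row is the attention distribution for one position. The zeros to the right of the diagonal are the causal mask: no token can look at tokens that come after it.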
1. Workshop Overview and Objectives
The workshop, led by Angelos from ElevenLabs, focuses on training a small-scale Large Language Model (LLM) from scratch using PyTorch. The goal is to provide a hands-on understanding of how research engineers design models, moving beyond pre-trained weights to build a functional system. The project uses a GPT-2-based architecture, chosen for its foundational simplicity and effectiveness.
2. Building Blocks of the Transformer
The model is constructed using four primary components:
- Tokenizer: Converts text to integers. The workshop uses a character-level tokenizer (65 unique tokens) because it is computationally efficient for small datasets.
- Model Architecture: A causal decoder-only transformer. Key components include:
  - Multi-head Self-Attention: Allows the model to focus on different relationships between tokens (e.g., grammar, punctuation).
  - MLP (Feed-Forward Network): Further processes the relationships identified by attention; a final output layer then turns the result into logits.
  - Residual Connections: Added to stabilize training by ensuring each layer only makes incremental changes to the input.
  - Layer Normalization: Keeps activation values in a stable range so they do not explode as they pass through many layers.
- Training Loop: The most critical phase, involving data loading, forward passes, loss calculation, and backpropagation.
- Inference: The generation phase, where the model predicts the next token based on the learned probability distribution.
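The tokenizer component above is simple enough to sketch in a few lines. This is a minimal illustration, assuming the vocabulary is just the sorted set of characters found in the training text; the helper names and the sample string are hypothetical. On the full Shakespeare dataset the same procedure would yield the 65-token vocabulary mentioned above, while the toy string here produces a much smaller one.

```python
def build_vocab(text):
    """Map every distinct character to an integer id, and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(s, stoi):
    """Convert a string into a list of token ids."""
    return [stoi[ch] for ch in s]


def decode(ids, itos):
    """Convert a list of token ids back into a string."""
    return "".join(itos[i] for i in ids)


# Hypothetical stand-in for the training corpus.
sample_text = "To be, or not to be"
stoi, itos = build_vocab(sample_text)

ids = encode("not to be", stoi)
print(len(stoi), ids)
print(decode(ids, itos))
```

Note that `encode` raises a `KeyError` for any character that never appeared in the text used to build the vocabulary, which is one reason the vocabulary is built from the whole dataset.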
3. Step-by-Step Methodology
- Environment Setup: Use Python 3.12, PyTorch, and uv for dependency management. Google Colab is recommended for access to free T4 GPUs.
- Data Preparation: The model is trained on a Shakespeare dataset (approx. 1 million characters). Data is split into training and validation sets.
- Model Configuration:
  - Block Size (Context Window): 256 tokens.
  - Embedding Dimension: 384.
  - Layers: 6.
  - Attention Heads: 6.
  - Total Parameters: ~10.8 million for a standard GPT-2-style layout with these settings (roughly 1.8 million per transformer block).
- Training Process:
  - Learning Rate: Starts low (warm-up), rises to a peak, and then follows a cosine decay schedule. The optimizer is AdamW.
  - Monitoring: Track training loss (should decrease) and validation loss (to detect overfitting).
- Inference: Use temperature (e.g., 0.7) and Top-K sampling to balance creativity and coherence, avoiding the "boring" nature of pure greedy decoding.
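The warm-up and cosine decay schedule described in the training process can be written as a small function of the step number. This is a sketch under assumed values: the peak and minimum learning rates, the number of warm-up steps, and the total step count are all hypothetical, since the summary does not state them.

```python
import math


def lr_at(step, max_lr=1e-3, min_lr=1e-4, warmup_steps=100, max_steps=5000):
    """Learning rate for a given step: linear warm-up, then cosine decay.

    All default values are illustrative assumptions.
    """
    # Phase 1: ramp up linearly from near zero to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Phase 3: after the schedule ends, hold at the minimum.
    if step >= max_steps:
        return min_lr
    # Phase 2: cosine decay from max_lr down to min_lr.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)


for step in (0, 50, 100, 2550, 5000):
    print(step, f"{lr_at(step):.6f}")
```

In a PyTorch training loop, a value like this would typically be written into each entry of the optimizer's `param_groups` before every optimizer step.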
4. Key Arguments and Insights
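- Scaling: Increasing context length or model size is not just a matter of changing a variable. Memory use grows quickly (the attention score matrix grows with the square of the context length), so scaling up requires architectural changes to prevent out-of-memory crashes.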
- The Importance of Data: High-quality data is the primary driver of performance. For reasoning models, "Chain of Thought" data is essential and must be curated by experts (e.g., via Scale AI).
- Training vs. Fine-tuning: Most modern LLM advancements (e.g., Gemini 3 to 3.1) are driven by smarter training methodologies and higher-quality post-training data rather than fundamental changes to the base transformer architecture.
- Multimodality: Multimodal models (audio/video) often use specialized encoders to convert inputs into vector embeddings that the transformer can process, effectively treating different modalities as sequences of vectors.
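The scaling point above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a standard GPT-2-style layout (biases in every linear layer, two layer norms per block, a 4x-wide MLP, and an output head that shares weights with the token embedding); the function names are hypothetical. With the workshop configuration from section 3 (65 tokens, block size 256, embedding dimension 384, 6 layers), these assumptions give about 1.77 million parameters per block and about 10.8 million in total.

```python
def gpt2_style_param_count(vocab_size, block_size, n_embd, n_layer):
    """Return (parameters per block, total parameters).

    Assumes a GPT-2-style layout with biases and a weight-tied output head.
    """
    d = n_embd
    attn = (d * 3 * d + 3 * d) + (d * d + d)     # qkv projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand to 4d, project back to d
    norms = 2 * (2 * d)                          # two layer norms, weight and bias each
    per_block = attn + mlp + norms
    embeddings = vocab_size * d + block_size * d  # token + position embeddings
    final_norm = 2 * d
    return per_block, embeddings + n_layer * per_block + final_norm


def attention_score_entries(batch_size, n_head, block_size):
    """Number of entries in the attention score matrices of one layer."""
    return batch_size * n_head * block_size * block_size


per_block, total = gpt2_style_param_count(65, 256, 384, 6)
print(per_block, total)

# Doubling the context length quadruples the attention score entries.
small = attention_score_entries(batch_size=1, n_head=6, block_size=256)
large = attention_score_entries(batch_size=1, n_head=6, block_size=512)
print(large // small)
```

The parameter count does not depend much on context length, but activation memory does, which is why a longer context can crash training even when the model itself still fits.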
5. Notable Quotes
- "The pre-training is very similar... it's the fine-tuning and post-training... that actually makes the big difference in performances."
- "If your validation loss is increasing while training loss decreases, that means you overfit."
- "Reasoning is essentially just adding to the context of the model... that then the model can... attend to those reasoning tokens and get a better response."
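The overfitting check described in the second quote can be sketched as a simple rule over recorded validation losses. This is an illustrative assumption rather than the workshop's code: the function name, the `patience` parameter, and the loss values are all hypothetical.

```python
def find_best_checkpoint(val_losses, patience=3):
    """Return (index of lowest validation loss, whether training should stop).

    Training is flagged to stop once the validation loss has failed to
    improve for `patience` consecutive evaluations.
    """
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            return best_idx, True
    return best_idx, False


# Hypothetical curves: training loss keeps falling, validation loss turns upward.
train_losses = [2.5, 1.9, 1.5, 1.2, 1.0, 0.8, 0.6]
val_losses = [2.6, 2.0, 1.7, 1.6, 1.65, 1.75, 1.9]

best_idx, should_stop = find_best_checkpoint(val_losses)
print(best_idx, should_stop)
```

The returned index marks the checkpoint worth keeping: the last evaluation before validation loss started to climb.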
6. Synthesis and Conclusion
The workshop demonstrates that building an LLM is fundamentally about managing matrix calculations and probability distributions. While large-scale models involve complex optimizations for scale and context, the core architecture remains consistent. The key takeaways are:
- Start small: Character-level tokenization and small models are perfect for learning the mechanics.
- Monitor metrics: Use validation loss to identify the "sweet spot" before overfitting occurs.
- Iterate: Once the base model works, improvements come from better data, optimized training loops, and refined inference techniques (temperature/sampling).
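The temperature and top-k techniques mentioned above can be sketched as follows. The example uses plain Python and a made-up list of logits; the temperature of 0.7 comes from the workshop summary, while the `top_k` value and the function name are hypothetical.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=3, rng=random):
    """Sample a token id from logits using temperature and top-k filtering.

    Returns the chosen id and the probabilities of the surviving candidates.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Top-k: keep only the k highest-scoring token ids.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax over the surviving candidates.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    choice = rng.choices(top, weights=probs, k=1)[0]
    return choice, dict(zip(top, probs))


# Hypothetical logits for a 5-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0, 3.0]
token, probs = sample_next_token(logits, temperature=0.7, top_k=2,
                                 rng=random.Random(0))
print(token, probs)
```

Greedy decoding corresponds to always picking the highest-probability token; sampling from the filtered distribution instead is what adds variety to the generated text.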