Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs

By AI Engineer

Key Concepts

Transformer Architecture: A deep learning model architecture based on self-attention mechanisms, used for sequence modeling.
Tokenization: The process of converting raw text into numerical representations (tokens) that a model can process.
Causal Self-Attention: A mechanism allowing the model to weigh the importance of different tokens in a sequence, specifically looking at past tokens to predict the next one.
Embeddings: Vector representations of tokens that capture semantic meaning.
Logits: The raw, unnormalized output scores from the model, one per vocabulary token; applying softmax converts them into a probability distribution over the next token.
Cross-Entropy Loss: The loss function used for next-token prediction. It is the negative log of the probability the model assigns to the correct token, so it approaches zero as the model becomes confident and correct.
Overfitting: A scenario where a model performs exceptionally well on training data but fails to generalize to unseen data (indicated by rising validation loss).
Inference: The process of using a trained model to generate new text.
Temperature & Top-K Sampling: Techniques used during inference to control the randomness and creativity of generated text.
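
The relationship between logits, probabilities, and cross-entropy loss can be sketched in a few lines. This is a minimal illustration in plain Python rather than PyTorch, so it runs without dependencies; the logit values and the 4-token vocabulary are made up for the example.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical logits over a 4-token vocabulary; token 2 is the true next token.
logits = [1.0, 0.5, 3.0, -1.0]
probs = softmax(logits)
loss = cross_entropy(logits, target_index=2)

# With all logits equal the model is guessing uniformly, so the loss is ln(4).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)
```

By the same reasoning, an untrained model over a 65-token vocabulary should start near ln(65) ≈ 4.17, which is a useful sanity check on the first logged loss.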

1. Workshop Overview and Objectives

The workshop, led by Angelos Perivolaropoulos of ElevenLabs, focuses on training a small-scale Large Language Model (LLM) from scratch using PyTorch. The goal is to provide a hands-on understanding of how research engineers design models, moving beyond pre-trained weights to build a functional system. The project uses a GPT-2-based architecture, chosen for its foundational simplicity and effectiveness.

2. Building Blocks of the Transformer

The model is constructed using four primary components:

Tokenizer: Converts text to integers. The workshop uses a character-level tokenizer (65 unique tokens) because it is computationally efficient for small datasets.
Model Architecture: A causal decoder-only transformer. Key components include:
- Multi-head Self-Attention: Allows the model to focus on different relationships between tokens (e.g., grammar, punctuation).
- MLP (Feed-Forward Network): Transforms each token's representation after attention has mixed in context from earlier tokens; a final linear layer then maps the last block's output to logits.
- Residual Connections: Added to stabilize training by ensuring each layer only makes incremental changes to the input.
- Layer Normalization: Prevents activation values from exploding during deep network passes.
Training Loop: The most critical phase, involving data loading, forward passes, loss calculation, and backpropagation.
Inference: The generation phase, where the model predicts the next token based on the learned probability distribution.
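
To make the "causal" part of the architecture concrete, here is a minimal single-head sketch in plain Python. It is not the workshop's code: a real layer would use PyTorch tensors, several heads, and learned query/key/value projections, all of which are omitted here. The function name and toy input are assumptions for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_self_attention(q, k, v):
    """Single-head attention where position t only sees positions 0..t.

    q, k, v are lists of T vectors (lists of floats) of equal dimension d.
    Returns the attention weights and the weighted sums of the values.
    """
    T, d = len(q), len(q[0])
    weights, outputs = [], []
    for t in range(T):
        # Scaled dot-product scores against past and current positions only.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores) + [0.0] * (T - t - 1)  # future positions get zero weight
        weights.append(w)
        outputs.append([sum(w[s] * v[s][j] for s in range(T)) for j in range(d)])
    return weights, outputs

# Toy input: 3 positions, 2 dimensions. For simplicity q = k = v = x here;
# a real layer would apply learned linear projections first.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, attn_out = causal_self_attention(x, x, x)

# Residual connection: the block adds its output to the input instead of replacing it.
y = [[xi + ai for xi, ai in zip(x_row, a_row)] for x_row, a_row in zip(x, attn_out)]
```

The first position can only attend to itself, so its weight row is `[1.0, 0.0, 0.0]`. That lower-triangular pattern is what prevents the model from looking at the token it is trying to predict.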

3. Step-by-Step Methodology

Environment Setup: Use Python 3.12, PyTorch, and uv for dependency management. Google Colab is recommended for access to free T4 GPUs.
Data Preparation: The model is trained on a Shakespeare dataset (approx. 1 million characters). Data is split into training and validation sets.
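
A character-level tokenizer and a train/validation split take only a few lines. The sketch below uses a one-line stand-in string instead of the Shakespeare file, and the 90/10 split ratio is an assumption, since the ratio is not stated here.

```python
text = "To be, or not to be: that is the question."  # stand-in for the Shakespeare file

# Vocabulary: every distinct character, sorted for a stable id assignment.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

def encode(s):
    return [stoi[ch] for ch in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode(text)

# Assumed 90/10 split: the first 90% of tokens train the model, the rest validate it.
split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]
```

On the full dataset the same `sorted(set(text))` step would yield the 65-character vocabulary mentioned above.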
Model Configuration:
- Block Size (Context Window): 256 tokens.
- Embedding Dimension: 384.
- Layers: 6.
- Attention Heads: 6.
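- Total Parameters: reported as ~1.8 million, which matches the size of one standard GPT-2 block at this width; six such layers come to roughly 10.6 million.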
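
These settings could be collected in a small configuration object. The sketch below is illustrative: the class name `GPTConfig`, its field names, and the weight estimate (which assumes a standard GPT-2 block with a 4x-wide MLP and ignores biases, LayerNorm, and embeddings) are assumptions, not the workshop's code.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:  # hypothetical name
    block_size: int = 256   # context window in tokens
    vocab_size: int = 65    # character-level vocabulary
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384

    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the embedding.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        return self.n_embd // self.n_head

    def weights_per_layer(self) -> int:
        # Assumes a standard GPT-2 block: 4*d*d for attention (Q, K, V, output)
        # plus 8*d*d for an MLP with a 4x hidden width; biases and LayerNorm omitted.
        d = self.n_embd
        return 4 * d * d + 8 * d * d

config = GPTConfig()
```

With these values each head operates on 64 dimensions, and the estimate gives 1,769,472 weights per transformer block.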
Training Process:
- Optimizer and Learning Rate: AdamW is used with a learning-rate schedule that starts low (warm-up), rises to a peak, and then follows cosine decay.
- Monitoring: Track training loss (should decrease) and validation loss (to detect overfitting).
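
The warm-up and cosine-decay schedule can be written as a pure function of the step number. This is a sketch: the peak and minimum learning rates, the warm-up length, and the total step count are placeholder assumptions, not values from the workshop.

```python
import math

def lr_at(step, max_lr=1e-3, min_lr=1e-4, warmup_steps=100, total_steps=5000):
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)  # goes 0 -> 1
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))              # goes 1 -> 0
    return min_lr + cosine * (max_lr - min_lr)

schedule = [lr_at(s) for s in range(5000)]
```

In a PyTorch training loop, one common way to apply such a function is to write its result into each entry of `optimizer.param_groups` (as `group["lr"]`) before calling `optimizer.step()`.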
Inference: Use temperature (e.g., 0.7) and Top-K sampling to balance creativity and coherence, avoiding the "boring" nature of pure greedy decoding.
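
Temperature and Top-K sampling can be combined in one small function. The sketch below works on a plain list of logits; the function name, the example logits, and `top_k=3` are assumptions for illustration, while the temperature of 0.7 follows the example above.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=3, rng=random):
    """Sample a token id from logits using temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring token ids.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [logits[i] / temperature for i in top_ids]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(top_ids, weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0, -3.0]   # hypothetical scores over 5 tokens
rng = random.Random(0)                  # seeded for repeatability
samples = [sample_next_token(logits, temperature=0.7, top_k=3, rng=rng)
           for _ in range(1000)]
greedy = sample_next_token(logits, top_k=1)  # top_k=1 reduces to greedy decoding
```

Tokens outside the top three are never produced, and the highest-scoring token is chosen most often but not always, which is the balance between coherence and variety that the sampling settings control.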

4. Key Arguments and Insights

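Scaling Laws: Increasing context length or model size takes more than changing a configuration variable. Memory for standard attention grows quadratically with context length, so larger settings need architectural or implementation changes to avoid out-of-memory crashes.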
The Importance of Data: High-quality data is the primary driver of performance. For reasoning models, "Chain of Thought" data is essential and must be curated by experts (e.g., via Scale AI).
Training vs. Fine-tuning: Most modern LLM advancements (e.g., Gemini 3 to 3.1) are driven by smarter training methodologies and higher-quality post-training data rather than fundamental changes to the base transformer architecture.
Multimodality: Multimodal models (audio/video) often use specialized encoders to convert inputs into vector embeddings that the transformer can process, effectively treating different modalities as sequences of vectors.
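
The memory point in the first item above can be made concrete with rough arithmetic. The sketch assumes a naive implementation that stores one full T x T attention score matrix per head and per layer in 32-bit floats for a single sequence; the 8192-token context is a hypothetical comparison point, and fused attention kernels that avoid storing the full matrix would change the numbers.

```python
def attention_score_bytes(context_len, n_head, n_layer, bytes_per_value=4):
    """Bytes for one T x T score matrix per head, per layer, for one sequence."""
    return context_len * context_len * n_head * n_layer * bytes_per_value

small = attention_score_bytes(256, n_head=6, n_layer=6)    # workshop-sized context
large = attention_score_bytes(8192, n_head=6, n_layer=6)   # hypothetical 32x longer context
ratio = large // small
```

Under these assumptions a 32x longer context needs 1024x the memory for attention scores, going from about 9.4 MB to about 9.7 GB per sequence.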

5. Notable Quotes

"The pre-training is very similar... it's the fine-tuning and post-training... that actually makes the big difference in performances."
"If your validation loss is increasing while training loss decreases, that means you overfit."
"Reasoning is essentially just adding to the context of the model... that then the model can... attend to those reasoning tokens and get a better response."

6. Synthesis and Conclusion

The workshop demonstrates that building an LLM is fundamentally about managing matrix calculations and probability distributions. While large-scale models involve complex optimizations for scale and context, the core architecture remains consistent. The key takeaways are:

Start small: Character-level tokenization and small models are perfect for learning the mechanics.
Monitor metrics: Use validation loss to identify the "sweet spot" before overfitting occurs.
Iterate: Once the base model works, improvements come from better data, optimized training loops, and refined inference techniques (temperature/sampling).
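
Finding the validation-loss "sweet spot" can be automated from logged metrics. The sketch below uses made-up loss values and hypothetical helper names; it picks the evaluation with the lowest validation loss and flags the overfitting pattern of rising validation loss alongside falling training loss.

```python
def best_checkpoint(history):
    """Return (step, val_loss) for the evaluation with the lowest validation loss."""
    step, _, val = min(history, key=lambda row: row[2])
    return step, val

def is_overfitting(history, patience=2):
    """True if validation loss rose for `patience` consecutive evaluations
    while training loss kept falling."""
    if len(history) <= patience:
        return False
    recent = history[-(patience + 1):]
    val_rising = all(b[2] > a[2] for a, b in zip(recent, recent[1:]))
    train_falling = all(b[1] < a[1] for a, b in zip(recent, recent[1:]))
    return val_rising and train_falling

# Made-up (step, train_loss, val_loss) values for illustration only.
history = [
    (0,    4.20, 4.21),
    (500,  2.10, 2.15),
    (1000, 1.60, 1.72),
    (1500, 1.35, 1.58),
    (2000, 1.15, 1.61),
    (2500, 0.98, 1.70),
]
```

Here the checkpoint at step 1500 would be the one to keep, since later steps only improve the training loss.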
