
Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 5 - LLM tuning

By Unknown Author

LLM Training Reinforcement Learning Model Alignment Preference Tuning

Key Concepts

Pre-training: Initial training of an LLM on a massive dataset to learn language and code structure, resulting in a model that can predict the next token.
Fine-tuning (SFT - Supervised Fine-Tuning): Training a pre-trained model on a smaller, high-quality dataset for specific tasks, teaching it how to behave for those tasks.
Preference Tuning: A third stage of training to align LLM outputs with human preferences, focusing on tone, safety, and desired behavior rather than just factual correctness.
Preference Pairs: Datasets consisting of a prompt and two responses, one preferred and one not preferred.
Pointwise, Pairwise, Listwise Data: Different methods for collecting preference data, with pairwise being the most common due to ease of collection.
Reinforcement Learning from Human Feedback (RLHF): A framework for preference tuning that uses reinforcement learning.
- Agent: The LLM.
- Environment: The set of tokens in the vocabulary.
- State: The input received so far.
- Action: Predicting the next token.
- Policy: The LLM's probability distribution over next tokens.
- Reward: A signal indicating the quality of a generated response.
Reward Model (RM): A model trained to predict a score for a given prompt-response pair, indicating its quality.
Bradley-Terry Formulation: A mathematical model used to train the reward model, estimating the probability of one output being better than another.
Proximal Policy Optimization (PPO): An RL algorithm used for preference tuning that aims to maximize rewards while preventing the model from deviating too much from its base policy.
- Advantage: A measure of how much better an output is compared to an expected outcome.
- Value Function: A model that estimates the expected future reward from a given state.
- KL Divergence: A measure of the difference between two probability distributions, used in PPO to penalize large deviations from the reference policy.
PPO Clip: A variant of PPO that uses clipping to limit the magnitude of policy updates.
KL Penalty: A variant of PPO that uses KL divergence to penalize deviations from the reference policy.
Best of N (BoN): A simpler method that generates multiple responses and selects the best one based on a reward model, avoiding RL training but increasing inference cost.
Direct Preference Optimization (DPO): A more recent method that directly optimizes the LLM's weights using preference pairs, bypassing the need for a separate reward model and RL.

LLM Tuning: From Pre-training to Preference Alignment

This lecture covers how Large Language Models (LLMs) are aligned with human expectations, building on the previous lectures' treatment of pre-training and supervised fine-tuning (SFT). The primary focus is preference tuning, a third training stage that shapes qualities such as tone, safety, and helpfulness.

1. Recap of Previous Stages

Pre-training: This initial, computationally intensive phase involves training an LLM on vast amounts of text and code data. The goal is to teach the model the structure of language and code, enabling it to predict the next token. Techniques like data parallelism (e.g., ZeRO variants) and model parallelism are employed to manage the computational demands. The outcome is a model that excels at autocompletion but lacks specific task-oriented behavior or alignment with human preferences.
Fine-tuning (SFT): Following pre-training, SFT refines the model for specific tasks. This involves training on a smaller, high-quality dataset. The objective is to teach the model how to behave in desired ways, transforming it into, for instance, a helpful chat assistant. Parameter-efficient methods like LoRA (Low-Rank Adaptation) are discussed: the pre-trained weights stay frozen and only small low-rank matrices added alongside them are trained, which makes the process more efficient.
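
As a rough illustration of the LoRA idea in the recap, the sketch below adds a low-rank update B(Ax) to the output of a frozen weight matrix W. The function names, the `scale` parameter, and the tiny matrices are assumptions made for this example, not details from the lecture.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Output of a frozen layer W plus a trainable low-rank update.

    W: d_out x d_in (frozen), A: r x d_in, B: d_out x r (A and B are trained).
    `scale` is an illustrative scaling factor for the update.
    """
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank path: project down to r, then up
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_in, d_out, r):
    """Return (parameters in the full matrix, parameters in A and B)."""
    return d_in * d_out, r * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
A = [[1.0, 1.0]]              # rank r = 1
B = [[1.0], [2.0]]
print(lora_forward(W, A, B, [1.0, 2.0]))
print(lora_param_counts(4096, 4096, 8))
```

With a rank of 8 on a 4096 x 4096 layer, the trainable matrices hold 65,536 parameters against 16,777,216 in the full matrix, which is where the efficiency gain comes from.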

2. The Need for Preference Tuning

While SFT teaches a model what to generate, it doesn't explicitly teach it what not to generate or how to generate it in a preferred manner. Preference tuning addresses this by aligning the model with human preferences, which can encompass aspects like tone (e.g., friendliness) and safety.

Example: An SFT model might correctly suggest activities with a teddy bear but in an unappealing or overly blunt tone. Preference tuning aims to refine this to a more gentle and engaging response.

Why a Third Step?

Data Collection Efficiency: Creating high-quality SFT datasets is time-consuming. Preference data, which involves comparing existing outputs, is often easier to collect. Humans find it simpler to judge which of two poems is better than to write a perfect poem from scratch.
Prompt Distribution Control: SFT data needs careful balancing of prompt distributions to avoid model bias. Adding examples to correct misbehavior in SFT can inadvertently skew the model. Preference tuning offers a way to address specific misbehaviors without drastically altering the prompt distribution.
Injecting Negative Signals: SFT primarily focuses on positive examples. Preference tuning allows for the explicit injection of negative signals, teaching the model what to avoid.

3. Data Collection for Preference Tuning

The foundation of preference tuning lies in collecting preference pairs.

Data Collection Methods:
- Pointwise: Scoring individual responses (difficult to scale and be consistent).
- Pairwise: Comparing two responses and selecting the better one (most common due to ease).
- Listwise: Ranking a list of responses (more complex than pairwise).
Generating Preference Pairs:
1. Prompt Generation: Prompts can be sourced from user logs or desired prompt sets.
2. Response Generation: The prompt is fed to the LLM, often with a positive temperature to generate diverse responses.
3. Comparison: The responses generated for the same prompt are then compared.
  - Human Ratings: Direct human judgment.
  - LLM as a Judge: Using another LLM to evaluate responses.
  - Rule-based Metrics: Traditional metrics like BLEU or ROUGE (less common now).
  - Rating Scale: Either binary (a simple "better" or "worse" judgment) or nuanced ("much better," "slightly better," etc.), which is more complex to collect.
- Alternative Method: Identifying a disliked response in logs and rewriting it into a preferred version.
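
The generation steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `PreferencePair` record, the `judge` callback, and the length-based toy judge are all hypothetical stand-ins for a real rating process (human raters, an LLM judge, or a rule).

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # preferred response
    rejected: str  # dispreferred response


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a higher temperature flattens the
    distribution, which makes sampled responses more diverse."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def build_pair(prompt, response_a, response_b, judge):
    """Order two responses into a preference pair.

    judge(prompt, a, b) returns True when `a` is preferred over `b`.
    """
    if judge(prompt, response_a, response_b):
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)


# Toy judge for demonstration only: prefers the longer response.
toy_judge = lambda prompt, a, b: len(a) >= len(b)
pair = build_pair("Suggest a game.", "Chess.", "How about a round of chess?", toy_judge)
print(pair.chosen)
```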

4. Reinforcement Learning from Human Feedback (RLHF)

RLHF is a prominent framework for preference tuning, leveraging reinforcement learning principles.

RL Basics: An agent interacts with an environment, taking actions based on a policy to maximize cumulative rewards.
- Agent: The entity that makes decisions.
- Environment: What the agent interacts with.
- State: The situation the agent observes at a given step.
- Action: A choice the agent makes in a given state.
- Policy ($\pi_\theta$): Probability distribution over actions given a state.
- Reward: Feedback signal for actions.
Transposing RL to LLMs:
- Agent: LLM.
- Environment: The set of tokens in the vocabulary.
- State: Input so far.
- Action: Predicting the next token.
- Policy: The LLM's output probability distribution over next tokens.
- Reward: Derived from preference data.
RLHF Stages:
1. Reward Model Training: Train a model to distinguish good from bad outputs based on preference pairs.
2. RL Policy Optimization: Use the trained reward model to fine-tune the LLM policy.
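
To make the agent, state, action, and policy vocabulary concrete, here is a toy rollout loop. Everything in it is an assumption for illustration: the four-token `VOCAB`, the hand-written `toy_policy`, and the placeholder `toy_reward` stand in for a real LLM and a real reward model.

```python
import random

VOCAB = ["hello", "there", "friend", "<eos>"]


def toy_policy(state):
    """Hypothetical policy: probabilities over VOCAB given the tokens so far.
    The end-of-sequence token becomes more likely as the sequence grows."""
    p_eos = min(0.9, 0.2 * len(state))
    rest = (1.0 - p_eos) / (len(VOCAB) - 1)
    return [rest, rest, rest, p_eos]


def rollout(prompt_tokens, policy, max_new_tokens=10, seed=0):
    """Generate a completion token by token."""
    rng = random.Random(seed)
    state = list(prompt_tokens)  # state: the input so far
    for _ in range(max_new_tokens):
        probs = policy(state)    # policy: distribution over next tokens
        action = rng.choices(VOCAB, weights=probs, k=1)[0]  # action: next token
        state.append(action)     # the chosen token becomes part of the next state
        if action == "<eos>":
            break
    return state


def toy_reward(tokens):
    """Placeholder reward: 1.0 if the completion terminated, else 0.0.
    In RLHF this score would come from a trained reward model."""
    return 1.0 if tokens[-1] == "<eos>" else 0.0


completion = rollout(["hello"], toy_policy)
print(completion, toy_reward(completion))
```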

4.1. Reward Model Training

The goal is to create a model that assigns a score to a prompt-response pair, indicating its quality.

Bradley-Terry Formulation: The probability of output $y_i$ being better than $y_j$ for a prompt $x$ is given by $P(y_i \succ y_j) = \frac{\exp(s(x, y_i))}{\exp(s(x, y_i)) + \exp(s(x, y_j))}$, where $s(x, y)$ is the score assigned by the reward model. Dividing the numerator and denominator by $\exp(s(x, y_i))$ simplifies this to a sigmoid of the score difference: $P(y_i \succ y_j) = \sigma(s(x, y_i) - s(x, y_j))$. Training therefore pushes the reward model to assign higher scores to preferred outputs.
Loss Function: The training objective is to maximize the probability of observing the preference data. This leads to a loss function based on the negative log-likelihood of the preferred outcomes: $L = -E[\log(\sigma(s(x, y_w) - s(x, y_l)))]$ where $y_w$ is the winning (preferred) response and $y_l$ is the losing (dispreferred) response.
Model Architecture: The reward model can be an LLM with a classification head, or an encoder-only model like BERT.
Data Requirements: Typically tens of thousands of preference pairs are needed.
Human vs. AI Feedback: RLHF specifically refers to feedback from humans. Reinforcement learning from AI feedback (RLAIF) uses AI-generated preferences.
Reward Dimensions: Reward models can be trained for specific dimensions like usefulness, friendliness, or safety, or a holistic score.
Challenges: Human ratings can be noisy and depend heavily on clear guidelines. The reward model's scale might require normalization.
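
The Bradley-Terry probability and the resulting loss can be checked numerically with a short sketch. It assumes the reward model's scores are already available as plain floats; the function names are chosen for this example only.

```python
import math


def neg_log_sigmoid(z):
    """Compute -log(sigmoid(z)) in a numerically stable way."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def preference_probability(score_i, score_j):
    """Bradley-Terry probability that response i beats response j,
    written as a sigmoid of the score difference."""
    return math.exp(-neg_log_sigmoid(score_i - score_j))


def reward_model_loss(score_pairs):
    """Average negative log-likelihood over (winner score, loser score) pairs."""
    losses = [neg_log_sigmoid(s_w - s_l) for s_w, s_l in score_pairs]
    return sum(losses) / len(losses)


# The sigmoid form agrees with the exponential-ratio form.
ratio_form = math.exp(2.0) / (math.exp(2.0) + math.exp(0.5))
print(preference_probability(2.0, 0.5), ratio_form)

# A larger margin between winner and loser gives a smaller loss.
print(reward_model_loss([(1.0, 0.0)]), reward_model_loss([(3.0, 0.0)]))
```

Only the difference between the two scores enters the loss, which is one reason the absolute scale of the reward model is not pinned down and may need normalization.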

4.2. RL Policy Optimization (PPO)

This stage uses the trained reward model to update the LLM's policy.

General Recipe:
1. The LLM (policy) generates a completion for a prompt.
2. The prompt and completion are fed to the frozen reward model to get a reward score.
3. This reward signal is used to tune the LLM's policy.
Objective: Maximize rewards while preventing the model from deviating too much from its initial SFT model. This is crucial to avoid:
- Catastrophic Forgetting: Losing knowledge acquired during pre-training and SFT.
- Reward Hacking: Over-optimizing for the reward model's score, which may not perfectly align with true human preferences.
- Training Instabilities.
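
One common way to express "maximize reward without drifting from the SFT model" is to subtract a KL-style penalty from the reward-model score. The sketch below assumes per-token log-probabilities of the sampled completion are available from both the policy and the reference model; the coefficient name `beta` and its default value are illustrative.

```python
import math


def kl_divergence(p, q):
    """Exact KL(p || q) between two discrete distributions over the same tokens."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference.

    The sum of per-token log-probability differences on the sampled
    completion is a sample-based estimate of the KL divergence.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# No drift: the policy matches the reference, so the reward is unchanged.
print(kl_penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0]))
# Drift: the policy is far more confident than the reference, so the reward drops.
print(kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -2.0]))
```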
Proximal Policy Optimization (PPO): A popular RL algorithm for this task.
- Loss Function: Combines maximizing rewards (via advantage) with penalizing deviations from the reference model.
- Advantage Estimation: Uses a value function to estimate the expected reward from a given state, reducing variance and stabilizing training. The advantage is often calculated as reward - baseline (where baseline is the value function estimate).
- Generalized Advantage Estimation (GAE): A method for computing advantages.
- PPO Variants:
  - PPO Clip: Uses a clipping mechanism to limit the magnitude of policy updates, preventing drastic changes between iterations. The loss function involves a ratio of probabilities between the current and old policy, clipped within a certain range.
  - KL Penalty: Directly penalizes the KL divergence between the current policy and a reference policy (often the SFT model or the previous iteration's policy).
On-Policy Training: PPO is an on-policy algorithm, meaning it uses data generated by the current policy to update that same policy. This contrasts with off-policy methods that can use data from other policies.
Multiple Models: PPO requires managing several models: the policy LLM, the value function, the reward model, and the reference (SFT) model.
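
The advantage and clipping ideas can be sketched for a single token as follows. This is a simplified illustration, not a full PPO implementation: the clipping range `epsilon`, the discount `gamma`, and the GAE parameter `lam` are conventional names with illustrative defaults, and the value after the final token is assumed to be zero.

```python
import math


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one completion.

    rewards[t] and values[t] are the reward and the value-function
    estimate at step t.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0      # assumed value after the final token
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # reward minus baseline
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages


def ppo_clip_objective(logprob_new, logprob_old, advantage, epsilon=0.2):
    """Clipped surrogate objective for one action (to be maximized)."""
    ratio = math.exp(logprob_new - logprob_old)  # current policy / old policy
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to push the ratio
    # outside the range [1 - epsilon, 1 + epsilon].
    return min(ratio * advantage, clipped * advantage)


# The reward arrives only at the last token; the value function predicts 0.5 throughout.
print(gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0))
# A ratio of 1.5 with a positive advantage is capped at 1 + epsilon.
print(ppo_clip_objective(math.log(1.5), 0.0, 1.0))
```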

5. Alternatives to RLHF

The complexity and hyperparameter tuning involved in RLHF have led to the development of alternative methods.

Best of N (BoN):
- Method: Generate multiple completions (N) for a prompt using the SFT model, score them with the reward model, and return the highest-scoring one.
- Pros: Avoids RL training, simpler to implement.
- Cons: Significantly increases inference cost and latency due to multiple model queries. If all generated responses are bad, the output will also be bad.
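
Best of N is short enough to sketch directly. The `generate` and `reward_fn` callbacks below are hypothetical placeholders for sampling from the SFT model and scoring with the reward model.

```python
def best_of_n(prompt, generate, reward_fn, n=4):
    """Sample n completions and return the highest-scoring one with its score.

    generate(prompt) returns one sampled completion.
    reward_fn(prompt, completion) returns a scalar score.
    """
    candidates = [generate(prompt) for _ in range(n)]     # n generation calls
    scores = [reward_fn(prompt, c) for c in candidates]   # n reward-model calls
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


# Toy stand-ins: a fixed stream of "samples" and a length-based score.
samples = iter(["bad", "okay answer", "a much better answer", "meh"])
best, score = best_of_n(
    "Suggest a game.",
    generate=lambda prompt: next(samples),
    reward_fn=lambda prompt, completion: len(completion),
    n=4,
)
print(best, score)
```

Each request costs n generation calls plus n reward-model calls, which is the inference overhead noted in the cons.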
Direct Preference Optimization (DPO):
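- Method: Directly optimizes the LLM's policy using preference pairs without training a separate reward model or using RL. The loss function is derived from the Bradley-Terry formulation: it raises the likelihood of preferred responses relative to dispreferred ones, with both measured against a frozen reference model.
- Loss Function: $L_{DPO} = -E\left[\log \sigma\left(\beta \left(\log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right)\right]$, i.e., the log-probability ratio (policy over reference) of the winning completion minus that of the losing completion, scaled by a hyperparameter $\beta$ (similar to the KL penalty coefficient in PPO).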
- Pros: Simpler training process, fewer models required (policy and reference model), avoids RL complexities.
- Cons: Can suffer from distribution shift, because the preference pairs are fixed in advance and may differ significantly from the responses the policy itself would generate. PPO may still achieve higher performance in some cases.
- Key Insight: The paper proposing DPO is titled "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," highlighting that the policy itself can implicitly represent reward signals.
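
A per-pair DPO loss can be sketched as a sigmoid loss on the difference of log-probability ratios between the policy and the reference model. The sketch assumes the summed log-probabilities of each completion are already computed; the argument names and the default `beta` are illustrative.

```python
import math


def neg_log_sigmoid(z):
    """Compute -log(sigmoid(z)) in a numerically stable way."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the total log-probability of the winning (w) or
    losing (l) completion under the policy or the frozen reference model.
    """
    ratio_w = policy_logp_w - ref_logp_w  # log(pi_theta / pi_ref) for the winner
    ratio_l = policy_logp_l - ref_logp_l  # log(pi_theta / pi_ref) for the loser
    return neg_log_sigmoid(beta * (ratio_w - ratio_l))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
# Policy shifted toward the winner and away from the loser: the loss drops.
print(dpo_loss(-4.0, -7.0, -5.0, -6.0))
```

The quantity `beta * (ratio_w - ratio_l)` plays the role of a reward difference between the two responses, which is the sense in which the policy acts as an implicit reward model.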

6. Practical Considerations and Challenges

Training Stability: RL-based methods can be prone to instability. Techniques like clipping and KL penalties help mitigate this.
Hyperparameter Tuning: Both RLHF and DPO involve numerous hyperparameters that require careful tuning.
Evaluation Metrics: Monitoring training progress can be challenging. RL training is typically tracked through the average reward, which is a less direct signal than the cross-entropy loss in SFT, where the loss measures how closely the model reproduces the target responses.
Exploration vs. Exploitation: In RL, ensuring sufficient exploration of the output space is crucial for discovering better responses.
SFT vs. Preference Tuning: While SFT teaches what to generate, preference tuning focuses on how to generate it, refining tone and style without necessarily learning new facts.

7. Conclusion

Preference tuning is a vital step in developing helpful and aligned LLMs. While RLHF offers a powerful framework, its complexity has spurred the development of more accessible methods like BoN and DPO. DPO, in particular, presents a compelling supervised approach that directly optimizes for human preferences, offering a more streamlined path to achieving desired model behaviors. The choice of method often depends on computational resources, desired performance levels, and the expertise available for managing complex training pipelines.
