Stanford CS25: V5 I Large Language Model Reasoning, Denny Zhou of Google DeepMind
Key Concepts
- Large Language Model (LLM) Reasoning: Intermediate tokens between input and output, representing reasoning steps.
- Chain of Thought (CoT) Prompting: Providing example reasoning steps to guide the LLM's output.
- Self-Consistency: Generating multiple responses and selecting the most frequent answer.
- Supervised Fine-Tuning (SFT): Training LLMs on datasets of problems and human-annotated solutions.
- AI-Tuning: Fine-tuning on model-generated (AI-generated) data instead of human-annotated data.
- Verification: Crucial for AI-Tuning, ensuring the correctness of generated reasoning paths.
- Aggregation: Combining multiple responses to improve accuracy.
- Retrieval: Incorporating relevant information from external sources to enhance reasoning.
LLM Reasoning: Definition and Motivation
- Definition: LLM reasoning, in this context, refers to the intermediate tokens generated between the input and the final output. These tokens represent the steps the model takes to arrive at the answer.
- Historical Context: The idea of using intermediate tokens for problem-solving is not new. A 2017 DeepMind paper demonstrated using natural language to solve math problems, a concept also found in the neuro-symbolic literature.
- Motivating Example: The "last letter concatenation" task was used to illustrate the concept. Without reasoning, the model might directly output "LE." With reasoning, it would output steps like "The last letter of 'artificial' is L, the last letter of 'intelligence' is E; concatenating L and E gives LE."
- Theoretical Justification: A collaboration with Professor Tengyu Ma at Stanford showed that for any problem solvable by a Boolean circuit of size T, a constant-size transformer can solve it by generating O(T) intermediate tokens. This suggests that generating intermediate steps is crucial for solving complex problems.
Pre-trained Models and Decoding
- Premise: Pre-trained models are capable of reasoning without further prompt engineering or fine-tuning.
- Decoding is Key: The issue often lies in the decoding process. Greedy decoding, which selects the most probable token at each step, can lead to incorrect answers.
- Example: A question about apples ("I have three apples, my dad has two more apples than me, how many apples do we have in total?") might yield the incorrect answer "five apples" with greedy decoding.
- Chain of Thought Decoding: By exploring more generation candidates (beyond greedy decoding), the model can produce reasoning steps. For example, the second candidate might be "I have three apples, and my dad has two more apples than me, so he has five apples, and 3 + 5 equals 8."
- Confidence-Based Selection: The response with the highest confidence on the final answer token (e.g., the token "8" in the apple example) is selected. In the example, the model's confidence in the token "8" was nearly 98%.
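The selection step above can be sketched as follows. This is a toy illustration, not the paper's implementation: it assumes we already have several decoding candidates together with the model's probability for each token of the final answer, and it picks the candidate whose answer tokens the model is most confident about.

```python
# Toy sketch of confidence-based selection (CoT-decoding). The candidate
# strings and probabilities below are illustrative, mirroring the apple
# example: the greedy path answers "5", a chain-of-thought path answers "8".

def answer_confidence(answer_token_probs):
    """Average model probability over the final-answer tokens."""
    return sum(answer_token_probs) / len(answer_token_probs)

def select_by_confidence(candidates):
    """candidates: list of (final_answer, [prob of each answer token])."""
    return max(candidates, key=lambda c: answer_confidence(c[1]))[0]

candidates = [
    ("5", [0.41]),   # greedy decoding, no reasoning steps
    ("8", [0.98]),   # chain-of-thought candidate, ~98% confidence
    ("7", [0.33]),
]
print(select_by_confidence(candidates))  # -> 8
```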
Prompting Techniques
- Chain of Thought Prompting: Providing a similar problem with its step-by-step solution as an example before the actual question. This changes the output distribution, pushing chain-of-thought solutions to the top.
- "Let's Think Step by Step" (Zero-Shot CoT): Appending this generic instruction encourages the model to generate reasoning steps. While simple, it often performs worse than few-shot CoT prompting.
- Pitfalls of Prompting: CoT prompting requires task-specific examples, and "let's think step by step" can be unreliable.
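Few-shot CoT prompting amounts to prepending worked examples to the question. A minimal sketch of building such a prompt (the exemplar and its wording are illustrative, not taken from the talk):

```python
# Build a few-shot chain-of-thought prompt: each exemplar shows the
# reasoning steps before the final answer, shifting the output
# distribution toward step-by-step solutions.

def build_cot_prompt(exemplars, question):
    parts = []
    for q, steps, answer in exemplars:
        parts.append(f"Q: {q}\nA: {steps} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplars = [(
    "What is the last-letter concatenation of 'machine learning'?",
    "The last letter of 'machine' is 'e'. The last letter of "
    "'learning' is 'g'. Concatenating 'e' and 'g' gives 'eg'.",
    "eg",
)]
prompt = build_cot_prompt(
    exemplars,
    "What is the last-letter concatenation of 'artificial intelligence'?",
)
print(prompt)
```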
Supervised Fine-Tuning (SFT)
- Process: Collect a dataset of problems and human-annotated step-by-step solutions. Maximize the likelihood of the human solutions during training.
- Examples: The 2017 DeepMind paper and OpenAI's GSM8K dataset are examples of this approach.
- Limitations: SFT doesn't generalize well. Scaling the data alone doesn't solve the problem if the underlying paradigm is flawed.
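"Maximize the likelihood of the human solutions" means minimizing the summed negative log-probability of each solution's tokens under the model. A toy sketch (the per-token probabilities are illustrative):

```python
import math

# SFT objective: minimize the negative log-likelihood (NLL) of the
# human-annotated solution's tokens under the model.

def nll(token_probs):
    """Negative log-likelihood of one target sequence."""
    return -sum(math.log(p) for p in token_probs)

# Model's current probability for each token of the annotated solution.
solution_token_probs = [0.9, 0.8, 0.95]
loss = nll(solution_token_probs)  # training pushes this toward 0
```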
AI-Tuning: Self-Improvement
- Motivation: To address the generalization failure of SFT.
- Process:
- Let the model generate step-by-step solutions for a set of problems.
- Use a verifier (if available) to check the correctness of the final answer.
- Select the reasoning path if the answer is correct; otherwise, reject it.
- Fine-tune the model using this model-generated data.
- Self-Improvement Loop: The improved model can then be used to generate more training data, leading to a self-improvement cycle. This is similar to reinforcement learning fine-tuning.
- Key Insight: Data generated by a better model can be more effective for training than human data.
- Importance of Verification: A reliable verifier is crucial for AI-Tuning.
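One round of the loop above (sample, verify, keep, fine-tune) can be sketched as follows. This is a toy illustration: `sample_solution` stands in for a real model call, and the verifier is a simple exact-match check against a known gold answer.

```python
# One self-improvement round: sample step-by-step solutions, keep only
# those the verifier accepts, and return the kept (question, reasoning,
# answer) triples, which would then be used to fine-tune the model.

def verify(problem, answer):
    """Verifier: here, exact match against a known gold answer."""
    return answer == problem["gold"]

def self_improvement_round(problems, sample_solution, k=4):
    kept = []
    for p in problems:
        for _ in range(k):                  # up to k samples per problem
            steps, answer = sample_solution(p)
            if verify(p, answer):           # keep only verified paths
                kept.append((p["question"], steps, answer))
                break                       # one accepted path is enough
    return kept                             # fine-tuning data

# Deterministic toy stand-in for the model: wrong once, then correct.
calls = {"n": 0}
def sample_solution(p):
    calls["n"] += 1
    guess = "wrong" if calls["n"] == 1 else p["gold"]
    return (f"step-by-step reasoning for {p['question']}", guess)

problems = [{"question": "3 + 5", "gold": "8"}]
data = self_improvement_round(problems, sample_solution)
print(data)  # one verified (question, reasoning, answer) triple
```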
Mathematical Foundation of AI-Tuning
- Goal: Build a model for reasoning by directly optimizing a metric R that measures generation quality.
- Formula: Maximize the expected value of R (response quality) given the problem and model parameters (θ).
- Policy Gradient: Since the model is probabilistic, sampling is used to compute the expectation, leading to a policy gradient approach.
- Scaling AI-Tuning: Scale the output length (length of CoT) rather than the model size.
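Written out explicitly, the objective and its policy-gradient estimator take the standard REINFORCE form (the notation below is assumed for illustration, not copied from the slides):

```latex
J(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[ R(x, y) \right]
\qquad
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[ R(x, y)\, \nabla_\theta \log p_\theta(y \mid x) \right]
  \approx \frac{1}{N} \sum_{i=1}^{N} R(x, y_i)\, \nabla_\theta \log p_\theta(y_i \mid x),
\quad y_i \sim p_\theta(\cdot \mid x)
```

Because the expectation is over the model's own samples, it is estimated by sampling N responses, which is what makes this a policy-gradient method.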
LLM Reasoning vs. Classical AI
- Key Difference: LLM reasoning emerges from token-to-token prediction, unlike the exhaustive search used in classical AI.
- Example: Gemini 2.0 was able to solve a math problem (using numbers 1 to 10 to make 2025) by generating human-like reasoning steps without explicit search.
- Learning is Scalable: Emphasizing that learning is the key to scalable intelligence, rather than relying on search.
Limitations of AI-Tuning
- Verifiable Tasks: AI-Tuning is primarily applicable to automatically verifiable tasks (e.g., math problems).
- Non-Verifiable Tasks: It's challenging to apply AI-Tuning to tasks like creative writing or complex coding projects where verification is subjective or difficult.
Improving Reasoning: Aggregation and Retrieval
- Aggregation (Self-Consistency):
- Problem: The model is designed to predict the next token, but the goal is to choose the most confident answer.
- Solution: Generate multiple responses by random sampling and choose the answer that appears most frequently. This approximates marginalization (summing over all possible reasoning paths).
- Example: For a math problem, generate multiple responses (e.g., $18, $26, $18) and choose the most frequent answer ($18).
- Benefits: Self-consistency can significantly improve accuracy.
- Self-Calibration: Higher consistency indicates higher accuracy.
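The aggregation step is just a majority vote over the final answers of sampled responses; the sampled answers below mirror the $18 / $26 / $18 example from the notes.

```python
from collections import Counter

# Self-consistency: sample several chain-of-thought responses, extract
# each final answer, and return the most frequent one. This approximates
# marginalizing over reasoning paths.

def self_consistency(answers):
    """Majority vote over final answers from sampled responses."""
    return Counter(answers).most_common(1)[0][0]

sampled_answers = ["$18", "$26", "$18"]
print(self_consistency(sampled_answers))  # -> $18
```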
- Retrieval:
- Motivation: Incorporate relevant information from external sources to enhance reasoning.
- Example: For a geometry problem, prompting the model to recall related problems (e.g., finding the distance between two points) can help it solve the problem.
- Step-Back Prompting: For physics problems, encourage the model to first consider more abstract principles before solving the problem.
- Deep Research: A similar idea, where the model finds similar problems or knowledge to aid in problem-solving.
Summary and Future Directions
- Key Takeaways:
- LLM reasoning is improved by using intermediate steps.
- AI-Tuning is better than SFT.
- Aggregating multiple answers is better than relying on a single answer.
- Retrieval plus reasoning is better than reasoning alone.
- Future Research:
- Solving tasks that lack uniquely verifiable answers.
- Building real-world applications instead of just focusing on benchmarks.