Why LLMs Hallucinate (and How to Stop It)

By Prompt Engineering

AI · Technology · Education

Key Concepts

Hallucinations in Large Language Models (LLMs), Pre-training, Post-training, Next-Word Prediction, Token Distribution, Self-Attention, Cross-Entropy, Binary Classification, Abstention, Reward Function, Accuracy-Based Evaluation, Hallucination Evaluation, Uncertainty Expression, Negative Marking, Leaderboard Incentives.

Why Language Models Hallucinate: An OpenAI Perspective

The video discusses a paper from OpenAI that explains why language models hallucinate, arguing that hallucinations are not mysterious: they originate as ordinary errors in binary classification and persist because training and evaluation reward confident guessing over admitting uncertainty.

How Language Models Work

  • Pre-training and Post-training: LLMs are trained in two stages. First, they are pre-trained on massive text datasets to learn statistical patterns and predict the next word from the token distribution; they are then post-trained to shape how they respond to users.
  • Beyond Next-Word Prediction: While often described as next-word predictors, LLMs perform more complex tasks: they track grammar, maintain long-distance context through self-attention, draw on world knowledge, and filter out distractors.
    • Example: The example given is "In Paris, the capital of France, the primary language spoken is..." The model needs to understand the relationship between Paris and France, the concept of a primary language, and filter out irrelevant information to predict "French."
  • Probability Distribution: The model outputs a probability distribution over the next token. Even on a topic it knows nothing about, it still picks a token from that distribution, which yields confident but incorrect answers, i.e., hallucinations (see the sketch below).
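
A minimal, self-contained sketch of this idea: a toy softmax over invented candidate tokens and logits (not taken from any real model), showing that the model always emits some token from its distribution, whether or not any candidate is genuinely well-supported.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution over tokens."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy candidates for completing:
# "In Paris, the capital of France, the primary language spoken is ..."
# The logits below are invented for illustration, not taken from a real model.
candidates = ["French", "English", "Spanish", "Latin"]
logits = [4.2, 1.1, 0.7, 0.3]

probs = softmax(logits)
for token, p in zip(candidates, probs):
    print(f"{token:8s} {p:.3f}")

# The model always emits *some* token: it samples from this distribution
# (or takes the argmax) even when no candidate is genuinely supported by
# knowledge -- which is how fluent but wrong completions arise.
print("sampled:", random.choices(candidates, weights=probs, k=1)[0])
```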

The Role of Evaluation Mechanisms

  • Incentivizing Guessing: The current evaluation mechanism encourages generating answers rather than acknowledging uncertainty.
  • Cross-Entropy and Binary Classification: Training typically minimizes a cross-entropy loss on next-token prediction, which the paper recasts as a binary classification problem: each prediction is either right or wrong, and there is no credit for admitting uncertainty, so guessing is incentivized.
    • Analogy: "If you do not know the answer but take a wild guess, you might get lucky and be right. Leaving it blank guarantees a zero."
  • Accuracy-Based Evaluation: Models are graded on accuracy, i.e., the percentage of questions answered correctly, which further encourages guessing over admitting ignorance (the incentive arithmetic is sketched below).
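
The guessing incentive can be made concrete with a bit of expected-value arithmetic. The sketch below assumes an accuracy-only grader and an illustrative 25% chance of a lucky guess; under those assumptions, guessing strictly dominates abstaining.

```python
# Expected score on a question the model does not actually know, graded on
# accuracy alone (1 point if correct, 0 otherwise). The 25% lucky-guess
# probability is an illustrative assumption, not a measured number.

p_lucky_guess = 0.25  # e.g. four plausible candidates, pick one at random

expected_if_guess = p_lucky_guess * 1 + (1 - p_lucky_guess) * 0
expected_if_abstain = 0.0  # "I don't know" earns nothing on this scoreboard

print(f"expected score if guessing:   {expected_if_guess:.2f}")
print(f"expected score if abstaining: {expected_if_abstain:.2f}")
# Guessing strictly dominates abstaining, so a model optimized against this
# scoreboard learns to guess, i.e. to hallucinate when unsure.
```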

Proposed Solutions: Changing the Reward Function and Evaluation Metrics

  • Three Categories of Responses: For questions with a single right answer, responses fall into three categories: accurate answers, errors, and abstentions (declining to guess).
  • Incentivizing Abstention: To reduce hallucinations, the reward function needs to be changed to incentivize abstention. OpenAI is working on models that are encouraged to ask clarifying questions rather than always generating an answer.
  • GPT-5 Example: GPT-5 has drastically lower hallucination rates (sometimes less than 1%) compared to previous versions.
  • Benchmarking Incentives: Current benchmarks also incentivize hallucinations because accuracy dominates leaderboards and model cards.
  • Penalizing Confident Errors: Scoring should penalize confident errors more heavily than expressions of uncertainty and give partial credit for abstaining (a sketch of such a scoring rule follows this list).
    • Analogy: Standardized tests use negative marking for wrong answers or partial credit for leaving questions blank to discourage blind guessing.
  • Updating Accuracy-Based Evals: Widely used accuracy-based evaluations need to be updated to discourage guessing. "If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess."
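
As a rough illustration of the proposed scoring change, the sketch below uses invented point values (not the paper's exact numbers): negative marking for wrong answers and partial credit for abstaining. Under such a rule, the score-maximizing policy is to answer only above a confidence threshold.

```python
# Illustrative scoring rule (invented point values, not the paper's numbers):
# reward correct answers, penalize confident errors, give partial credit for
# expressing uncertainty.

CORRECT = 1.0
WRONG = -1.0    # negative marking for confident errors
ABSTAIN = 0.25  # partial credit for saying "I don't know"

def expected_score(p_correct: float, answer: bool) -> float:
    """Expected score for answering with confidence p_correct, or abstaining."""
    if not answer:
        return ABSTAIN
    return p_correct * CORRECT + (1 - p_correct) * WRONG

def best_policy(p_correct: float) -> str:
    """Answer only when the expected value of answering beats abstaining."""
    return "answer" if expected_score(p_correct, True) > ABSTAIN else "abstain"

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"confidence {p:.1f}: {best_policy(p)}")
# Break-even confidence is (ABSTAIN - WRONG) / (CORRECT - WRONG) = 0.625, so
# this scoring rule itself pushes the model to abstain when it is unsure.
```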

Addressing Common Claims About Hallucinations

  • Claim 1: Hallucinations will be eliminated by improving accuracy.
    • Refutation: Accuracy will never reach 100% because some real-world questions are inherently unanswerable, and as long as accuracy is the only thing rewarded, pushing it higher still incentivizes guessing, and therefore hallucinations.
  • Claim 2: Hallucinations are inevitable.
    • Refutation: Language models can abstain when uncertain and can be trained to say "I don't know" (a minimal sketch of threshold-based abstention appears after this list).
  • Claim 3: Avoiding hallucinations requires a degree of intelligence exclusively available to larger models.
    • Refutation: Smaller models may know their limits and abstain more readily. Larger models with more knowledge might struggle to assess their confidence level accurately.
    • Example: A smaller model asked about an unfamiliar topic might say it doesn't know, while a larger model with partial knowledge might hallucinate.
  • Claim 4: Hallucinations are a mysterious glitch.
    • Refutation: The statistical mechanisms through which hallucinations arise and are rewarded are understood.
  • Claim 5: A good hallucination evaluation is sufficient.
    • Refutation: A single hallucination evaluation is insufficient against hundreds of traditional accuracy-based evaluations that penalize humility and reward guessing. All primary evaluation metrics need to be reworked to reward the expression of uncertainty.
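
To make the refutation of Claim 2 concrete, here is a minimal sketch (not OpenAI's actual method) of threshold-based abstention: answer only when the top-choice probability clears a confidence threshold, otherwise say "I don't know." The candidates and probabilities are invented for illustration.

```python
# A minimal sketch (not OpenAI's method) of abstaining when uncertain:
# answer only if the model's top-choice probability clears a threshold.
# The candidate answers and probabilities are invented for illustration.

def answer_or_abstain(distribution: dict[str, float], threshold: float = 0.8) -> str:
    """Return the most likely answer, or 'I don't know' if confidence is low."""
    best_answer, confidence = max(distribution.items(), key=lambda kv: kv[1])
    return best_answer if confidence >= threshold else "I don't know"

# Well-supported question: one answer dominates the distribution.
print(answer_or_abstain({"Paris": 0.95, "Lyon": 0.03, "Marseille": 0.02}))
# Unfamiliar question: probability mass is spread out, so the model abstains.
print(answer_or_abstain({"1952": 0.30, "1948": 0.28, "1955": 0.22, "1960": 0.20}))
```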

Conclusion

The main takeaway is that relying solely on accuracy-based metrics for evaluating LLMs will not eliminate hallucinations. A fundamental shift in training objectives and evaluation metrics is needed to incentivize abstention and reward the expression of uncertainty.
