How to Solve the Biggest Problem with AI
By Futurepedia
Reducing Hallucinations in Large Language Models
Key Concepts:
- Hallucinations: LLMs confidently generating false or misleading information.
- Retrieval Augmented Generation (RAG): Grounding LLM responses in retrieved external data.
- Chain of Verification: Systematically fact-checking LLM outputs.
- Self-Consistency: Running prompts multiple times and taking the majority vote answer.
- LLM Council: Utilizing multiple LLMs and a "chairman" model for comprehensive evaluation.
- Chain of Thought Reasoning: LLM reasoning step-by-step, which can amplify errors if the initial premise is flawed.
- Tool Augmented Verification: Utilizing search engines or databases during the verification step.
I. The Prevalence and Impact of Hallucinations
Large Language Models (LLMs) – including ChatGPT, Gemini, Claude, and Grok – consistently exhibit hallucinations, confidently presenting fabricated information as fact. A particularly striking example cited is the models' erroneous claims about the existence of a seahorse emoji, an error that has persisted across model versions. While often humorous in isolated cases, hallucinations become significantly problematic when they are interwoven with partial truths and delivered with convincing articulation, which makes them difficult to detect.
II. Retrieval Augmented Generation (RAG) – Grounding LLM Responses
The most effective method for mitigating hallucinations is Retrieval Augmented Generation (RAG). LLMs fundamentally "guess" the next word based on learned patterns; RAG addresses this by supplying external, verifiable information for the model to draw on. A recent survey revealed a common misconception about how LLMs function: 45% of respondents believed they look up answers in a database, 21% thought they follow pre-written scripts, and only 28% understood that they predict the most likely next word.
NotebookLM is highlighted as the most accessible tool for implementing RAG. It allows users to upload various sources (PDFs, YouTube videos, websites) or discover relevant sources directly within the platform (up to 50 sources per notebook on the free plan, with a limit of 100 notebooks). NotebookLM automatically generates citations, enabling users to verify the source of information. However, the quality of the sources remains crucial; RAG doesn’t eliminate issues stemming from biased or inaccurate source material.
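For readers who want to apply the same grounding pattern outside NotebookLM, the following is a minimal sketch of the RAG idea: retrieve the most relevant passages from your own sources and instruct the model to answer only from them. The `ask_llm` helper is a placeholder for whatever chat-completion client you use (OpenAI, Gemini, Claude, etc.), and the keyword-overlap retrieval is a crude stand-in for the embedding-based search a production system would use.

```python
# Minimal RAG sketch. ask_llm is a placeholder for your LLM provider's API;
# the keyword-overlap scoring is a stand-in for real embedding-based retrieval.

def ask_llm(prompt: str) -> str: raise NotImplementedError  # plug in your provider here

def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks sharing the most words with the question (crude retrieval)."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:k]

def rag_answer(question: str, source_chunks: list[str]) -> str:
    context = "\n\n".join(top_chunks(question, source_chunks))
    prompt = (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```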
III. Verification Prompts for Enhanced Accuracy
To address potential source biases and gaps, three specific verification prompts are recommended:
- Contradiction Check: “Looking only at the sources in this notebook, identify any areas where the sources disagree with each other and any clear contradictions or conflicting claims.” This identifies inconsistencies within the source material.
- Gap Identification: “Based on these sources, what important questions or subtopics about the topic are missing or barely covered? List the biggest gaps that would need to be filled to really understand this topic well. Do not invent details. Just describe what is missing.” This highlights areas where knowledge is incomplete.
- Perspective Search: “Are there any contrarian, alternative, or lesser-known viewpoints on this topic that are likely not represented in these sources? Describe those possible viewpoints at a high level and suggest what kinds of sources I would need to look for to find them.” This encourages exploration beyond mainstream perspectives.
These prompts are designed to be used proactively before drawing conclusions, and a comprehensive list of prompts is available via a link in the video description.
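For anyone running these checks outside NotebookLM, a short sketch of the same workflow is shown below. The `ask_llm` helper is a placeholder for whatever chat client you use, and the source text is whatever material you would otherwise upload.

```python
# Run the three verification prompts against your own source text.
# ask_llm is a placeholder for any chat-completion client.

def ask_llm(prompt: str) -> str: raise NotImplementedError  # plug in your provider here

VERIFICATION_PROMPTS = {
    "contradictions": (
        "Looking only at the sources below, identify any areas where the sources "
        "disagree with each other and any clear contradictions or conflicting claims."
    ),
    "gaps": (
        "Based on these sources, what important questions or subtopics are missing or "
        "barely covered? Do not invent details. Just describe what is missing."
    ),
    "perspectives": (
        "Are there any contrarian, alternative, or lesser-known viewpoints on this topic "
        "that are likely not represented in these sources? Describe them at a high level."
    ),
}

def audit_sources(source_text: str) -> dict[str, str]:
    """Return each verification prompt's answer, keyed by check name."""
    return {name: ask_llm(f"{prompt}\n\nSources:\n{source_text}")
            for name, prompt in VERIFICATION_PROMPTS.items()}
```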
IV. Techniques for Standard LLM Interfaces (ChatGPT, Gemini)
For users interacting directly with LLMs like ChatGPT or Gemini, the following techniques are recommended:
- Enable Search: Ensure the LLM utilizes search functionality, or explicitly instruct it to do so.
- Provide Context: Upload relevant files or URLs to provide the LLM with specific information.
- Constrain Responses: Prompt the LLM to “only use the facts it finds through search or uploaded documents” and to “say I don’t know” if the answer isn’t explicitly present in the provided context.
- Narrow Questions: Ask specific, focused questions to minimize the need for the LLM to extrapolate.
- Confidence Levels: Request the LLM to assign confidence levels (high, medium, low) to each claim and to list any uncertainties.
The prompt structure is: “Here is some context… Answer this question. If the answer is not clearly in the text, say I don’t know.”
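Put together, these techniques amount to one reusable prompt template. The sketch below is one possible way to assemble it; `ask_llm` is a placeholder for any chat client, and the context is whatever file or URL content you have pasted in.

```python
# Sketch of the constrained-prompt pattern: grounded context, a narrow question,
# an explicit "I don't know" escape hatch, and per-claim confidence levels.

def ask_llm(prompt: str) -> str: raise NotImplementedError  # plug in your provider here

def grounded_prompt(context: str, question: str) -> str:
    return (
        f"Here is some context:\n{context}\n\n"
        f"Answer this question: {question}\n"
        "Only use facts found in the context above. "
        "If the answer is not clearly in the text, say \"I don't know.\"\n"
        "Give each claim a confidence level (high, medium, low) and list any uncertainties."
    )

# Usage (hypothetical): answer = ask_llm(grounded_prompt(pasted_text, "Your narrow question"))
```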
V. Chain of Verification – Systematic Fact-Checking
To further enhance accuracy, the “Chain of Verification” method is introduced. This involves a two-step process:
- Initial Generation: The LLM generates an initial response.
- Factual Claim Extraction & Verification: The response is analyzed to extract all factual claims (dates, names, statistics, etc.). These claims are then converted into standalone questions and verified using the LLM’s search tool. Conflicting search results are noted.
Finally, the original question is re-answered using only the verified information. This separates generation from verification, reducing the risk of propagating errors. Research from Google DeepMind supports the effectiveness of this method.
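A compact sketch of this loop is shown below, using a placeholder `ask_llm` call in place of a real client; the prompts are illustrative and would be tuned in practice.

```python
# Chain-of-verification sketch: draft, extract claims, verify each claim
# separately (ideally with search enabled), then re-answer from verified facts only.

def ask_llm(prompt: str) -> str: raise NotImplementedError  # plug in your provider here

def chain_of_verification(question: str) -> str:
    draft = ask_llm(question)  # step 1: initial generation

    # step 2a: extract every factual claim as a standalone question
    claims = ask_llm(
        "List every factual claim (dates, names, statistics) in the text below, "
        "one per line, each rewritten as a standalone question:\n\n" + draft
    ).splitlines()

    # step 2b: verify each claim independently, noting conflicting sources
    verified = [
        ask_llm(f"Using search, answer this and note any conflicting sources: {claim}")
        for claim in claims if claim.strip()
    ]

    # step 3: re-answer using only the verified information, not the original draft
    return ask_llm(
        f"Question: {question}\n\nVerified facts:\n" + "\n".join(verified)
        + "\n\nAnswer the question using only these verified facts."
    )
```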
VI. Advanced Reasoning Verification – The LLM Council & Self-Consistency
While RAG and Chain of Verification address factual accuracy, verifying reasoning is equally important. Research indicates that Chain of Thought reasoning (asking the LLM to think step-by-step) can amplify errors if the initial premise is flawed.
Three techniques are presented:
- Self-Consistency: Running the same prompt multiple times (5-40) and selecting the majority-vote answer. This leverages the fact that hallucinations tend to vary across attempts (a minimal sketch follows this list).
- The Auditor: Utilizing a second LLM to evaluate the response generated by the first. LLMs are often better at critique than creation.
- LLM Council: A tool developed by Andrej Karpathy that automates querying multiple LLMs, having them review each other’s responses anonymously, and compiling a final answer based on a “chairman” model’s evaluation.
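Self-consistency is the easiest of the three to script yourself. The sketch below runs the same prompt several times and keeps the most common answer; `ask_llm` is again a placeholder, and in practice the calls would use some sampling temperature so the runs actually differ.

```python
# Self-consistency sketch: repeat the prompt and take the majority-vote answer.
from collections import Counter

def ask_llm(prompt: str) -> str: raise NotImplementedError  # plug in your provider here

def self_consistent_answer(prompt: str, runs: int = 7) -> str:
    """Ask the same question `runs` times and return the most frequent final answer."""
    answers = [ask_llm(prompt + "\nGive only the final answer.").strip() for _ in range(runs)]
    best, votes = Counter(answers).most_common(1)[0]
    return f"{best} (agreed in {votes} of {runs} runs)"
```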
Combining these methods – RAG, Chain of Verification, Self-Consistency, and the LLM Council – provides a robust approach to minimizing hallucinations and ensuring accurate, well-reasoned responses.
VII. Synthesis & Conclusion
The video emphasizes a layered approach to mitigating LLM hallucinations. Starting with grounding responses in verifiable sources via RAG (using tools like NotebookLM), then systematically verifying facts through Chain of Verification, and finally leveraging techniques like Self-Consistency and the LLM Council to assess reasoning, users can significantly improve the reliability of LLM outputs. The level of rigor should be proportional to the stakes of the task. While no technique eliminates hallucinations entirely, these methods substantially reduce their occurrence and make them easier to detect. Resources, including prompts and links to tools, are provided in the video description. Further learning is available through a comprehensive AI course platform, Futurepedia.