Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 7 - Agentic LLMs

Key Concepts

Reasoning Models vs. Vanilla LLMs: Reasoning models generate a hidden reasoning chain before the final response, which improves performance on tasks like math and coding; vanilla LLMs respond directly.
GRPO (Group Relative Policy Optimization): An RL algorithm for training reasoning models that doesn't train a value function and computes advantages relative to other completions of the same prompt.
Length Bias: A phenomenon where RL training can lead to models outputting longer responses, even if performance plateaus.
RAG (Retrieval Augmented Generation): A technique to augment LLM prompts with relevant information retrieved from an external knowledge base, addressing knowledge cutoffs and improving accuracy.
Knowledge Cutoff: The date up to which an LLM's training data is current, limiting its knowledge of recent events.
Context Length: The maximum amount of text an LLM can process at once, measured in tokens.
Needle in a Haystack Test: An experiment to evaluate an LLM's ability to retrieve specific information from a long prompt.
Embeddings: Numerical representations of text chunks used for semantic similarity search.
Chunking: Dividing documents into smaller, manageable pieces for knowledge base creation.
Candidate Retrieval: The first stage of RAG, filtering a large knowledge base down to a smaller set of potentially relevant documents.
Ranking/Re-ranking: The second stage of RAG, ordering the candidate documents to prioritize the most relevant ones.
Bi-encoder: An architecture where the query and document are encoded independently.
Cross-encoder: An architecture where the query and document are encoded together, allowing for richer interaction.
BM25: A heuristic scoring function based on keyword overlap, useful for keyword-specific retrieval.
Semantic Similarity: Finding documents with similar meaning, not necessarily identical keywords.
Hybrid Search: Combining semantic similarity search with heuristic-based search (e.g., BM25).
Contextualization: Prepending summaries or relevant information to chunks to improve their coherence.
Prompt Caching: A technique to reduce computation by reusing pre-computed activations for repeated prompt prefixes.
NDCG (Normalized Discounted Cumulative Gain): A metric for evaluating ranking quality, considering the position of relevant documents.
MRR (Mean Reciprocal Rank): A metric that considers the rank of the first relevant document.
Precision@K and Recall@K: Standard classification metrics adapted for ranking tasks.
Tool Calling/Function Calling: Enabling LLMs to interact with external systems by invoking predefined functions with specific arguments.
Agent: A system that autonomously pursues goals and completes tasks, often involving multiple tool calls and reasoning loops.
ReAct (Reason + Act): A framework for agents that decomposes tasks into observe, plan, and act stages.
Agent-to-Agent Protocol: A proposed standard for communication and interaction between different agents.
Safety Classifiers/Safeguards: Mechanisms to prevent LLMs from generating harmful or unsafe outputs, especially when using tools.
Data Exfiltration: A safety risk where sensitive data is leaked through tool usage.

Recap of Previous Lecture (Reasoning Models)

Last lecture focused on improving LLM reasoning capabilities. The key takeaway was the difference between "vanilla LLMs" and "reasoning models." While vanilla LLMs directly output a response to a prompt, reasoning models first generate a hidden "reasoning chain" before providing the final response. This process was shown to enhance performance on tasks like math and coding.

The lecture introduced GRPO (Group Relative Policy Optimization), a Reinforcement Learning (RL) algorithm for training these reasoning models. A notable characteristic of GRPO is that it does not train a value function. Instead, it computes an "advantage" for each output completion relative to the other completions sampled for the same prompt. By rewarding the model for both generating a reasoning chain and producing a good response, GRPO demonstrated improvements in performance on datasets like AIME (a challenging math problem set).

However, one challenge identified was length bias: models tended to produce increasingly longer outputs even as performance plateaued. This was attributed to a term in the GRPO loss formulation that weights each token's contribution according to the length of its response. Mitigation strategies discussed included alternative normalization factors (as in DAPO) or simply removing the normalization term (as in Dr. GRPO, "GRPO Done Right").
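
The group-relative advantage and the length effect can be sketched in a few lines. This is a minimal illustration, assuming the commonly described form of GRPO in which rewards are standardized within the group of completions for one prompt (the notes do not give the formula). The function names and the 0/1 rewards are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to the others for the same prompt.

    Assumption: standardize by the group's mean and standard deviation.
    No value function is involved; the group itself is the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

def per_token_weight(advantage, response_length):
    """Per-token weight when each response's loss is divided by its own length."""
    return advantage / response_length

# Four sampled completions for one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Two wrong responses with the same negative advantage but different lengths:
short_wrong = per_token_weight(-1.0, 100)
long_wrong = per_token_weight(-1.0, 1000)
```

Under per-response length normalization, each token of the longer wrong response is penalized less than each token of the shorter one, which is one way to see how such a term can favor long outputs.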

1. RAG (Retrieval Augmented Generation)

1.1 Addressing Knowledge Cutoffs and Evolving Information

Problem: LLMs have a knowledge cutoff date, meaning they are unaware of information that emerged after their training data was collected. For instance, a model trained until September 30, 2024, would not know about events after that date.
Limitations of Retraining/Fine-tuning:
- Regression Risk: Changing an LLM's knowledge can negatively impact its existing capabilities.
- Impracticality: Continuously retraining or fine-tuning for new information is computationally expensive and difficult to manage across multiple use cases.
Limitations of Naive Prompt Augmentation:
- Context Length Limits: LLMs have finite context windows (e.g., hundreds of thousands of tokens, roughly equivalent to hundreds of pages). Injecting all new information directly into the prompt is often infeasible.
- Performance Degradation: Feeding too much irrelevant information can confuse the LLM and degrade its performance, as demonstrated by the "needle in a haystack" test. This test shows that LLMs struggle to retrieve specific facts from very long prompts, especially if the fact is in the first half.
- Cost: LLM calls are priced per token, making longer prompts significantly more expensive.

1.2 The RAG Framework: Retrieve, Augment, Generate

RAG addresses these limitations by retrieving only the most relevant information from an external knowledge base and augmenting the prompt with it.

Core Idea: Instead of overwhelming the LLM with all new information, RAG intelligently fetches and injects only the necessary pieces.
Three Main Steps:
1. Retrieve: Given a user's prompt, find relevant documents from a knowledge base.
2. Augment: Add the retrieved information to the original prompt.
3. Generate: Feed the augmented prompt to the LLM to produce the final response.
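
The three steps above can be sketched as a small pipeline. This is a toy illustration only: the retriever ranks chunks by word overlap rather than embeddings, the prompt template is made up, and `generate` is a hypothetical callable standing in for whatever LLM is used.

```python
def retrieve(query, knowledge_base, k=2):
    """Toy retriever: rank chunks by the number of words shared with the query."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(chunk.lower().split())), chunk)
        for chunk in knowledge_base
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep at most k chunks and drop those with no overlap at all.
    return [chunk for score, chunk in scored[:k] if score > 0]

def augment(query, chunks):
    """Build the augmented prompt from the retrieved chunks and the original query."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}"

def rag_answer(query, knowledge_base, generate, k=2):
    """Retrieve, augment, then generate. `generate` maps a prompt string to a response."""
    chunks = retrieve(query, knowledge_base, k)
    prompt = augment(query, chunks)
    return generate(prompt)

knowledge_base = [
    "The store opens at 9am.",
    "Teddy bears are on aisle 4.",
    "Parking is free.",
]
# An identity "model" makes the augmented prompt visible for inspection.
augmented_prompt = rag_answer("where are teddy bears", knowledge_base, generate=lambda p: p)
```

Passing the model in as a callable keeps the sketch independent of any particular provider.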

1.3 Building the Knowledge Base

Document Collection: Gather all potentially useful external documents.
Chunking: Divide documents into smaller pieces called chunks, typically with a maximum length of a few hundred tokens.
- Chunk Size: A trade-off exists between chunk size and context. Too small, and context is lost; too large, and embeddings may not be meaningful. Around 500 tokens is a common choice.
- Overlap: Some overlap between consecutive chunks helps maintain context across chunk boundaries; it is typically in the low hundreds of tokens.
Embeddings: Compute numerical embeddings for each chunk. These embeddings capture the semantic meaning of the chunk.
- Embedding Model: Can be a pre-trained model (common) or a custom-trained one. The purpose is to represent chunks such that relevant ones are close in the embedding space.
- Embedding Size: Typically in the hundreds to low thousands of dimensions (e.g., around 1,500), with larger sizes potentially capturing more nuance but increasing storage and computational cost.
Hyperparameters: Key parameters to tune include embedding size, chunk size, and chunk overlap.
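
Chunking with overlap can be sketched as a sliding window over the token sequence. This is a minimal sketch, assuming the document is already tokenized; splitting on whitespace is only a stand-in for a real tokenizer, and the default sizes simply echo the figures mentioned above.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=100):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Small numbers make the overlap easy to see: each chunk repeats one token.
example_chunks = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
```

For raw text, `chunk_tokens(text.split())` gives a rough approximation; chunk size and overlap remain hyperparameters to tune.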

1.4 Retrieval Process: Two Stages

The retrieval process typically involves two stages to efficiently find relevant chunks from a potentially massive knowledge base.

Stage 1: Candidate Retrieval
- Goal: Quickly filter down a vast number of chunks to a smaller set of potentially relevant candidates (e.g., over 100).
- Method: Semantic similarity search using embeddings.
  - Embed the user's query.
  - Compute cosine similarity between the query embedding and the embeddings of all chunks.
  - Select the top-scoring chunks.
- Techniques: Approximate Nearest Neighbor (ANN) methods are often used for efficiency with large datasets.
- Model Architecture: Typically uses a bi-encoder setup, where the query and chunk are encoded independently. Sentence-BERT is a popular model for generating embeddings tailored for similarity search.
Stage 2: Ranking/Re-ranking
- Goal: Refine the list of candidates to ensure the most relevant documents are at the top.
- Method: Uses a more computationally intensive model to re-rank the candidate set.
- Model Architecture: Often employs a cross-encoder setup, where the query and chunk are processed together, allowing for deeper interaction and a more accurate relevance score.
- Hybrid Search: Combines semantic similarity search with heuristic-based methods like BM25, which scores based on keyword overlap. This is useful when exact keyword matching is important.
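
The scoring pieces described above can be sketched with the standard library only. This is an illustrative sketch under several assumptions: the cosine search is brute force (a real system would likely use an ANN index), BM25 uses whitespace tokens with commonly used default parameters, and the two rankings are merged with reciprocal rank fusion, which is one common way to combine rankings and is not something the lecture notes specify.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_by_cosine(query_vec, chunk_vecs, k):
    """Stage 1, brute force: indices of the k chunks most similar to the query."""
    order = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return order[:k]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """BM25 score of every document for the query (lowercased whitespace tokens)."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n or 1.0
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / length_norm
        scores.append(score)
    return scores

def reciprocal_rank_fusion(rankings, c=60):
    """Merge ranked lists of ids (assumed fusion rule; c is a smoothing constant)."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (c + rank)
    return [doc_id for doc_id, _ in sorted(fused.items(), key=lambda kv: kv[1], reverse=True)]

docs = [
    "teddy bear store near campus",
    "thermostat installation guide",
    "bear sightings in the national park",
]
keyword_scores = bm25_scores("teddy bear", docs)
```

A cross-encoder re-ranker would then rescore only the fused candidates, since it is too expensive to run over the whole knowledge base.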

1.5 Addressing Chunking and Query-Document Mismatch

Coherent Chunks: Naive chunking can lead to chunks that lack context.
- Contextualization: An LLM can be used to generate a short summary or context for each chunk, based on the entire document. This can be computationally expensive, but prompt caching can mitigate costs by reusing computations for repeated prompt prefixes.
Query-Document Embedding Mismatch: Queries (short, question-like) and documents (longer text) may have different characteristics, making direct embedding comparison less effective.
- HyDE (Hypothetical Document Embeddings): Uses an LLM to generate a hypothetical document that answers the prompt, then embeds that document and uses its embedding, instead of the raw query's, for the similarity search.
- Separate Encoders: Using different encoders for queries and documents.
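
The HyDE idea can be sketched as a thin wrapper around two callables. This is a minimal sketch: `generate` and `embed` are hypothetical stand-ins for an LLM and an embedding model, the prompt wording is made up, and only a single hypothetical document is generated.

```python
def hyde_query_embedding(query, generate, embed):
    """Embed an LLM-written hypothetical answer passage instead of the raw query."""
    prompt = (
        "Write a short passage that answers the question.\n\n"
        f"Question: {query}\n\nPassage:"
    )
    hypothetical_document = generate(prompt)
    # The returned vector is compared against chunk embeddings as usual.
    return embed(hypothetical_document)

# Stand-ins that make the data flow visible without calling any model.
def fake_generate(prompt):
    return "Teddy bears are sold at toy stores near campus."

def fake_embed(text):
    return [float(len(text.split()))]

query_vector = hyde_query_embedding("Where can I buy a teddy bear?", fake_generate, fake_embed)
```

The point of the design is that the generated passage looks like a document, so its embedding lands closer to real chunks than a short question would.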

1.6 Evaluation Metrics for Retrieval

To assess the performance of the retrieval system, several metrics are used, similar to those in recommendation systems and search:

NDCG (Normalized Discounted Cumulative Gain): Measures the quality of a ranking by considering the position of relevant documents. Higher scores are given to relevant documents ranked higher.
MRR (Mean Reciprocal Rank): The reciprocal of the rank of the first relevant document, averaged over all queries.
Precision@K: The proportion of retrieved documents in the top K that are actually relevant.
Recall@K: The proportion of all relevant documents that are found within the top K retrieved documents.
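
The four metrics above can be written out directly. This is a sketch using standard textbook definitions (NDCG with a log2 position discount and linear gains); the document ids and relevance labels in the example are made up.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

def ndcg_at_k(ranked, gains, k):
    """gains maps doc id to graded relevance; missing ids count as 0."""
    def dcg(values):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(gain / math.log2(i + 2) for i, gain in enumerate(values))
    actual = dcg([gains.get(doc, 0) for doc in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0

ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
p_at_2 = precision_at_k(ranked, relevant, 2)
r_at_2 = recall_at_k(ranked, relevant, 2)
mrr = mean_reciprocal_rank([(ranked, relevant), (["x", "y"], {"x"})])
ndcg = ndcg_at_k(ranked, {"b": 1, "d": 1}, 4)
```

In this example the first relevant document sits at rank 2, so both precision and recall at K=2 are 0.5, and NDCG is below 1 because the ideal ordering would place both relevant documents first.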

Benchmarks: Datasets like the Massive Text Embedding Benchmark (MTEB) are used to evaluate retriever performance.

2. Tool Calling and Agents

2.1 Tool Calling: Extending LLM Capabilities

Concept: Tool calling (or function calling) allows LLMs to interact with external systems and perform actions by invoking predefined functions. This addresses the LLM's inability to access real-time data or perform computations beyond its internal knowledge.
Definition (IBM): "Tool calling allows autonomous systems to complete complex tasks by dynamically accessing and acting upon external resources."
Key Aspects:
- Task Completion: LLMs can now complete tasks that require external information or actions.
- External Resources: Access to APIs, databases, or other services.
Example: Finding a teddy bear near Stanford. A vanilla LLM would be unable to provide real-time availability. With tool calling, it can invoke a "find teddy bear" function.

2.2 Function Definition and Usage

Function API: Tools are defined by their API, including:
- Name: The identifier of the function.
- Description: A clear explanation of what the function does.
- Input Arguments: The parameters the function expects.
- Output: The structure and type of data the function returns.
LLM's Role: The LLM's task is to:
1. Recognize the need for a tool based on the user's query.
2. Identify the correct tool from a set of available tools.
3. Infer the correct arguments to pass to the tool.
Implementation Details: The actual code implementation of the tool is not exposed to the LLM; only the API definition and documentation are provided.
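
A tool definition along these lines might look as follows. This is a hypothetical example: the field layout loosely follows the JSON-Schema style many providers use but is not any specific provider's format, and `find_teddy_bear`, its parameters, and the store data are invented for illustration.

```python
import json

# What the LLM sees: name, description, input arguments, and output shape.
FIND_TEDDY_BEAR_TOOL = {
    "name": "find_teddy_bear",
    "description": "Find stores near a location that have teddy bears in stock.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City or place name."},
            "max_distance_miles": {"type": "number", "description": "Search radius."},
        },
        "required": ["location"],
    },
    "returns": {
        "type": "array",
        "items": {"store": "string", "distance_miles": "number", "in_stock": "boolean"},
    },
}

def find_teddy_bear(location, max_distance_miles=5.0):
    """What the LLM does not see: the implementation (canned data here)."""
    stores = [{"store": "Example Toys", "distance_miles": 1.2, "in_stock": True}]
    return [s for s in stores if s["distance_miles"] <= max_distance_miles]

# Only the serialized definition is placed in the model's context.
tool_preamble = json.dumps([FIND_TEDDY_BEAR_TOOL], indent=2)
```

Keeping the definition as data separate from the function makes it clear which part reaches the model and which part stays on the system side.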

2.3 The Three-Stage Process for Tool Calling

Preamble with Function API: The LLM's context includes the function API definitions and their documentation.
Function Call: The LLM, based on the user query, selects the appropriate function and generates its arguments. The system, not the LLM, then executes the function.
Response Generation: The structured output from the executed function is fed back to the LLM, which then generates a natural language response to the user.
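
The three stages can be sketched end to end with a scripted stand-in for the model. Everything here is hypothetical: the prompt wording, the JSON call format, the `find_teddy_bear` tool, and `scripted_llm`, which returns canned strings so the flow can run without a real LLM.

```python
import json

def find_teddy_bear(location):
    """Hypothetical tool returning canned data."""
    return {"store": "Example Toys", "location": location, "in_stock": True}

TOOLS = {"find_teddy_bear": find_teddy_bear}
TOOL_DOCS = [{
    "name": "find_teddy_bear",
    "description": "Find a store near a location with teddy bears in stock.",
    "parameters": {"location": "string"},
}]

def run_tool_call(user_query, llm):
    # Stage 1: the preamble exposes the function API to the model.
    preamble = "You can call these tools:\n" + json.dumps(TOOL_DOCS)
    # Stage 2: the model emits a call as JSON; the system validates and executes it.
    call = json.loads(llm(f"{preamble}\n\nUser: {user_query}\nTool call (JSON):"))
    if call["name"] not in TOOLS:
        raise ValueError(f"unknown tool: {call['name']}")
    result = TOOLS[call["name"]](**call["arguments"])
    # Stage 3: the structured result goes back to the model for a natural-language reply.
    return llm(f"User: {user_query}\nTool result: {json.dumps(result)}\nAnswer:")

def scripted_llm(prompt):
    """Stand-in for a real model: one canned tool call, one canned answer."""
    if prompt.endswith("Tool call (JSON):"):
        return json.dumps({"name": "find_teddy_bear", "arguments": {"location": "Stanford"}})
    return "Example Toys near Stanford has teddy bears in stock."

answer = run_tool_call("Find a teddy bear near Stanford", scripted_llm)
```

Note that the model is called twice: once to produce the call and once to phrase the result.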

2.4 Training for Tool Calling

Supervised Fine-Tuning (SFT):
- Stage 1 SFT: Training the LLM to map user queries and function APIs to the correct function arguments.
- Stage 2 SFT: Training the LLM to convert the structured tool output into a natural language response, considering the entire conversation history.
Zero-Shot/Few-Shot Learning:
- Prompt Engineering: Providing detailed explanations and examples within the prompt to guide the LLM's tool usage.
- Iterative Explanation Refinement: Using an LLM to generate and refine explanations for tool usage, evaluated against a set of SFT pairs. This avoids manual prompt writing.

2.5 Handling Multiple Tools and Tool Selection

Challenge: Providing too many tools in the context can lead to the "needle in a haystack" problem, where the LLM struggles to identify the correct tool.
Tool Selection/Routing: A system that first filters a large set of tools down to a smaller, relevant subset before passing them to the LLM. This can be done using another LLM or a dedicated selector.
Standardization (MCP - Model Context Protocol): A protocol developed by Anthropic to standardize how tools are exposed to LLMs, ensuring interoperability across different LLM providers. It defines concepts like MCP servers, tools, prompts, and resources.
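
The tool selection step described above can be sketched as a filter that runs before the LLM call. This is a deliberately simple stand-in: it scores tools by word overlap between the query and each description, whereas a real selector would more likely use embeddings or another LLM. The tool names and descriptions are hypothetical.

```python
def select_tools(query, tools, k=2):
    """Keep up to k tools whose descriptions share the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(tool):
        return len(query_words & set(tool["description"].lower().split()))

    ranked = sorted(tools, key=overlap, reverse=True)
    # Tools with no overlap at all are dropped even if fewer than k remain.
    return [tool for tool in ranked[:k] if overlap(tool) > 0]

ALL_TOOLS = [
    {"name": "get_current_room_temperature", "description": "read the current room temperature"},
    {"name": "set_thermostat", "description": "set the thermostat target temperature"},
    {"name": "send_email", "description": "send an email message"},
]
selected = select_tools("what is the room temperature", ALL_TOOLS)
selected_names = [tool["name"] for tool in selected]
```

Only the selected subset would then be placed in the model's context, which keeps the preamble short.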

2.6 Agents: Autonomous Goal Pursuit

Definition: An agent is a system that autonomously pursues goals and completes tasks on a user's behalf, often involving multiple steps and reasoning loops. This is a layer above simple tool calling.
Key Difference from Tools: Agents incorporate higher-level reasoning and can perform sequences of actions, not just single function calls.
ReAct (Reason + Act) Framework: Decomposes complex tasks into iterative stages:
1. Observe: Understand the user's query and the current state of the world.
2. Plan: Determine the next actionable step.
3. Act: Execute a tool or action based on the plan.
This cycle repeats until the goal is achieved.
Example: A teddy bear is cold.
- Observe: User query "my teddy bear is cold." Identify the need to check room temperature.
- Plan: Determine the current room temperature.
- Act: Use a get_current_room_temperature tool.
- Observe: Temperature is 65°F (cold).
- Plan: Increase the temperature.
- Act: Use a set_thermostat tool with an increased temperature.
- Observe: Temperature is now at the desired level.
- Output: Inform the user that the temperature has been adjusted.
Multi-Agent Systems: Multiple agents can communicate and collaborate to achieve complex goals. The Agent-to-Agent Protocol aims to standardize this communication.
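
The teddy bear example above can be traced with a small observe-plan-act loop. This is a simulation only: the `Room` class, the 72°F comfort threshold, and the hard-coded `plan` function (which stands in for the LLM's reasoning step) are all assumptions made for illustration.

```python
class Room:
    """Simulated environment; the starting temperature matches the example."""
    def __init__(self, temperature_f=65):
        self.temperature_f = temperature_f

def get_current_room_temperature(room):
    return room.temperature_f

def set_thermostat(room, target_f):
    room.temperature_f = target_f
    return room.temperature_f

def plan(observed_temperature, comfortable_f=72):
    """Hard-coded policy standing in for the model's planning step."""
    if observed_temperature is None:
        return "get_current_room_temperature", {}
    if observed_temperature < comfortable_f:
        return "set_thermostat", {"target_f": comfortable_f}
    return "finish", {}

def run_agent(room, max_steps=5):
    tools = {
        "get_current_room_temperature": get_current_room_temperature,
        "set_thermostat": set_thermostat,
    }
    observation = None  # nothing observed yet
    trace = []
    for _ in range(max_steps):
        action, arguments = plan(observation)           # Plan
        trace.append(action)
        if action == "finish":
            return f"The room is now at {observation}°F.", trace
        observation = tools[action](room, **arguments)  # Act, then Observe
    return "Stopped: step limit reached.", trace

message, trace = run_agent(Room())
```

The step limit is a simple guard against the loop never reaching its goal.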

2.7 Safety Considerations for Tools and Agents

Risks: The ability of LLMs to execute actions introduces significant safety risks, such as data exfiltration (e.g., an email agent being prompted to send sensitive information).
Mitigation Strategies:
- Training Stage: Incorporating safety data and objectives during SFT and RL training (e.g., harmlessness components).
- Inference Safeguards: Using safety classifiers to monitor conversation history and predict the safety of LLM outputs.
Benchmarks: Resources like Agent-SafetyBench help evaluate the safety of LLM-based agents.
Real-World Incidents: The lecture references a large-scale cyberattack reported by Anthropic, underscoring how critical safety becomes as LLMs gain the ability to take actions.
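
An inference-time safeguard for the email example can be sketched as a check that runs before any tool call is executed. This is a rule-based stand-in for a learned safety classifier, not the mechanism the lecture describes in detail; the allowlisted domain, the sensitive-word list, and the tool signature are hypothetical.

```python
ALLOWED_EMAIL_DOMAINS = {"example.com"}              # hypothetical allowlist
SENSITIVE_MARKERS = ("password", "api_key", "ssn")   # illustrative markers only

def is_tool_call_safe(name, arguments):
    """Return (ok, reason) for a proposed tool call."""
    if name != "send_email":
        return True, "no rule applies"
    domain = arguments.get("to", "").rsplit("@", 1)[-1].lower()
    if domain not in ALLOWED_EMAIL_DOMAINS:
        return False, f"recipient domain {domain!r} is not allowlisted"
    body = arguments.get("body", "").lower()
    if any(marker in body for marker in SENSITIVE_MARKERS):
        return False, "body appears to contain sensitive data"
    return True, "ok"

def guarded_execute(name, arguments, tools):
    """Run the tool only if the safeguard approves the call."""
    ok, reason = is_tool_call_safe(name, arguments)
    if not ok:
        return {"blocked": True, "reason": reason}
    return {"blocked": False, "result": tools[name](**arguments)}

tools = {"send_email": lambda to, body: "sent"}
blocked = guarded_execute("send_email", {"to": "eve@attacker.test", "body": "hi"}, tools)
allowed = guarded_execute("send_email", {"to": "bob@example.com", "body": "lunch?"}, tools)
```

Placing the check between the model's proposed call and its execution means a manipulated model output still cannot send data outside the allowlist.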

2.8 Challenges and Advice

Divergence: A major challenge is the risk of LLMs diverging from the intended goal or making errors in tool argument prediction.
Developing Capabilities: Reasoning gaps can be addressed through SFT, but ideally, models should develop these capabilities intrinsically.
Building Tools/Agents:
- Start Small: Begin with simple cases and gradually increase complexity.
- Start Smart: Use the most capable models first to understand the potential headroom.
- Debuggability: Leverage the reasoning chains output by LLMs to identify and fix issues.
Favorite Use Case: AI-assisted coding, where LLMs can handle complex plumbing code and reduce mental load, but users must still understand code fundamentals to judge whether the output is correct.

This lecture provides a comprehensive overview of how LLMs can be extended beyond their internal knowledge to interact with the external world through RAG, tool calling, and agentic workflows, while also emphasizing the critical importance of safety and robust evaluation.