Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 8 - LLM Evaluation
Key Concepts
- Retrieval-Augmented Generation (RAG): A technique allowing LLMs to fetch information from external knowledge bases.
- Bi-encoder: Used for candidate retrieval in RAG; encodes query and documents separately, making it cheap to filter for potentially relevant candidates.
- Cross-encoder: Used for re-ranking in RAG; scores query and candidate jointly, making it more accurate but more expensive than a bi-encoder.
- Tool Calling: The ability of an LLM to select and use external tools with appropriate arguments.
- Agentic Workflows: Composed of RAG and tool calling, allowing multiple tool calls for data fetching and actions.
- ReAct Framework: A framework for agentic workflows, decomposing actions into Observe, Plan, and Act steps.
- LLM Evaluation: Quantifying the performance and output quality of LLMs.
- Inter-rater Agreement: Measuring consistency among human raters when evaluating subjective outputs.
- Cohen's Kappa, Fleiss' Kappa, Krippendorff's Alpha: Metrics to quantify inter-rater agreement relative to chance.
- Rule-based Metrics: Metrics that compare LLM outputs to fixed reference outputs.
- METEOR: A metric for evaluating translation, considering precision, recall, and word order.
- BLEU: A precision-focused metric for evaluating translation, using n-grams.
- ROUGE: A metric often used for summarization tasks.
- LLM-as-a-Judge: Using an LLM to evaluate the output of another LLM.
- Structured Output: A technique to ensure LLM outputs adhere to a specific format (e.g., JSON).
- Constraints-guided Decoding: A method to constrain the decoding process to produce outputs in a desired format.
- Position Bias: The tendency of an LLM judge to favor responses based on their presentation order.
- Verbosity Bias: The tendency of an LLM judge to favor longer, more detailed responses.
- Self-enhancement Bias: The tendency of an LLM judge to favor responses generated by itself.
- Factuality: The accuracy of information presented in an LLM's output.
- Agentic Workflow Evaluation: Assessing the performance of agents in executing tasks through tool use.
- Punt: An agent's failure to answer a query, often by stating it cannot perform the task.
- Tool Router/Selector: A component that filters potential tools for an agent.
- Tool Hallucination: An LLM calling a non-existent tool.
- Benchmarks: Standardized tests used to evaluate LLM capabilities across various domains.
- MMLU (Massive Multitask Language Understanding): A benchmark testing LLMs on diverse tasks with multiple-choice questions.
- AIME (American Invitational Mathematics Examination): A math competition benchmark for LLMs.
- PIQA (Physical Interaction Question Answering): A common sense reasoning benchmark.
- SWE-bench (Software Engineering Benchmark): A benchmark for evaluating LLMs on coding tasks.
- HarmBench: A benchmark for evaluating LLM safety.
- tau-bench: A benchmark for evaluating agentic workflows.
- Pass@k: A metric estimating the probability that at least one of k attempts succeeds.
- Pass^k ("Pass@k with a hat"): A metric estimating the probability that all k attempts succeed, emphasizing reliability.
- Data Contamination: The issue of LLMs being trained on benchmark data, invalidating evaluation results.
- Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure."
LLM Evaluation: Quantifying Performance and Output Quality
This lecture focuses on the critical aspect of evaluating Large Language Models (LLMs), emphasizing that understanding how to measure performance is essential for improvement. The discussion revisits previous topics like Retrieval-Augmented Generation (RAG) and tool calling, which form the basis of agentic workflows.
Recap of Previous Lectures
- Retrieval-Augmented Generation (RAG):
- Purpose: Enables LLMs to access external knowledge bases.
- Two-step process:
- Candidate Retrieval: Typically uses a bi-encoder setup (e.g., Sentence-BERT) to filter potential relevant documents for a query.
- Re-ranking: Employs more sophisticated cross-encoders to refine the retrieved candidates.
- Quantification of retrieval system performance was also discussed.
- Tool Calling:
- Allows LLMs to select and execute external tools with specific arguments based on user queries.
- Agentic Workflows:
- A combination of RAG and tool calling.
- Enables LLMs to make multiple tool calls to fetch data and perform actions.
- ReAct framework (Reasoning + Acting) is a common approach, decomposing tasks into Observe, Plan, and Act steps.
- AI-assisted coding is a successful application.
Defining LLM Evaluation
The lecture clarifies that "evaluation" in this context primarily refers to assessing the output quality of an LLM, rather than system-level metrics like latency or cost. This is challenging due to the free-form nature of LLM outputs, which can include natural language, code, and mathematical reasoning.
Human Ratings: The Ideal but Impractical Approach
- Ideal Scenario: Having humans rate every LLM output for quality.
- Challenges:
- Cost-intensive: Extremely expensive and time-consuming.
- Subjectivity and Inter-rater Agreement: Human judgments can be subjective. Metrics like agreement rate are used, but a simple agreement rate can be misleading. For instance, if raters have a 50% chance of agreeing randomly, a 50% agreement rate is not indicative of true alignment.
- Metrics for Agreement:
- Cohen's Kappa: Accounts for chance agreement, providing a positive score if observed agreement exceeds chance.
- Extensions like Fleiss' Kappa and Krippendorff's Alpha exist for multiple raters.
- Time and Cost: Rating thousands of outputs is impractical.
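To make the chance-correction concrete, here is a minimal sketch of Cohen's Kappa for two raters, computing kappa = (p_o - p_e) / (1 - p_e) where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The pass/fail labels are illustrative.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(ratings_a)
    labels = set(ratings_a) | set(ratings_b)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    p_e = sum((ratings_a.count(l) / n) * (ratings_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Two raters labeling 10 LLM outputs as pass/fail (toy data):
a = ["pass"] * 7 + ["fail"] * 3
b = ["pass", "pass", "pass", "pass", "pass", "fail", "fail", "pass", "fail", "fail"]
print(round(cohens_kappa(a, b), 3))  # 0.348: well above 0, but far from perfect
```

A raw agreement rate of 70% sounds decent, but with these label frequencies chance alone predicts 54%, so kappa is only about 0.35.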
Rule-based Metrics: Leveraging Fixed References
- Methodology: Humans provide fixed "reference" or "ideal" outputs for a set of prompts. LLM outputs are then compared against these references using predefined metrics.
- Advantage: Allows for iterative model improvement without constant human re-rating.
- Key Metrics:
- METEOR (Metric for Evaluation of Translation with Explicit Ordering):
- Calculates an F-score (harmonic mean of precision and recall) and a penalty for word order mismatches.
- Precision: Proportion of unigrams in the predicted sequence matching the reference.
- Recall: Proportion of unigrams in the reference matching the prediction.
- Penalty: Based on contiguous chunks matched, incentivizing better ordering.
- Limitations: Arbitrary hyperparameters, limited flexibility for stylistic variations, and reliance on unigram matching.
- BLEU (Bilingual Evaluation Understudy):
- A precision-focused metric using n-grams.
- Includes a brevity penalty to discourage overly short translations.
- ROUGE: Commonly used for summarization tasks.
- Limitations of Rule-based Metrics:
- Lack of Stylistic Variation Handling: Struggle with different phrasings of the same meaning.
- Correlation Issues: May not strongly correlate with human judgments.
- Still Requires Human Input: Initial reference outputs are needed.
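A stripped-down sketch of the METEOR-style core, assuming only unigram matching: the harmonic mean of precision and recall against a reference, with no stemming, synonym tables, or chunk-order penalty. This is not the full metric, just the F-score component described above.

```python
from collections import Counter

def unigram_f_score(prediction, reference):
    """Simplified METEOR-style score: harmonic mean of unigram precision
    and recall (omits stemming, synonyms, and the chunk penalty)."""
    pred, ref = prediction.split(), reference.split()
    # Clipped matches: each reference word can be matched at most once.
    matches = sum((Counter(pred) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(pred)  # matched fraction of the prediction
    recall = matches / len(ref)      # matched fraction of the reference
    return 2 * precision * recall / (precision + recall)

print(round(unigram_f_score("the cat sat on the mat",
                            "the cat is on the mat"), 3))  # 0.833
```

Note how brittle this is: "the cat rested on the mat" scores the same as "sat the on cat mat the", which is exactly the stylistic-variation and ordering weakness the lecture points out.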
LLM-as-a-Judge: Leveraging LLMs for Evaluation
- Concept: Utilizes a pre-trained LLM (tuned for human preference) to evaluate the output of another LLM.
- Input: Prompt, LLM response, and evaluation criteria.
- Output: A score (e.g., pass/fail) and a rationale explaining the judgment.
- Key Benefits:
- No Need for Reference Text: Leverages the LLM's inherent knowledge and understanding of human preferences.
- Interpretability: The rationale provides insight into the scoring.
- Prompting Strategy: It's often beneficial to ask the LLM to output the rationale before the score, similar to chain-of-thought reasoning, to improve performance.
- Ensuring Structured Output: To guarantee a parsable rationale and score, techniques like constraints-guided decoding (also known as structured output) are used, often by specifying a desired format like JSON.
- Types of LLM-as-a-Judge Setups:
- Single Response Evaluation: Judging one response as good or bad.
- Pairwise Comparison: Determining which of two responses is better. This can be used to synthetically generate preference data for training reward models.
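A minimal single-response judge sketch, assuming a generic `call_llm(prompt) -> str` interface (hypothetical; substitute your provider's client, ideally with low temperature and constrained decoding so the JSON parse cannot fail). Note the prompt asks for the rationale before the score, per the chain-of-thought ordering described above.

```python
import json

JUDGE_PROMPT = """You are an evaluator. Criteria: the response must be
factually accurate and directly answer the question.
Question: {question}
Response: {response}
Output JSON with keys in this order:
"rationale" (your reasoning first, before any verdict),
"score" ("pass" or "fail").
"""

def judge(question, response, call_llm):
    """Run a pass/fail LLM-as-a-Judge check and return (passed, rationale)."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    verdict = json.loads(raw)  # safe if the model uses structured output
    return verdict["score"] == "pass", verdict["rationale"]

# Stubbed model call for illustration (a real call_llm would hit an API):
stub = lambda prompt: '{"rationale": "Answer matches the source.", "score": "pass"}'
ok, why = judge("What year was BERT released?", "2018", stub)
print(ok, why)  # True Answer matches the source.
```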
Potential Biases in LLM-as-a-Judge
- Position Bias: The judge may favor responses presented earlier.
- Remedy: Presenting responses in different orders and using majority voting.
- Verbosity Bias: The judge may favor longer, more detailed responses.
- Remedies: Explicitly stating in guidelines to avoid this bias, providing in-context learning examples, or penalizing output length.
- Self-enhancement Bias: The judge may favor responses generated by itself.
- Remedy: Using a different, ideally larger and more capable, LLM for judging than for generation.
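The position-bias remedy can be sketched as follows: present the same pair in both orders and majority-vote, so a judge that simply prefers whichever response comes first ends up producing a tie rather than a spurious winner. `judge_fn(prompt, first, second)` returning `"first"` or `"second"` is an assumed interface for a pairwise judge.

```python
def pairwise_judge_debiased(prompt, resp_a, resp_b, judge_fn, n_rounds=4):
    """Mitigate position bias by alternating presentation order and
    majority-voting over n_rounds pairwise judgments."""
    votes_a = 0
    for i in range(n_rounds):
        if i % 2 == 0:  # original order: a shown first
            votes_a += judge_fn(prompt, resp_a, resp_b) == "first"
        else:           # swapped order: a shown second
            votes_a += judge_fn(prompt, resp_b, resp_a) == "second"
    if votes_a * 2 == n_rounds:
        return "tie"
    return "a" if votes_a * 2 > n_rounds else "b"

# A judge with pure position bias (always picks whatever is shown first)
# is neutralized into a tie instead of crowning a fake winner:
biased = lambda p, first, second: "first"
print(pairwise_judge_debiased("q", "resp1", "resp2", biased))  # tie
```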
Best Practices for LLM-as-a-Judge
- Crisp Guidelines: Provide explicit criteria for evaluation.
- Binary Scale: Prefer binary scales (pass/fail) over granular ones for simpler judgment and better alignment with human preference.
- Rationale Before Score: Output the rationale first to improve the judge's reasoning process.
- Mitigate Biases: Employ remedies for identified biases (position, verbosity, self-enhancement).
- Calibrate with Human Ratings: Compare LLM-as-a-Judge scores with human ratings to ensure alignment and identify areas for prompt improvement.
- Low Temperature: Use a low temperature (e.g., 0.1-0.2) for reproducible evaluation experiments.
Dimensions of LLM Output Measurement
- Task Performance: Usefulness, factuality, relevance.
- Response Alignment: Tone, style, safety (absence of unsafe elements).
- Factuality Assessment:
- Process:
- Fact Extraction: Convert the text into a list of individual facts.
- Fact Checking: Verify each fact using techniques like RAG or web search.
- Aggregation: Combine fact-checking results, potentially with weights for fact importance, to derive an overall factuality score.
- Nuance: Factuality can be nuanced, with varying degrees of error.
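The aggregation step above can be sketched as a weighted average over per-fact checks. Here `check_fn` stands in for a real verifier (e.g. a RAG lookup or web search); the facts, weights, and the dictionary-backed checker are all illustrative.

```python
def factuality_score(facts, check_fn, weights=None):
    """Aggregate per-fact verification results into one factuality score.
    `facts` is the output of the fact-extraction step; `check_fn(fact)`
    returns True if the fact is verified (assumed interface); `weights`
    optionally encodes each fact's importance."""
    weights = weights or [1.0] * len(facts)
    verified = sum(w for f, w in zip(facts, weights) if check_fn(f))
    return verified / sum(weights)

facts = ["BERT was released in 2018",
         "BERT-base has 12 layers",
         "BERT was made by OpenAI"]          # the last fact is false
kb = {facts[0]: True, facts[1]: True, facts[2]: False}  # stand-in checker
print(factuality_score(facts, kb.get, weights=[2.0, 1.0, 2.0]))  # 0.6
```

Weighting lets a core factual error (who made the model) drag the score down more than a minor one, matching the "varying degrees of error" nuance above.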
Evaluating Agentic Workflows
- Focus: Assessing the performance of agents in executing tasks through tool use.
- Failure Modes in Tool Prediction:
- Not using a tool when needed (Punt): The agent fails to answer or states it cannot.
- Cause: Issues with the tool router/selector (recall error).
- Remedy: Adjust the tool router or fine-tune the model's ability to recognize tool usage.
- Tool Hallucination: The agent calls a non-existent tool.
- Cause: Weak model, poorly defined APIs, or unclear overarching instructions.
- Remedy: Upgrade the model, refine API names and descriptions, clarify instructions.
- Using the wrong tool: The agent selects an inappropriate tool.
- Cause: Issues with the tool router or conflicting API scopes.
- Remedy: Improve tool router recall, clarify API definitions and scopes.
- Incorrect arguments: The agent uses the right tool but with wrong parameters.
- Cause: Missing context (e.g., location), lack of understanding of argument requirements.
- Remedy: Ensure context carries necessary information, introduce location finder tools, retrain the model on tool usage, or rewrite APIs.
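Two of these prediction-time failure modes can be caught mechanically before execution by validating a proposed call against a registry of known tools. The registry contents here are hypothetical.

```python
TOOL_REGISTRY = {
    "get_weather": {"location"},           # hypothetical tools mapped to
    "book_flight": {"origin", "dest"},     # their required argument names
}

def validate_tool_call(name, args):
    """Catch tool hallucination (unknown tool name) and incorrect
    arguments (missing required parameters) before executing the call."""
    if name not in TOOL_REGISTRY:
        return f"hallucinated tool: {name}"
    missing = TOOL_REGISTRY[name] - set(args)
    if missing:
        return f"missing arguments: {sorted(missing)}"
    return "ok"

print(validate_tool_call("get_wether", {"location": "Paris"}))  # hallucination
print(validate_tool_call("book_flight", {"origin": "SFO"}))     # bad arguments
```

Validation of this kind only detects the mechanical failures; choosing the *wrong* (but real) tool or the wrong argument *values* still requires the remedies above.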
- Failure Modes in Tool Execution:
- Tool returns an error: The tool's internal logic fails.
- Remedy: Fix the tool implementation. Errors can sometimes be informative if conveyed meaningfully.
- Tool returns no response: The tool fails to output anything.
- Remedy: Ensure tools always return meaningful outputs, even an empty JSON, to inform the agent.
- Failure Modes in Output Synthesis:
- Agent fails to synthesize output: The agent cannot interpret or present the tool's output meaningfully.
- Cause: Model lacks grounding, output is too verbose or not presented in a structured way.
- Remedy: Upgrade the model, simplify tool outputs, use structured data formats (e.g., classes in Python).
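The "structured data formats" remedy can be as simple as returning a dataclass instead of a free-text blob, so the agent receives named, typed fields to ground its answer in. The weather tool here is a hypothetical stand-in.

```python
from dataclasses import dataclass

@dataclass
class WeatherResult:
    """Structured tool output: named fields are easier for the agent to
    synthesize from than a verbose unstructured string."""
    city: str
    temp_c: float
    condition: str

def get_weather(city: str) -> WeatherResult:
    # Stand-in for a real API call (illustrative values).
    return WeatherResult(city=city, temp_c=18.5, condition="cloudy")

r = get_weather("Paris")
print(f"{r.city}: {r.temp_c}°C, {r.condition}")  # Paris: 18.5°C, cloudy
```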
Common Trends in Agentic Workflow Failures
- Modeling: Improving the model's reasoning and grounding capabilities.
- Context Relevance: Enhancing the relevance of information in the context window.
- API Modeling: Refining tool API names and descriptions, via SFT or prompting.
- Tool Implementation: Fixing bugs or improving the logic of the tools themselves.
- Methodical Error Categorization: Organizing and addressing errors systematically.
LLM Benchmarks: Standardized Evaluation
Benchmarks provide a standardized way to compare LLM performance.
- Knowledge-based Benchmarks:
- Purpose: Test the LLM's ability to recall facts.
- Example: MMLU (Massive Multitask Language Understanding)
- Covers 57 diverse subjects (everyday life, law, medicine, etc.).
- Uses a multiple-choice format (question + 4 options) for standardized evaluation.
- Focuses on knowledge retention from pretraining.
- Reasoning Benchmarks:
- Purpose: Assess the LLM's ability to infer responses through thought processes.
- Examples:
- Math: AIME (American Invitational Mathematics Examination), a difficult math exam whose answers are integers from 0 to 999, giving a constrained answer format.
- Common Sense Reasoning: PIQA (Physical Interaction Question Answering), which uses two-option choices and tests understanding of everyday physical interactions.
- Coding Benchmarks:
- Purpose: Evaluate LLMs on solving complex coding problems.
- Applications: AI-assisted coding, agentic tool use.
- Example: SWE-bench (Software Engineering Benchmark), which uses GitHub issues and pull requests with tests to assess code generation and bug fixing capabilities.
- Safety Benchmarks:
- Purpose: Evaluate LLMs for harmful behavior, copyright infringement, etc.
- Challenges: Safety policies are provider-specific, making cross-model comparison difficult. Benchmarks need to align with provider policies.
- Example: HarmBench, which categorizes harmful behavior (standard, copyright, contextual, multimodal) and uses classifiers to assess attacks, even if unsuccessful.
- Agentic Workflow Benchmarks:
- Purpose: Measure the performance of agents in interactive scenarios.
- Example: tau-bench (Tool-Agent-User benchmark)
- Simulates user-agent interactions in domains like airline and retail.
- Uses a separate LLM to play the role of the user.
- Evaluates success based on database changes and task completion.
- Employs Pass@k (probability that at least one of k attempts succeeds) and Pass^k ("Pass@k with a hat", probability that all k attempts succeed) to measure capability versus reliability.
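The two metrics can be estimated combinatorially from n sampled trials of which c succeeded: Pass@k uses the standard unbiased estimator 1 - C(n-c, k)/C(n, k), while Pass^k is C(c, k)/C(n, k), the probability that k draws without replacement are all successes. A quick sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples is correct),
    given c correct out of n total trials (unbiased estimator)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Estimate P(all k samples are correct): the reliability-focused
    Pass^k used for agentic benchmarks like tau-bench."""
    return comb(c, k) / comb(n, k)

# 10 trials, 6 successes: one lucky hit in 3 tries is very likely,
# but going 3-for-3 is not -- capability vs. reliability.
print(round(pass_at_k(10, 6, 3), 3))   # 0.967
print(round(pass_hat_k(10, 6, 3), 3))  # 0.167
```

The gap between the two numbers is the point: an agent can look strong on Pass@k while being far too unreliable to deploy, which is what Pass^k exposes.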
Data Contamination and Benchmark Limitations
- Data Contamination: LLMs may have been trained on benchmark data, invalidating results. Techniques like hashing and blocklists are used to mitigate this.
- Goodhart's Law: Benchmarks should be viewed critically, as they can be gamed if they become the sole target.
- Real-world Performance: Benchmark results are indicative but not always representative of real-world performance. User experience and testing on specific use cases are crucial.
Conclusion
Evaluating LLMs is a multifaceted challenge. While human ratings are ideal, practical constraints lead to the use of rule-based metrics and, more recently, LLM-as-a-Judge approaches. Understanding potential biases in LLM evaluation and employing best practices are crucial for obtaining reliable assessments. Benchmarks provide standardized measures across various capabilities, but their limitations, particularly data contamination and the potential for gaming, must be considered. Ultimately, a combination of benchmark performance, user experience, and specific use-case testing is necessary to determine the true effectiveness of an LLM.