Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 12: Evaluation

Key Concepts

  • Evaluation: The process of converting abstract constructs (e.g., "reasoning," "helpfulness") into concrete metrics.
  • Perplexity: The exponentiated average negative log-likelihood per token, i.e., a measure of how well a model's probability distribution predicts a sample; the core metric for base language models.
  • Scaling Laws: The observation that model performance (often measured by perplexity) improves predictably with increased compute, data, and parameters.
  • In-Distribution vs. Out-of-Distribution (OOD): Evaluating on data from the same source as training vs. evaluating on unseen, distinct benchmarks.
  • Chain of Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before providing a final answer.
  • Agentic Benchmarks: Evaluations for systems that use tools, scaffolds, and environments (e.g., coding, terminal access) rather than just text generation.
  • Ecological Validity: The extent to which an evaluation reflects real-world usage.
  • Data Contamination: The risk that test data was inadvertently included in the model's training set.

1. The Philosophy of Evaluation

Evaluation is not merely a mechanical process; it is the "North Star" that shapes AI development. The core challenge is the Abstract-to-Concrete Mapping: defining what "good" means (e.g., intelligence, cost-efficiency, user preference) and creating a metric that captures it.

  • Economic Lens: Platforms like OpenRouter track usage statistics, assuming that if users pay for a model, it is "good."
  • Preference Lens: Chatbot Arena (LMSYS) uses Elo ratings based on human pairwise comparisons to rank models.
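
To make the preference lens concrete, here is a minimal sketch of the classic Elo update applied to pairwise votes. The K-factor of 32, the starting rating of 1000, and the model names are illustrative assumptions; the leaderboard's actual fitting procedure may differ from this online update.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (logistic, base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k is an assumed step size, not a value taken from the lecture.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical models and votes: (model A, model B, score for A).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 0.0),
]
for a, b, score_a in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
```

Each update moves the two ratings by equal and opposite amounts, so the total rating mass is conserved, and an upset (a low-rated model beating a high-rated one) moves the ratings more than an expected result.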

2. Perplexity and Base Model Evaluation

Perplexity remains the fundamental metric for language models because it measures how well a model captures the true distribution of language.

  • The "Perplexity is All You Need" Argument: By minimizing log loss/perplexity, a model theoretically approaches the true distribution of data, which would allow it to solve any task (e.g., Q&A, reasoning).
  • Limitations: Perplexity penalizes all tokens equally, even those that are trivial or incidental. Conditional Perplexity is a refinement that focuses on relevant tokens (e.g., the answer to a question).
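  • Trust Issues: Downstream task accuracy can be checked from generated outputs alone, whereas perplexity requires trusting that the model's reported probabilities form a valid distribution (nonnegative and summing to one over the vocabulary).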
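
As a worked illustration of the points above, the sketch below computes perplexity as exp(-(1/N) Σ log p(token_i | previous tokens)), a conditional variant restricted to answer tokens, and a basic validity check on a distribution. The function names and the toy log-probabilities are assumptions for illustration; a real evaluation would read per-token log-probabilities from the model.

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def conditional_perplexity(token_logprobs, answer_mask):
    """Perplexity over only the tokens flagged True in answer_mask."""
    selected = [lp for lp, keep in zip(token_logprobs, answer_mask) if keep]
    return perplexity(selected)


def is_valid_distribution(probs, tol=1e-6):
    """Check that probabilities are nonnegative and sum to one (within tol)."""
    return all(p >= 0 for p in probs) and abs(sum(probs) - 1.0) <= tol


# Toy example: two question tokens followed by two answer tokens.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
mask = [False, False, True, True]

full_ppl = perplexity(logprobs)                      # all four tokens
answer_ppl = conditional_perplexity(logprobs, mask)  # answer tokens only
```

Here the full perplexity averages over question and answer tokens alike, while the conditional version scores only the answer, which is the refinement described in the Limitations bullet.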

3. Exam-Based Benchmarks

Exams provide controlled, unambiguous environments for testing knowledge and reasoning.

  • MMLU (Massive Multitask Language Understanding): Tested general knowledge across 57 subjects. It was radical at the time for using few-shot prompting to test general-purpose capabilities.
  • GPQA (Graduate-Level Google-Proof Q&A): Designed so that non-experts cannot answer the questions even with Google access. Its "Diamond" subset consists of questions validated by PhD-level domain experts.
  • Humanity’s Last Exam (HLE): A highly difficult, multimodal benchmark designed to be "the last" exam, though models continue to improve on it.
  • The Saturation Cycle: A recurring pattern exists where a benchmark is created, models perform poorly, models improve, the benchmark is saturated, and a harder one is created.
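
The few-shot, multiple-choice format used by exam benchmarks can be sketched as follows: a handful of solved examples are prepended to the test question, and the model's first output letter is compared with the gold answer. The prompt template, the helper names, and the `predict` callable are hypothetical stand-ins, not the official format of any benchmark named above.

```python
LETTERS = "ABCD"


def format_question(question, choices, answer=None):
    """Render one multiple-choice item; include the answer only for solved shots."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)


def build_few_shot_prompt(shots, test_item):
    """shots: list of (question, choices, answer); test_item: (question, choices)."""
    blocks = [format_question(q, c, a) for q, c, a in shots]
    blocks.append(format_question(test_item[0], test_item[1]))
    return "\n\n".join(blocks)


def accuracy(predict, shots, test_items):
    """predict: callable mapping a prompt string to the model's text output."""
    correct = 0
    for question, choices, gold in test_items:
        prompt = build_few_shot_prompt(shots, (question, choices))
        pred = predict(prompt).strip()[:1].upper()  # first non-space character
        correct += pred == gold
    return correct / len(test_items)


# Toy data and a dummy "model" that always answers A.
demo_shots = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
demo_items = [
    ("Capital of France?", ["Paris", "Rome", "Oslo", "Bern"], "A"),
    ("3 * 3 = ?", ["6", "7", "9", "12"], "C"),
]
demo_accuracy = accuracy(lambda prompt: " A", demo_shots, demo_items)
```

A constant-answer baseline like the dummy predictor is a useful sanity check: on a balanced four-choice benchmark it should score near 25%, so reported accuracies are best read against that floor.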

4. Chat and Agentic Benchmarks

  • Chat Benchmarks: Since open-ended questions lack ground truth, LLM-as-a-Judge (e.g., AlpacaEval) is used. These often require debiasing, as judges tend to favor longer responses.
  • Agentic Benchmarks: These evaluate the model plus its scaffold (the logic for tool use).
    • SWE-bench: Evaluates coding ability by requiring models to fix GitHub issues and pass unit tests.
    • Terminal-Bench: Uses a computer terminal environment to test general-purpose task execution.
    • Cybersecurity (CTF): Tests an agent's ability to hack into a server to retrieve a "flag."
  • Scaffold Importance: For agents, explicit planning (to-do lists), hierarchical delegation, and memory management (reading/writing files) are as critical as the underlying model.
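
Returning to the judge bias noted under Chat Benchmarks, one simple diagnostic is to measure how often the longer response wins a pairwise judgment. This is only a sketch of a bias check, not the debiasing procedure used by any particular benchmark; the record layout `(len_a, len_b, winner)` and the function name are assumptions.

```python
def longer_response_win_rate(judgments):
    """Fraction of pairwise judgments won by the longer response.

    judgments: iterable of (len_a, len_b, winner) with winner in {"a", "b"}.
    Pairs with equal lengths are skipped because neither response is longer.
    """
    decided = 0
    longer_wins = 0
    for len_a, len_b, winner in judgments:
        if len_a == len_b:
            continue
        decided += 1
        longer = "a" if len_a > len_b else "b"
        longer_wins += winner == longer
    if decided == 0:
        raise ValueError("no pairs with differing lengths")
    return longer_wins / decided


# Hypothetical judge outputs: response lengths in tokens and the judged winner.
demo_judgments = [
    (100, 50, "a"),  # longer response (a) wins
    (30, 80, "b"),   # longer response (b) wins
    (40, 90, "a"),   # shorter response (a) wins
    (10, 10, "a"),   # equal lengths, skipped
]
demo_rate = longer_response_win_rate(demo_judgments)
```

A rate far above 0.5 does not prove bias by itself, since longer answers may genuinely be better, but it flags a judge whose verdicts should be audited or length-controlled.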

5. Pure Reasoning and Safety

  • ARC-AGI: Attempts to isolate "fluid intelligence" from learned knowledge by using visual puzzles that require reasoning rather than fact retrieval.
  • Safety Benchmarks:
    • HarmBench: Tests refusal capabilities against harmful prompts.
    • Jailbreaking: The use of adversarial prompts (e.g., GCG - Greedy Coordinate Gradient) to bypass safety filters.
    • Dual-Use: A significant challenge where the same agentic capabilities (e.g., penetration testing) can be used for both security and harm.

6. Methodological Challenges

  • Data Contamination: It is often impossible to know if a model was trained on the test set. Mitigation strategies include:
    • Fresh Evals: Creating benchmarks from data published after the model's training cutoff.
    • Private Evals: Using internal, non-public data for testing.
    • Statistical Inference: Checking for patterns that suggest the model has "memorized" the test set.
  • Auditability: The speaker emphasizes that quantitative metrics are insufficient on their own. Developers must also perform qualitative audits, reading the actual model outputs, to confirm that the benchmark measures what it claims to measure.
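
The summary does not say which statistical test is meant under Statistical Inference, so the following is only one possible sketch: a permutation test on example ordering. The idea is that a model that has never seen a benchmark should assign roughly the same likelihood to its examples in any order, while a model that memorized the published file may prefer the canonical order. The `log_likelihood` callable and the toy scorer are hypothetical stand-ins for a real model.

```python
import random


def ordering_p_value(examples, log_likelihood, num_shuffles=200, seed=0):
    """One-sided permutation p-value for 'canonical order is unusually likely'.

    examples: benchmark items in their published (canonical) order.
    log_likelihood: callable scoring a list of examples concatenated in order.
    """
    rng = random.Random(seed)
    canonical = log_likelihood(list(examples))
    at_least_as_high = 0
    for _ in range(num_shuffles):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if log_likelihood(shuffled) >= canonical:
            at_least_as_high += 1
    # Add-one correction keeps the p-value strictly positive.
    return (at_least_as_high + 1) / (num_shuffles + 1)


def toy_memorizing_scorer(canonical):
    """Toy stand-in for a contaminated model: rewards canonical adjacent pairs."""
    index = {ex: i for i, ex in enumerate(canonical)}

    def score(seq):
        return float(sum(index[b] == index[a] + 1 for a, b in zip(seq, seq[1:])))

    return score


demo_examples = [f"q{i}" for i in range(10)]
p_contaminated = ordering_p_value(demo_examples, toy_memorizing_scorer(demo_examples))
p_clean = ordering_p_value(demo_examples, lambda seq: 0.0)  # order-insensitive scorer
```

A small p-value is evidence of order sensitivity consistent with memorization, while a large one is merely inconclusive; either way, the result complements rather than replaces the qualitative audit described above.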

Synthesis

There is no "one true evaluation." The choice of benchmark depends entirely on the goal: business decision-making, scientific research, or safety assurance. As models evolve, the field is shifting from simple text-based perplexity to complex, agentic, and ecologically valid environments. The most robust approach involves a combination of diverse benchmarks, clear rubrics, and constant vigilance against data contamination and metric gaming.
