Stanford Webinar - Making GenAI Useful: Lessons from Research and Deployment
Key Concepts
- Base Models: Raw, pre-trained language models trained on next token prediction using vast language corpora. Possess latent capabilities but are challenging to use directly.
- Post-Training/Alignment: The process of aligning language models with human preferences, eliciting specific capabilities, and improving usability. Often involves supervised fine-tuning (SFT) and reinforcement learning (RL) algorithms.
- Emergent Properties: Unexpected capabilities arising from the scale and complexity of pre-training, making it difficult to predict and plan for specific model behaviors.
- Internal Evals: Evaluation metrics and datasets created internally, focusing on usability and production use cases, rather than solely relying on external benchmarks.
- Steerability: The ability to control and guide model behavior through instructions, prompts, and system design, enabling developers to tailor model outputs to specific needs.
- Truthfulness/Groundedness: The extent to which model outputs are faithful to underlying documents, sources, or real-world facts, balanced against the need for creativity and brainstorming.
- Model Spec: An open-sourced document outlining the intended behavior and values of a language model, serving as a guide for development and a point of discussion for community feedback.
- Few-Shot Examples: Providing a small number of input-output pairs in the prompt to guide the model's behavior and improve its ability to follow instructions, especially for customized or unusual requests.
- Capabilities Overhang: The gap between the current capabilities of language models (e.g., GPT-4) and the extent to which these capabilities have been fully utilized and integrated into real-world applications.
- Instruction Following: The ability of a model to accurately and completely adhere to instructions given in a prompt, including various aspects such as format, negative constraints, and content guidelines.
- Interpretability: The degree to which the internal workings and reasoning processes of a language model can be understood and explained.
Main Topics and Key Points
Base Models vs. Post-Training
- Chris: People underestimate the capabilities of base models, particularly for creative tasks.
- Base models may require more context engineering but can yield surprising results.
- Continued pre-training and supervised fine-tuning (SFT) can effectively surface latent capabilities in base models.
- Michelle: Base models are the "absolute most raw form of the intelligence" after pre-training.
- Post-training aligns models with human preferences, making them more usable, as demonstrated by ChatGPT.
- Base models are hard to use but powerful for users who can master them.
- Example: Base models, when asked "How do I ride a bike?" may respond with unhelpful completions such as "How do I drive a car?"
Misunderstandings about Language Model Behavior
- Michelle: People mistakenly believe model behavior can be precisely specified and controlled like other products.
- Model capabilities are emergent properties of scale, making it difficult to predict outcomes.
- Example: GPT-4's unexpected coding proficiency emerged during pre-training.
Improving Model Intuition and Helpfulness
- Chris: The system surrounding the model (software, sampling methods) significantly impacts user experience.
- A well-designed system can compensate for a relatively weaker model, while a poorly designed system can undermine a powerful model.
- Example: A model that fails to produce well-formed code or JSON due to poor sampling leads to a poor user experience, even if the model itself is capable.
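The point about the surrounding system can be sketched as a small wrapper that validates model output before it reaches the user. This is a minimal illustration, not a description of any specific product; `call_model` is a hypothetical stand-in for a real model API call.

```python
import json

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model API call."""
    return '{"name": "widget", "qty": 3}'

def get_json(prompt: str, max_retries: int = 3) -> dict:
    """Wrap a model call so malformed JSON never reaches the user.

    On a parse failure, retry with an explicit repair instruction
    instead of surfacing the error -- the surrounding system
    compensates for sampling noise from an otherwise capable model.
    """
    last_error = None
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
            prompt = (f"{prompt}\n\nYour last reply was not valid JSON "
                      f"({err}). Reply with JSON only.")
    raise ValueError(f"No valid JSON after {max_retries} attempts: {last_error}")
```

A production system might instead use constrained decoding or schema-aware sampling, but the principle is the same: the software around the model determines whether raw capability turns into a good user experience.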
Balancing Generalization and Truthfulness
- Chris: Truthfulness should be defined as grounding in specific claims or documents, rather than abstract notions of truth.
- The desired level of faithfulness depends on the context, ranging from information retrieval to creative brainstorming.
- Users should be educated about how model behaviors vary across different tasks.
- Michelle: Models should be helpful, usable, and steerable, balancing answering questions with grounding responses in reality.
- OpenAI is researching models augmented with tools to verify truthfulness during training.
- GPT-4.1 and later models are tailored for steerability, allowing developers to control ungrounded inferences and source citations.
- Example: Making sources visible in the UI helps users verify information and avoid complacency as models improve.
Encoding Values and Addressing Bias
- Chris: Shaping model behavior requires a collective effort spanning data curation, architecture, pre-training, and post-training.
- Michelle: Creating a bias-free model is impossible due to differing human priors.
- Steerability allows users to encode their own preferences and adapt model behavior.
- The "model spec" outlines intended behavior and is open-sourced for community feedback.
- The model spec evolves as societal values change.
- Chris: Cleansing training data of problematic content (e.g., swear words) can lead to naive models. Models need to acquire social taboos through exposure and instruction.
Unexplored Model Capabilities
- Michelle: Enterprise and support use cases (e.g., customer service) are underutilized due to challenges in building reliable surrounding systems.
- Challenges include implementing access controls, integrating context, and establishing evaluation methods.
- Chris: AI-powered tutoring and legal representation can offer high-quality experiences to those who cannot afford them.
- GenAI tools can empower individuals to be more creative and productive, similar to how TikTok enabled video editing.
Managing Model Evolution and Redundancy
- Michelle: OpenAI's API aims to distribute the benefits of AGI widely and enable developers to build niche applications.
- OpenAI invests in core platform primitives (e.g., vector stores, structured outputs) to simplify application development.
- Startups should focus on specific verticals and niches, where platform and model advances will make their lives easier.
- Chris: Model improvements are outpacing the ability of developers to leverage them effectively.
- Focus should be on expressing requirements and building systems around the model.
Improving GenAI Product Usefulness
- Michelle: Models should be able to explain themselves and teach users how to maximize their utility.
- Successful AI startups prioritize evals to understand their use cases and evaluate system performance.
- Chris: Start small with evals, even a dozen cases are better than none.
- Software engineers can write unit tests to evaluate model behavior.
- Include few-shot examples in prompts to guide model behavior and address edge cases.
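The "start small with evals" advice above can be sketched as a unit-test-style harness: a handful of input/check pairs plus a few-shot prompt builder. Everything here is illustrative (the date-formatting task and case list are made up), but the shape matches what the speakers describe.

```python
def build_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt: instruction, then input/output
    pairs to guide behavior and cover edge cases, then the new query."""
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"

# Even a dozen cases is better than none: each case is just (input, check).
eval_cases = [
    ("2024-01-05", lambda out: out.strip() == "05/01/2024"),
    ("1999-12-31", lambda out: out.strip() == "31/12/1999"),
]

def run_evals(model_fn, instruction, examples):
    """Return the pass rate of model_fn over the eval cases."""
    passed = sum(
        1 for query, check in eval_cases
        if check(model_fn(build_prompt(instruction, examples, query)))
    )
    return passed / len(eval_cases)
```

Run during development exactly like a unit-test suite, so regressions in model or prompt changes are caught immediately.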
Best Practices in Evaluation
- Chris: Anything is better than nothing. Use judges and ensure they are well-behaved.
- Aggregate information from judges.
- Michelle: LLM judges can be effective if criteria are split out and evaluated separately.
- Evaluate the judge itself to ensure its reliability.
- Chris: Make heavy use of synthetic data to supplement human-created data.
- Different models and states can be used to generate diverse inputs and test system robustness.
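"Evaluate the judge itself" can be made concrete by checking the judge against a small hand-labeled sample before trusting it at scale. A minimal sketch, assuming labels are simple pass/fail strings:

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of items where the LLM judge matches a human label.

    Compute this on a small hand-labeled sample; low agreement
    means the judge itself needs work before its verdicts can be
    aggregated at scale.
    """
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must align")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```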
Overlooked Aspects of Work
- Michelle: The complexity of instruction following is often overlooked.
- It encompasses format, negative constraints, and content guidelines.
- Chris: The interpretability of the best models is underestimated.
- Models exhibit systematic, human-interpretable structure, which explains their generalization abilities.
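The instruction-following categories above (format, negative constraints, content) lend themselves to cheap programmatic checks. The specific rules below are invented for illustration and are not OpenAI's actual evals; they assume a prompt that asked for a numbered list, forbade the word "obviously", and required the topic "evals" to appear.

```python
import re

def check_instruction_following(output: str) -> dict:
    """Score one response against three instruction-following
    categories (illustrative rules, assumed for this example)."""
    return {
        "format": bool(re.match(r"\s*1[.)]", output)),   # starts as a numbered list
        "negative": "obviously" not in output.lower(),   # forbidden word absent
        "content": "evals" in output.lower(),            # required topic present
    }
```

Splitting the categories out like this makes failures diagnosable: a model can be perfect on content while still violating a negative constraint.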
Step-by-Step Processes, Methodologies, or Frameworks Explained
- Model Post-Training:
- Pre-train a base model on a vast corpus of language using next token prediction.
- Gather feedback from users and developers on desired model behavior and capabilities.
- Develop internal evals based on production use cases and usability metrics.
- Use supervised fine-tuning (SFT) and reinforcement learning (RL) algorithms to align the model with human preferences and elicit specific capabilities.
- Iteratively test and refine the model through alpha testing and feedback loops with developers.
- Building Effective Evals:
- Define the specific use case and desired behavior of the AI system.
- Create a set of prompts that exercise real flows in the product.
- Develop a method for grading the model's output, either manually or using an LLM judge.
- If using an LLM judge, evaluate the judge's performance to ensure its reliability.
- Continuously run evals during development to track progress and identify areas for improvement.
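The steps above can be sketched as a small eval runner that grades each criterion separately and aggregates per-criterion pass rates, following the advice to split judge criteria out rather than asking for one overall score. `system_fn` and `judge_fn` are hypothetical hooks for the system under test and the grader.

```python
def run_eval_suite(prompts, system_fn, judge_fn, criteria):
    """Run each prompt through the system, grade every criterion
    separately, and return per-criterion pass rates.

    judge_fn(output, criterion) -> bool is a hypothetical grader;
    in practice it might be an LLM judge that has itself been
    evaluated for reliability.
    """
    totals = {c: 0 for c in criteria}
    for prompt in prompts:
        output = system_fn(prompt)
        for criterion in criteria:
            totals[criterion] += judge_fn(output, criterion)
    return {c: totals[c] / len(prompts) for c in criteria}
```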
Key Arguments or Perspectives Presented
- Base models are more capable than commonly perceived: They possess latent abilities that can be unlocked through context engineering and post-training techniques.
- The system surrounding the model is crucial for user experience: Well-designed software and sampling methods can compensate for a weaker model, while poor design can undermine a powerful one.
- Balancing truthfulness and creativity is context-dependent: The desired level of faithfulness depends on the specific task, ranging from information retrieval to brainstorming.
- Addressing bias requires a holistic approach: It involves careful data curation, architecture design, and ongoing monitoring and refinement.
- Few-shot examples are essential for guiding model behavior: Providing input-output pairs in the prompt can significantly improve the model's ability to follow instructions and address edge cases.
- Interpretability is increasing in advanced models: They exhibit systematic, human-interpretable structures that explain their generalization abilities.
Notable Quotes or Significant Statements with Proper Attribution
- Chris: "People might underestimate the capabilities of base models, models that have not been post-trained."
- Michelle: "ChatGPT was really kind of a magical moment because it was the first time, you know, these models were aligned and easy to talk to."
- Chris: "I think you will not get the right behavior from these models if you sort of cleanse your training data pipeline of all the things you're worried about."
- Michelle: "We think these things shouldn't be developed you know in isolation and without feedback from all of our stakeholders which is everyone using these models."
- Michelle: "I think the most successful AI startups I see all have one thing in common, which is eval."
- Chris: "Even a dozen cases is better than no cases."
- Michelle: "There's a place for human data and there's a place for synthetic data. And when you're building your startup, you should just move as quickly as possible and that's likely going to be synthetic data."
- Chris: "We do a lot of work in my group on interpretability for models and I think currently people are underestimating just how interpretable the best models are."
Technical Terms, Concepts, or Specialized Vocabulary with Brief Explanations
See "Key Concepts" Section above.
Logical Connections Between Different Sections and Ideas
The discussion flows logically from the initial comparison of base models and post-trained models, highlighting the importance of alignment and steerability. It then delves into the challenges of balancing generalization and truthfulness, and the need for robust evaluation methods. The conversation addresses ethical considerations, such as encoding values and mitigating bias, before exploring potential applications and areas where model capabilities are underutilized. The discussion ends with advice for developers on improving GenAI product usefulness, including the use of few-shot examples and the importance of interpretability.
Data, Research Findings, or Statistics Mentioned
- Reference to a State of AI report with MIT, which interviewed 150 or 300 executives and startup leaders about their use of ChatGPT and internal tools.
- Mention of categories of instruction following that were identified by OpenAI, such as format following, negative instructions, following ordered instructions, and instructions about content.
Synthesis/Conclusion of the Main Takeaways
The main takeaways from the discussion are that while language models have made significant progress, realizing their full potential requires careful consideration of several factors: the importance of post-training alignment, the need to balance generalization with truthfulness, the ethical implications of encoding values, and the challenges of building reliable surrounding systems. Developers should focus on understanding their use cases, prioritizing evals, and leveraging both human and synthetic data to create useful and trustworthy AI products. Furthermore, the increasing interpretability of advanced models presents opportunities for deeper understanding and control.