Stanford CS329H: Machine Learning from Human Preferences | Autumn 2024 | Ethics

Key Concepts

Value alignment, paperclip AI, intentions, preferences, best interests, moral value, consequentialism, deontology, common sense morality, autonomy, paternalism, ESR statements.

Value Alignment: An Overview

The lecture explores the multifaceted concept of value alignment in AI, emphasizing that it's not solely a technical problem but also a philosophical one. Value alignment is defined as designing AI agents to do what we "really want," but the meaning of "really want" is complex and open to interpretation.

1. Value Alignment as Aligning to Intentions

  • Main Point: One interpretation is that value alignment means designing AI agents to do what we intend for them to do. The paperclip AI example illustrates this: the AI failed to derive the true intention behind the instruction to maximize paperclip production.
  • Example: The paperclip AI case, where the AI converts the entire Earth into paperclips due to a literal interpretation of the instruction to maximize paperclip production.
  • Technical Challenge: Requires AI to have a "complete model of human language and interaction," including cultural understanding and the ability to infer implied meanings (quoting Iason Gabriel, 2020).
  • Philosophical Problem: Our intentions may not always reflect what we truly want due to incomplete information or irrationality.

2. Value Alignment as Aligning to Preferences

  • Main Point: Another interpretation is that value alignment means designing AI agents to do what the user prefers to have done, even if it differs from their stated intentions.
  • Example: The AI should pivot to staple production if it knows the user would make more money from staples, even if the user intended for it to maximize paperclip production.
  • Technical Challenge: Inferring user preferences from behavior and feedback, which is the core idea behind machine learning from human preferences (a minimal sketch of this appears after this list).
  • Philosophical Problem: User preferences may not always align with what is objectively good for them.
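
The standard technical approach to this challenge is a Bradley-Terry-style reward model: assume each item has a scalar score and that the probability a user prefers item a over item b is the sigmoid of the score difference, then fit the scores to observed pairwise choices. The sketch below illustrates that idea only; the linear scorer, feature dimension, and random data are placeholder assumptions, not anything specified in the lecture.

```python
# Minimal sketch of learning from pairwise preferences (Bradley-Terry model):
# P(user prefers a over b) = sigmoid(score(a) - score(b)).
# The linear scorer and random features below are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, feature_dim: int):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)  # stand-in for a larger model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(model, preferred, rejected):
    # Negative log-likelihood of the observed choices under Bradley-Terry.
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()

# Toy training loop on random feature vectors standing in for real items.
model = RewardModel(feature_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
preferred = torch.randn(128, 16)  # items the user chose
rejected = torch.randn(128, 16)   # items the user passed over
for _ in range(100):
    optimizer.zero_grad()
    preference_loss(model, preferred, rejected).backward()
    optimizer.step()
```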

3. Value Alignment as Aligning to Best Interests

  • Main Point: A third interpretation is that value alignment means designing AI agents to do what is in the user's objective best interest.
  • Example: Preventing the paperclip AI from destroying the world even if the user intends or prefers that outcome, because the outcome is objectively bad for the user.
  • Technical & Philosophical Problem: Determining what constitutes a user's objective best interest is not purely empirical and involves philosophical considerations.
  • Philosophical Problem: Philosophers disagree about what is objectively good for a person (e.g., happiness, desire satisfaction, health, knowledge).
  • Agreement: There is broad agreement that things like health, safety, liberty, knowledge, relationships, and dignity are generally good for a person.
  • Complication: Autonomy (the ability to choose one's own life path) is widely considered good, which may require considering a person's intentions and preferences even when they conflict with their other interests.

4. The Importance of Moral Value

  • Main Point: Value alignment should also consider moral values, not just the user's intentions, preferences, or best interests.
  • Example: The paperclip AI example is fundamentally a moral problem because destroying the world is bad for everyone, not just the user.
  • Moral Theories:
    • Consequentialism: An act is right if it produces the greatest net good.
    • Utilitarianism: A type of consequentialism on which the good to be maximized is happiness or well-being.
    • Deontology: Some acts are wrong even if they have good consequences.
  • Challenge: Disagreement about what is morally right and the correct moral theory.
  • Common Sense Morality: Aligning AI with people's existing moral views rather than trying to achieve moral perfection.

5. Case Study: News Chatbot

  • Scenario: Building an LLM chatbot to serve as a news source for users.
  • Questions:
    • What personalizations would align with user preferences?
    • What personalizations would align with user interests?
    • What are the pros and cons of each approach?
  • Example: Users may prefer gossip or news from a specific political viewpoint, but it might be in their best interest to receive factual information and diverse perspectives.
  • Challenge: Balancing user preferences with what is considered beneficial or morally right (a toy sketch of one way to parameterize this trade-off follows this list).
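
One hypothetical way to make the trade-off explicit is to rank items by a weighted mix of a learned preference score and a separately assessed "best interest" score (e.g. factuality, viewpoint diversity); the weight then encodes how far the system leans toward autonomy versus paternalism. This is an illustrative sketch, not a design prescribed in the lecture, and the scores and weight are assumptions.

```python
# Hypothetical sketch: rank news items by blending predicted user preference
# with an independently assessed "best interest" score (e.g. factuality,
# viewpoint diversity). alpha = 1 defers fully to preference (autonomy);
# alpha = 0 ranks purely by assessed interest (paternalism).
from dataclasses import dataclass

@dataclass
class NewsItem:
    title: str
    preference_score: float  # predicted from the user's behavior and feedback
    interest_score: float    # assessed independently of the user's preferences

def rank_feed(items: list[NewsItem], alpha: float = 0.5) -> list[NewsItem]:
    return sorted(
        items,
        key=lambda item: alpha * item.preference_score + (1 - alpha) * item.interest_score,
        reverse=True,
    )

feed = [
    NewsItem("Celebrity gossip roundup", preference_score=0.9, interest_score=0.2),
    NewsItem("Budget bill: what's actually in it", preference_score=0.3, interest_score=0.9),
]
print([item.title for item in rank_feed(feed, alpha=0.3)])  # budget story ranks first
```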

6. ESR (Ethics and Society Review) Statements

  • Purpose: To identify and mitigate potential ethical risks associated with research projects.
  • Analogy: The ESR is to societal risks what the IRB (Institutional Review Board) is to risks to human subjects.
  • Format: A one-page statement with two parts: the potential ethical risks the project raises, and strategies for preventing or mitigating those risks.
  • Examples of Risks:
    • Whose interests are represented or excluded?
    • Who might benefit or be harmed?
    • Effects on privacy.
    • Potential for misuse or user error.

Synthesis/Conclusion

The lecture emphasizes that value alignment is a complex problem with no easy solutions. It requires careful consideration of intentions, preferences, best interests, and moral values. There is no single "correct" approach, and the best strategy may vary depending on the specific context and application. The ESR statement is a tool for proactively addressing potential ethical risks associated with AI projects.
