Stanford CS153 Frontier Systems | Mati Staniszewski from ElevenLabs on The Future of Voice Systems

By Stanford Online

Key Concepts

Cascaded Architecture: A system design where separate models (Speech-to-Text, LLM, Text-to-Speech) are chained together to perform a task.
Fused Architecture: A unified model approach where inputs are processed end-to-end without intermediate text steps, often prioritizing low latency.
Voice AI/Audio AI: The field of generating, transcribing, and manipulating human speech using machine learning.
Product-Led Growth (PLG): A business strategy where the product itself is the primary driver of customer acquisition and expansion.
AI Dubbing: The process of translating audio content into another language while preserving the original speaker's voice, intonation, and emotion.
Distillation Attacks: Attempts to extract knowledge or replicate the capabilities of a proprietary model by querying it and training a smaller model on the outputs.

1. Evolution of ElevenLabs and Problem Obsession

Founded by former Google and Palantir employees, ElevenLabs grew out of frustration with the poor-quality, monotone voice-overs used for foreign films in Poland. The founders were "problem-obsessed" and initially attempted to build a full AI dubbing pipeline, but they realized the market first needed high-quality, natural-sounding Text-to-Speech (TTS). By 2022, they had narrowed their focus to the "last mile" of generation (making audio sound human, emotional, and context-aware) rather than trying to innovate on transcription and translation models at the same time.

2. Technical Frameworks: Cascaded vs. Fused

Cascaded Approach: Currently favored by ElevenLabs for enterprise use cases. It offers high reliability and modularity, and it allows guardrails to be inserted at each step (Transcription → LLM → TTS).
Fused Approach: Offers lower latency (potentially ~300ms) but sacrifices reliability and interpretability.
Future Outlook: The company is exploring a hybrid model where simple interactions might use fused architectures, while complex, authenticated tasks (like booking a flight) utilize the more reliable cascaded stack.
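
The trade-off described above can be illustrated with a minimal Python sketch. It is not ElevenLabs' implementation: the `CascadedPipeline` class, the `choose_architecture` router, and the string-based stand-ins for speech-to-text, the LLM, and text-to-speech are all hypothetical names invented for this example. The sketch only shows why a cascaded design makes it easy to place a guardrail between stages, and how a hybrid system might route simple requests to a fused model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types: every stage maps text to text, every guardrail
# returns True when the intermediate text is acceptable.
Stage = Callable[[str], str]
Guardrail = Callable[[str], bool]


@dataclass
class CascadedPipeline:
    transcribe: Stage   # stand-in for a speech-to-text model
    respond: Stage      # stand-in for an LLM
    synthesize: Stage   # stand-in for a text-to-speech model
    guardrails: List[Guardrail]

    def _check(self, text: str, stage: str) -> None:
        # Intermediate text is inspectable, so checks can run between stages.
        for is_ok in self.guardrails:
            if not is_ok(text):
                raise ValueError(f"guardrail rejected {stage} output")

    def run(self, audio: str) -> str:
        text = self.transcribe(audio)
        self._check(text, "transcription")
        reply = self.respond(text)
        self._check(reply, "llm")
        return self.synthesize(reply)


def choose_architecture(requires_auth: bool, is_complex: bool) -> str:
    # Assumed routing rule: anything authenticated or complex goes to the
    # more reliable cascaded stack, everything else to the fused model.
    return "cascaded" if (requires_auth or is_complex) else "fused"


# Toy stand-ins so the sketch runs without any real models.
pipeline = CascadedPipeline(
    transcribe=lambda audio: audio.replace("audio:", "", 1),
    respond=lambda text: f"You said: {text}",
    synthesize=lambda reply: f"<speech>{reply}</speech>",
    guardrails=[lambda text: "password" not in text.lower()],
)

print(pipeline.run("audio:book a flight"))
print(choose_architecture(requires_auth=True, is_complex=False))
```

A fused model has no intermediate text to inspect, which is one way to read the reliability and interpretability cost mentioned above.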

3. Key Milestones and Research

2022: Breakthrough in context-aware TTS, moving away from hard-coded parameters (age, gender) toward abstract, model-defined voice characteristics.
2023: Expansion into voice cloning, a voice marketplace, and creative tools for authors.
2024: Integration of transcription, LLM translation, and speech generation for high-quality AI localization (e.g., dubbing world leaders like Javier Milei and Narendra Modi).
2025: Real-time voice agents capable of detecting and responding to user emotions (e.g., responding in a reassuring tone to a stressed caller).

4. Business Strategy and Scaling

Revenue Growth: The company scaled to over $430M ARR in 36 months. Growth is driven by a mix of PLG (self-serve) and enterprise deployments.
Pricing Philosophy: "Never start from the cost of running the model; start from the value delivered to the customer." The goal is to capture roughly one-tenth of the value provided.
Organizational Structure: Maintaining small, autonomous teams (under 10 people) with high ownership to ensure rapid iteration and decision-making.
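
The pricing rule of thumb above reduces to simple arithmetic. The sketch below assumes a hypothetical `value_based_price` helper and an illustrative customer value; neither comes from the talk, and the 0.1 default only encodes the "roughly one-tenth" figure.

```python
def value_based_price(customer_value: float, capture_ratio: float = 0.1) -> float:
    """Price derived from value delivered, not from the cost of running the model."""
    if customer_value < 0:
        raise ValueError("customer_value must be non-negative")
    if not 0 <= capture_ratio <= 1:
        raise ValueError("capture_ratio must be between 0 and 1")
    return customer_value * capture_ratio


# Illustrative only: a deployment assumed to save a customer 50,000 per year
# would be priced at about 5,000 per year under this rule.
annual_price = value_based_price(50_000)
```

Note that model serving cost does not appear as an input, which mirrors the stated philosophy of starting from customer value.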

5. Safety, Security, and Ethics

Proactive Moderation: ElevenLabs builds safety into its models, including watermarking and the ability to trace generated content back to its source.
Voice Authentication: The CEO explicitly advises against using voice as a primary security factor for banking, as it is not sufficiently secure against modern AI synthesis.
Counter-Offensive Use: The company has explored using voice agents to "troll" scammers, wasting their time and resources as a defensive measure.

6. Notable Quotes

"Technology adopted by the community will show you use cases that might diffuse to the rest of the world 6, 12, 18 months later." (Mati Staniszewski, on the importance of being close to developers and creators.)
"You can go further together, especially in a new space like this where often what seems like a competitive project... [is] largely artificial constructs." (Andrew, on the importance of collaboration between AI startups like ElevenLabs and Sesame.)

7. Synthesis and Conclusion

ElevenLabs has grown from a niche Discord bot into a foundational platform for audio AI. Its success is attributed to a "middle-to-middle" approach: focusing on iterative, high-quality creative tools rather than "end-to-end" black-box solutions. By prioritizing reliability and emotional expressivity, the company has positioned itself as a critical infrastructure layer for businesses. The future of the field lies in the convergence of modalities, the refinement of emotional intelligence in agents, and the establishment of industry-wide standards for safety and watermarking.
