Stanford CS153 Frontier Systems | Mati Staniszewski from ElevenLabs on The Future of Voice Systems
By Stanford Online
Key Concepts
- Cascaded Architecture: A system design where separate models (Speech-to-Text, LLM, Text-to-Speech) are chained together to perform a task.
- Fused Architecture: A unified model approach where inputs are processed end-to-end without intermediate text steps, often prioritizing low latency.
- Voice AI/Audio AI: The field of generating, transcribing, and manipulating human speech using machine learning.
- Product-Led Growth (PLG): A business strategy where the product itself is the primary driver of customer acquisition and expansion.
- AI Dubbing: The process of translating audio content into another language while preserving the original speaker's voice, intonation, and emotion.
- Distillation Attacks: Attempts to extract knowledge or replicate the capabilities of a proprietary model by querying it and training a smaller model on the outputs.
1. Evolution of ElevenLabs and Problem Obsession
ElevenLabs was founded by former Google and Palantir employees out of frustration with the poor-quality, monotone voice-overs used for foreign films in Poland. The founders were "problem-obsessed" and initially tried to build a full AI dubbing pipeline. They realized that the market first needed high-quality, natural-sounding Text-to-Speech (TTS). By 2022, they focused on the "last mile" of generation (making audio sound human, emotional, and context-aware) rather than trying to innovate on transcription and translation models at the same time.
2. Technical Frameworks: Cascaded vs. Fused
- Cascaded Approach: Currently favored by ElevenLabs for enterprise use cases. It offers high reliability and modularity, and guardrails can be inserted at each step (Transcription → LLM → TTS).
- Fused Approach: Offers lower latency (potentially ~300ms) but sacrifices reliability and interpretability, since there is no intermediate text step to inspect.
- Future Outlook: The company is exploring a hybrid model where simple interactions might use fused architectures, while complex, authenticated tasks (like booking a flight) utilize the more reliable cascaded stack.
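The cascaded flow and the hybrid routing idea above can be sketched in a few lines. This is a minimal illustration, not ElevenLabs' implementation: the three stage functions are hypothetical stand-ins that only pass text through, the guardrail is a toy digit-masking rule, and the routing criteria (requires_auth, multi_step) are assumptions chosen to mirror the flight-booking example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    """One conversational turn, with the intermediate text kept for inspection."""
    audio_in: bytes
    transcript: str = ""
    reply_text: str = ""
    audio_out: bytes = b""
    trace: List[str] = field(default_factory=list)


# Hypothetical stand-ins for real models; each one only passes text through.
def speech_to_text(audio: bytes) -> str:
    return audio.decode("utf-8")


def llm_reply(transcript: str) -> str:
    return f"You said: {transcript}"


def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")


def redact_long_numbers(text: str) -> str:
    """Toy guardrail (an assumption): mask digit runs of 12 or more characters."""
    return re.sub(r"\d{12,}", "[REDACTED]", text)


def run_cascaded(
    audio: bytes, guardrail: Callable[[str], str] = redact_long_numbers
) -> Turn:
    """Run STT -> LLM -> TTS, applying the guardrail to each text hand-off."""
    turn = Turn(audio_in=audio)
    turn.transcript = guardrail(speech_to_text(audio))
    turn.trace.append("stt")
    turn.reply_text = guardrail(llm_reply(turn.transcript))
    turn.trace.append("llm")
    turn.audio_out = text_to_speech(turn.reply_text)
    turn.trace.append("tts")
    return turn


def choose_architecture(requires_auth: bool, multi_step: bool) -> str:
    """Hypothetical router: reserve the fused path for simple interactions."""
    return "cascaded" if (requires_auth or multi_step) else "fused"
```

Because every hand-off between stages is plain text, each one can be logged, filtered, or tested on its own. A fused model has no such hand-off to hook into, which is the trade-off described above.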
3. Key Milestones and Research
- 2022: Breakthrough in context-aware TTS, moving away from hard-coded parameters (age, gender) toward abstract, model-defined voice characteristics.
- 2023: Expansion into voice cloning, a voice marketplace, and creative tools for authors.
- 2024: Integration of transcription, LLM translation, and speech generation for high-quality AI localization (e.g., dubbing speeches by world leaders such as Javier Milei and Narendra Modi).
- 2025: Real-time voice agents capable of detecting and responding to user emotions (e.g., responding in a reassuring tone to a stressed caller).
4. Business Strategy and Scaling
- Revenue Growth: The company scaled to over $430M ARR in 36 months. Growth is driven by a mix of PLG (self-serve) and enterprise deployments.
- Pricing Philosophy: "Never start from the cost of running the model; start from the value delivered to the customer." The goal is to capture roughly one-tenth of the value provided.
- Organizational Structure: Maintaining small, autonomous teams (under 10 people) with high ownership to ensure rapid iteration and decision-making.
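The pricing philosophy above comes down to simple arithmetic, sketched here next to a cost-plus calculation for contrast. The function names, the 50,000 value figure, and the cost and markup numbers are all hypothetical; only the one-tenth share comes from the talk.

```python
def value_based_price(customer_value: float, share: float = 0.10) -> float:
    """Price as a share of the value delivered (one-tenth by default)."""
    if customer_value < 0:
        raise ValueError("customer_value must be non-negative")
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return customer_value * share


def cost_plus_price(serving_cost: float, markup: float) -> float:
    """The approach the quote argues against: start from the cost of serving."""
    return serving_cost * (1 + markup)


# Hypothetical customer: the product saves them 50,000 per year,
# while serving them costs 400 per year.
by_value = value_based_price(50_000.0)
by_cost = cost_plus_price(400.0, markup=0.5)
```

With these made-up figures the value-based price is about 5,000 per year, while a 50% markup on cost gives 600. The gap shows why the starting point of the calculation matters.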
5. Safety, Security, and Ethics
- Proactive Moderation: ElevenLabs builds safety into its models, including watermarking and the ability to trace generated content back to its source.
- Voice Authentication: The CEO explicitly advises against using voice as a primary security factor for banking, as it is not sufficiently secure against modern AI synthesis.
- Counter-Offensive Use: The company has explored using voice agents to "troll" scammers, wasting their time and resources as a defensive measure.
6. Notable Quotes
- "Technology adopted by the community will show you use cases that might diffuse to the rest of the world 6, 12, 18 months later." (Mati Staniszewski, on the importance of being close to developers and creators)
- "You can go further together, especially in a new space like this where often what seems like a competitive project... [is] largely artificial constructs." (Andrew, on the importance of collaboration between AI startups like ElevenLabs and Sesame)
7. Synthesis and Conclusion
ElevenLabs has grown from a niche Discord bot into a foundational platform for audio AI. Its success is attributed to a "middle-to-middle" approach: focusing on iterative, high-quality creative tools rather than "end-to-end" black-box solutions. By prioritizing reliability and emotional expressivity, the company has positioned itself as a critical infrastructure layer for businesses. The future of the field lies in the convergence of modalities, the refinement of emotional intelligence in agents, and the establishment of industry-wide standards for safety and watermarking.