Voice AI: when is the "Her" moment? — Neil Zeghidour, Gradium AI

By AI Engineer

Key Concepts

  • Voice AI: The domain of training models for speech-to-text (STT), text-to-speech (TTS), and speech-to-speech (S2S) interactions.
  • Cascaded Systems: Traditional architectures using separate blocks (STT → LLM → TTS).
  • Full Duplex: A communication mode allowing simultaneous speaking and listening, essential for natural human-like conversation.
  • Back-channeling: The linguistic practice of providing feedback (e.g., "mhm," "uh-huh") while another person is speaking to signal active listening.
  • Paralinguistic Understanding: The ability of an AI to interpret tone, pitch, and emotional cues beyond the literal text.
  • On-device AI: Models that run locally on hardware (e.g., smartphone CPUs) rather than in the cloud, which keeps user data on the device and avoids per-call API fees.

1. The Current State of Voice AI

The speaker argues that while voice AI has made significant progress, it remains far from the seamless, human-like interaction depicted in the movie Her. Current systems, even high-end ones like those from ElevenLabs, suffer from:

  • High Latency: The total stack (understanding, processing, and pronouncing) often exceeds the 200ms threshold required for natural human conversation.
  • Lack of Full Duplex: Most models are "half-duplex," meaning they cannot handle interruptions or simultaneous speech, leading to awkward, robotic interactions.
  • "Glorified Text Models": Current agents are often just LLMs with a voice wrapper, meaning they lack the ability to leverage non-textual cues (tone, emotion, urgency).
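The latency point can be made concrete with a small budget calculation. The sketch below adds up per-stage delays for a cascaded stack and compares the total with the 200ms threshold mentioned above. The stage names and the individual numbers are illustrative assumptions, not measurements from the talk.

```python
# Latency budget sketch for a cascaded voice stack.
# All per-stage values below are hypothetical, chosen only for illustration.

HUMAN_TURN_GAP_MS = 200.0  # threshold cited in the summary above


def total_latency_ms(stages: dict) -> float:
    """Sum the latency of every stage; stages run one after another."""
    return sum(stages.values())


def over_budget_ms(stages: dict, budget_ms: float = HUMAN_TURN_GAP_MS) -> float:
    """Return how far the stack exceeds the budget (0 if it fits)."""
    return max(0.0, total_latency_ms(stages) - budget_ms)


# Hypothetical stage latencies in milliseconds.
example_stages = {
    "stt": 150.0,              # speech-to-text finalizes the user turn
    "llm_first_token": 250.0,  # LLM produces its first token
    "tts_first_audio": 120.0,  # TTS emits its first audio chunk
}

if __name__ == "__main__":
    print("total:", total_latency_ms(example_stages), "ms")
    print("over budget by:", over_budget_ms(example_stages), "ms")
```

Because the stages run sequentially, shaving one component rarely helps unless the sum of all of them drops under the budget.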

2. Methodologies: Cascaded vs. Speech-to-Speech

  • Cascaded Systems: These are the industry standard. While practical and reliable for tool-calling, they are inherently limited by the latency of each individual component. The speaker notes that the "bottleneck" is often the tool-calling latency (500ms to 4s), which makes real-time conversation difficult.
  • Speech-to-Speech (S2S): This approach replaces the three-block stack with a single model. While it reduces latency, most S2S models (excluding the speaker's project, Moshi) are half-duplex.
  • The "Filler" Strategy: To mitigate tool-call latency, the speaker suggests using "fillers": short spoken phrases that let the LLM keep the conversation flowing naturally while it waits for a tool to return data.
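One way to picture the filler strategy is as a timeout around the tool call: if the tool has not returned within a short window, the agent says something while it keeps waiting. The sketch below is a minimal illustration using Python's asyncio. The function names, the filler phrase, and the timings are all assumptions made for this example, and the `spoken` list stands in for a real TTS output.

```python
import asyncio


async def slow_tool_call(delay_s: float) -> str:
    """Hypothetical tool: pretends to fetch data after delay_s seconds."""
    await asyncio.sleep(delay_s)
    return "tool result"


async def respond_with_filler(tool_delay_s: float, filler_after_s: float,
                              spoken: list) -> str:
    """Start the tool call; speak a filler if it is still pending after
    filler_after_s seconds, then speak the result once it arrives."""
    task = asyncio.create_task(slow_tool_call(tool_delay_s))
    # asyncio.wait returns (done, pending) and does not cancel on timeout.
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        spoken.append("Let me check that for you.")  # hypothetical filler
    result = await task
    spoken.append(result)
    return result


def run_demo(tool_delay_s: float, filler_after_s: float) -> list:
    """Run one turn and return everything the agent 'said', in order."""
    spoken = []
    asyncio.run(respond_with_filler(tool_delay_s, filler_after_s, spoken))
    return spoken


if __name__ == "__main__":
    print(run_demo(tool_delay_s=0.2, filler_after_s=0.02))
    print(run_demo(tool_delay_s=0.0, filler_after_s=0.05))
```

A fast tool produces no filler at all, so the phrase only appears when the wait would otherwise be noticeable.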

3. The "Moshi" Framework

The speaker highlights Moshi, a project developed by their lab, as a breakthrough in full-duplex interaction.

  • Key Advantage: It allows for constant overlapping speech, enabling natural back-channeling.
  • Limitations: While it excels at conversational flow, it lacks the "intelligence" and "observability" of cascaded systems. It is difficult to use in production because it lacks robust tool-calling and safety guardrails.

4. Scalability and Economic Challenges

A major argument presented is that current voice AI is economically unsustainable for mass-market consumer apps.

  • The Cost Problem: TTS is the most expensive component of the stack. Many companies are "burning their fundraising" on API bills for voice services.
  • The Solution: Moving toward on-device processing. By running small models (under 100 million parameters) such as Gradium's Phonon on a smartphone CPU, developers can eliminate API fees and enhance user privacy by keeping data local.
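A rough weight-size estimate shows why a model under 100 million parameters is plausible on a phone. The sketch below multiplies the parameter count by the bytes used per parameter. The precisions listed are assumptions for illustration (the summary does not say which one is used), and the figures cover weights only, not activations or runtime overhead.

```python
def model_size_mb(num_params: int, bytes_per_param: float) -> float:
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


# Upper bound on parameter count taken from the summary above.
PARAMS = 100_000_000

# Hypothetical storage precisions and their size in bytes per parameter.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

sizes = {name: model_size_mb(PARAMS, b) for name, b in BYTES_PER_PARAM.items()}

if __name__ == "__main__":
    for name, mb in sizes.items():
        print(f"{name}: about {mb:.0f} MB of weights")
```

Under these assumptions the weights occupy a few hundred megabytes at most, which is well within the memory of a current smartphone.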

5. Notable Quotes

  • "The latency is still quite high. The ability to handle simultaneous speaking between the user and the system is not there."
  • "As long as we don't have the same level of reliability, intelligence, and personalization as cascaded systems, I don't see a path towards [S2S models] replacing them."
  • "Voice is very challenging. The last mile is going to be the most difficult to solve."

6. Synthesis and Conclusion

The speaker concludes that the industry is at a crossroads. While the "Her" moment is the goal, the path forward requires a hybrid approach:

  1. Science & Engineering: We must bridge the gap between the natural, full-duplex flow of models like Moshi and the reliability/tool-calling capabilities of cascaded systems.
  2. Economic Viability: The future of voice AI lies in on-device, local processing to solve the dual problems of high API costs and data privacy.
  3. Beyond Commodities: The speaker rejects the notion that voice AI is a "commodity," asserting that the technical complexity of achieving human-level interaction remains a significant, unsolved engineering challenge.
