Give Your Chat Agent a Voice — Luke Harries, ElevenLabs

By AI Engineer

Key Concepts

  • Voice Engine: A new primitive designed to wrap existing chat agents with voice capabilities (Speech-to-Text and Text-to-Speech) without requiring a full system rebuild.
  • Agent Orchestration: The backend architecture combining LLMs, RAG (Retrieval-Augmented Generation), and tool-calling integrations.
  • Turn-taking: The process of managing conversational flow, including detecting pauses and emotional context.
  • Omni-channel Interaction: The ability to deploy voice agents across various platforms, including web widgets, phone lines, and live video calls.
  • Proxying: The technique of routing voice input through a wrapper to an existing chat agent’s logic, so the agent itself does not change.
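
To make the turn-taking concept above concrete, here is a deliberately simplified sketch of pause detection. It is not the emotion-aware system described in the talk: it swaps in a plain silence-threshold rule, and the function name, the per-frame energy values, and both parameters are assumptions made for illustration.

```python
from typing import Iterable, List


def find_turn_ends(
    frame_energies: Iterable[float],
    silence_threshold: float = 0.1,   # hypothetical: energy below this counts as silence
    min_silent_frames: int = 3,       # hypothetical: this many silent frames in a row end a turn
) -> List[int]:
    """Return the frame indices at which a speaker's turn is judged to have ended."""
    ends: List[int] = []
    silent_run = 0
    heard_speech = False
    for index, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            # Speech frame: the user is (still) talking, so reset the pause counter.
            heard_speech = True
            silent_run = 0
        else:
            silent_run += 1
            # Only close a turn if there was speech before the pause.
            if heard_speech and silent_run == min_silent_frames:
                ends.append(index)
                heard_speech = False
    return ends


# A short dip (index 3) is ignored; the two longer pauses each end a turn.
energies = [0.0, 0.5, 0.6, 0.05, 0.7, 0.0, 0.0, 0.0, 0.4, 0.0, 0.0, 0.0]
turn_ends = find_turn_ends(energies)
print(turn_ends)  # [7, 11]
```

A fixed threshold like this cuts people off mid-thought or waits too long, which is the gap that semantic and emotion-aware turn-taking is meant to close.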

1. The Evolution from Chat to Voice

The speaker argues that while 2025 was the year of the "chat agent" (with companies like Linear and PostHog adopting chat-first interfaces), the future lies in voice.

  • Advantages of Voice: It is faster, more interactive, and significantly more accessible for users with dyslexia or those who struggle with traditional keyboard interfaces.
  • Strategic Shift: The speaker posits that existing chat agents must evolve into voice agents to remain relevant, moving from simple text interfaces to dynamic, multi-modal interactions.

2. The "Voice Engine" Solution

Many developers have already invested significant time in building, evaluating, and refining their chat agents. The speaker introduces Voice Engine so that these developers do not have to "rip and replace" their existing infrastructure.

  • Technical Components:
    • Speech-to-Text (STT): Uses Scribe, which the speaker describes as the most accurate model available.
    • Text-to-Speech (TTS): Uses the v3 models, with support for thousands of voices and many languages.
    • Advanced Turn-taking: An emotion-aware system that detects natural pauses and semantic context to ensure fluid conversation.
  • Developer Experience: The system is designed for simplicity. Developers use a server SDK to create a "wrapper" around their existing agent. This wrapper proxies voice sessions directly to the existing chat logic, meaning the agent’s original tool-calling and RAG capabilities remain intact.
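
The proxying idea in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the Voice Engine SDK: `VoiceWrapper`, `handle_turn`, and the stand-in speech functions are hypothetical names, and the fake STT and TTS simply convert between bytes and text so the example runs without any external service.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type aliases; none of these come from a real SDK.
Transcriber = Callable[[bytes], str]   # speech-to-text: audio in, text out
Synthesizer = Callable[[str], bytes]   # text-to-speech: text in, audio out
ChatAgent = Callable[[str], str]       # the existing chat agent, unchanged


@dataclass
class VoiceWrapper:
    """Proxies one voice turn through an existing text-only chat agent."""

    transcribe: Transcriber
    agent: ChatAgent
    synthesize: Synthesizer
    transcript: List[Tuple[str, str]] = field(default_factory=list)

    def handle_turn(self, audio_in: bytes) -> bytes:
        user_text = self.transcribe(audio_in)       # 1. voice -> text
        reply_text = self.agent(user_text)          # 2. existing agent logic (RAG, tools) runs here
        self.transcript.append((user_text, reply_text))
        return self.synthesize(reply_text)          # 3. text -> voice


# Stand-ins so the sketch is runnable: "audio" is just UTF-8 bytes.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def existing_chat_agent(message: str) -> str:
    # Placeholder for the agent a team has already built and evaluated.
    return f"You said: {message}"


wrapper = VoiceWrapper(fake_stt, existing_chat_agent, fake_tts)
audio_out = wrapper.handle_turn(b"reset my password")
print(audio_out.decode("utf-8"))  # You said: reset my password
```

The design point is that `existing_chat_agent` is passed in untouched: the wrapper only adds a speech step on each side of it, which is why the agent's own tool-calling and retrieval keep working.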

3. Implementation Framework

The integration process is designed to be minimal:

  1. Server SDK: Initialize the client and the Voice Engine, then attach the wrapper to the existing chat agent.
  2. Client SDK: A three-line implementation to add a voice widget to a website.
  3. UI Components: Pre-built components styled similarly to Shadcn/Vercel for rapid deployment.
  4. Automation: The speaker demonstrated a "one-prompt" approach where an AI agent analyzes an existing codebase, identifies the chat agent, and automatically writes the necessary wrapper code.

4. Handling Tool Calling

A critical concern for developers is how voice affects existing tool-calling logic. The speaker clarifies:

  • Backend Continuity: Because the Voice Engine acts as a proxy, the existing chat agent continues to handle the majority of tool-calling logic on the backend.
  • Front-end Flexibility: ElevenLabs supports both client-side and server-side tools, allowing for advanced interactions like manipulating the DOM directly from the voice interface.
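
One way to picture the split between server-side and client-side tools is a small router, sketched below. Everything here is an assumption for illustration: `ToolRouter`, the tool names, and the `outbox` list (standing in for messages that would be pushed to the browser) are hypothetical and do not reflect the ElevenLabs API.

```python
from typing import Any, Callable, Dict, List, Set


class ToolRouter:
    """Runs server tools in-process and forwards client tools to the front end."""

    def __init__(self) -> None:
        self._server_tools: Dict[str, Callable[..., Any]] = {}
        self._client_tools: Set[str] = set()
        # Hypothetical stand-in for a channel to the browser (e.g. a websocket).
        self.outbox: List[Dict[str, Any]] = []

    def register_server(self, name: str, fn: Callable[..., Any]) -> None:
        self._server_tools[name] = fn

    def register_client(self, name: str) -> None:
        self._client_tools.add(name)

    def call(self, name: str, **args: Any) -> Dict[str, Any]:
        if name in self._server_tools:
            # Backend continuity: the existing tool runs where it always ran.
            return {"where": "server", "result": self._server_tools[name](**args)}
        if name in self._client_tools:
            # Front-end tool: queue an instruction for the page to execute.
            self.outbox.append({"tool": name, "args": args})
            return {"where": "client", "result": None}
        raise KeyError(f"unknown tool: {name}")


router = ToolRouter()
router.register_server(
    "lookup_order",
    lambda order_id: {"order_id": order_id, "status": "shipped"},
)
router.register_client("highlight_element")

server_reply = router.call("lookup_order", order_id="A123")
client_reply = router.call("highlight_element", selector="#billing")

print(server_reply)   # {'where': 'server', 'result': {'order_id': 'A123', 'status': 'shipped'}}
print(router.outbox)  # [{'tool': 'highlight_element', 'args': {'selector': '#billing'}}]
```

In a real deployment the client-side entry would be executed by browser code, for example to highlight or scroll to a DOM element while the agent is speaking.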

5. Real-World Applications

  • Customer Support: Replacing or augmenting traditional support with voice-enabled phone lines.
  • Live Assistance: Agents capable of joining Zoom calls to provide real-time fact-checking or data correction.
  • Accessibility: Providing a more natural interface for users who find text-based chat cumbersome.

Synthesis and Conclusion

The core takeaway is that the industry is moving toward higher-level abstractions. Instead of building voice systems from scratch, developers should focus on "wrapping" their existing, high-quality chat agents with a dedicated voice layer. By treating the voice engine as a first-class primitive, developers can unlock omni-channel capabilities, such as telephony and web-based voice widgets, with minimal code changes. The speaker concludes with a strong prediction: chat agents will either evolve to include voice or become obsolete.
