Building Effective Voice Agents — Toki Sherbakov + Anoop Kotha, OpenAI

By AI Engineer


Key Concepts

  • Audio Agents: Voice-based applications leveraging speech-to-speech models for interactive experiences.
  • Chained Architecture: Traditional approach of stitching together separate models for transcription, LLM intelligence, and text-to-speech.
  • Speech-to-Speech Architecture: Emerging pattern using a single model (e.g., OpenAI's real-time API) to perform transcription, intelligence, and speech output.
  • Real-time API: OpenAI's single model for speech-to-speech, enabling low-latency audio agent experiences.
  • Tool Use: Integrating external tools and APIs into audio agents to extend their capabilities and access external data.
  • Evals & Guardrails: Evaluation metrics and safety mechanisms to ensure audio agent accuracy, reliability, and responsible behavior.
  • Multimodal Era: The shift towards AI models that can understand and generate content across multiple modalities, including text, images, video, and audio.

Building Practical Audio Agents

Introduction

The presentation focuses on the current state and best practices for building practical audio agents, highlighting the advancements in speech-to-speech technology and emerging architectural patterns. It emphasizes the shift from text-based GenAI applications to multimodal experiences incorporating audio, images, and video.

The Evolution of Audio Models

  • Past Limitations: Audio models were previously slow, robotic, and brittle, making it difficult to build high-quality production applications.
    • Example: A demonstration of an older model (less than six months prior) showed significant latency, robotic voice, and inability to handle interruptions.
  • Current Capabilities: Modern speech-to-speech models are now faster, more expressive, and more accurate, enabling the creation of reliable and engaging audio agent experiences.
    • Example: A demonstration of a current model showcased low latency, emotional expressiveness, and the ability to understand and respond to interruptions.
  • Tipping Point: The models are now "good enough" to build scalable production applications.

Architectural Patterns

  • Chained Architecture (Traditional):
    • Involves stitching together three separate models: audio-to-text (transcription), LLM (intelligence), and text-to-speech.
    • Problems: Slower processing time, semantic loss due to multiple conversions.
  • Speech-to-Speech Architecture (Emerging):
    • Uses a single model (e.g., OpenAI's real-time API) to perform transcription, intelligence, and speech output.
    • Benefits: Simplified architecture, reduced latency, maintains semantic understanding across conversations.
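The chained pattern can be sketched as three sequential stages per turn. The function bodies below are placeholders standing in for real model calls (speech-to-text, chat completion, text-to-speech); only the shape of the pipeline is the point.

```python
# Sketch of the chained architecture: three sequential stages per turn.
# The stage bodies are placeholders, not real model calls.

def transcribe(audio: bytes) -> str:
    """Audio-to-text stage (would call an STT model)."""
    return audio.decode("utf-8")  # placeholder: treat the audio as text

def respond(user_text: str) -> str:
    """LLM stage (would call a chat model)."""
    return f"You said: {user_text}"  # placeholder reply

def synthesize(reply_text: str) -> bytes:
    """Text-to-speech stage (would call a TTS model)."""
    return reply_text.encode("utf-8")  # placeholder audio

def chained_turn(audio_in: bytes) -> bytes:
    # Each hop adds latency, and tone/emphasis in the original audio is
    # lost at the transcription step -- the semantic loss noted above.
    return synthesize(respond(transcribe(audio_in)))
```

A speech-to-speech model collapses these three calls into one, which is where the latency and fidelity gains come from.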

Key Trade-offs and Considerations

When building audio agents, it's crucial to consider the trade-offs across these five areas:

  1. Latency: The speed of response.
  2. Cost: The expense of running the model.
  3. Accuracy and Intelligence: The correctness and depth of understanding.
  4. User Experience: The overall quality and engagement of the interaction.
  5. Integrations and Tooling: The ability to connect with external systems and data sources.
  • Consumer-Facing Applications:
    • Prioritize user experience and low latency.
    • Cost is less of a concern.
    • Accuracy is less critical than expressiveness.
    • Simple integrations.
    • The real-time API is well-suited.
  • Customer Service Applications:
    • Prioritize accuracy and intelligence.
    • Integrations and tooling are crucial.
    • User experience is still important, but can be slightly compromised for accuracy.
    • Latency is less critical than accuracy.
    • Cost savings are a major driver.
    • The real-time API may be suitable for low-latency needs, but the chained architecture may be preferred for determinism and accuracy.

Agent Definition

An agent is defined as a model, a set of instructions (prompts), the tools given to the model, and the runtime environment (guardrails and execution).
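That definition can be written down as a small data structure. The class below is an illustrative sketch of the four components, not the Agents SDK's actual API; the model name is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class VoiceAgent:
    """An agent per the definition above: model + instructions +
    tools + runtime environment (guardrails and execution)."""
    model: str                # e.g. a real-time speech-to-speech model
    instructions: str         # the prompt
    tools: list[Callable] = field(default_factory=list)       # external capabilities
    guardrails: list[Callable] = field(default_factory=list)  # runtime safety checks

# A frontline agent with no tools yet -- tools get added as needed.
frontline = VoiceAgent(
    model="realtime-speech-model",  # placeholder model name
    instructions="Greet the caller and route their request.",
)
```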

System Design

  • Delegation through Tools: Use the real-time API as a frontline agent for common interactions and delegate complex tasks to specialized agents powered by smarter models.
    • Example: A customer service agent uses the real-time API for initial interactions but delegates a return request to a more specialized agent using a smaller text model (e.g., GPT-4o mini).
  • Video Example: A demonstration shows a customer returning a snowboard. The real-time API agent handles the initial request, then delegates to a smaller model to process the return policy.

Prompting and Customization

  • Controlling Expressiveness: In voice-based applications, prompts can control not only the instructions but also the expressiveness of the voice (demeanor, tone, enthusiasm).
  • Few-Shot Examples: Mimic the use of few-shot examples in text-based prompting by providing examples of desired conversation flows (greeting, description, instructions).
  • Website Resource: A website is mentioned as a resource for experimenting with different voices and sample prompts.
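A prompt along these lines might combine expressiveness controls with a few-shot example of the desired conversation flow. The wording below is illustrative, not a sample from the talk.

```python
# Illustrative voice-agent prompt: expressiveness controls (demeanor,
# tone, enthusiasm) plus a few-shot example of the desired flow.
VOICE_PROMPT = """\
Demeanor: warm and patient. Tone: conversational, never robotic.
Enthusiasm: moderate. Pacing: unhurried.

Example exchange:
Caller: Hi, I'd like to return an item.
Agent: Of course! I can help with that. Could you tell me the order number?
"""
```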

Tool Use

  • Start Simple: Begin with a limited number of tools per agent and gradually add more as needed.
  • Handoffs: When transferring a conversation between agents, summarize the conversation state to maintain context.
  • Delegation: Use models like o3-mini or GPT-4o mini for tool calls, as natively supported in the Agents SDK.
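Summarizing conversation state at handoff can be as simple as collapsing the turn history into a brief that the receiving agent gets as context. The helper below is a hypothetical sketch; a production system would typically have a model write the summary.

```python
def summarize_for_handoff(history: list[dict]) -> str:
    """Collapse the turn history into a short context brief for the
    next agent (placeholder for a model-written summary)."""
    lines = [f"{turn['role']}: {turn['text']}" for turn in history]
    return "Conversation so far:\n" + "\n".join(lines)

# Example: state handed from the frontline agent to a returns specialist.
history = [
    {"role": "caller", "text": "I want to return my snowboard."},
    {"role": "agent", "text": "Sure -- let me check the return policy."},
]
brief = summarize_for_handoff(history)
```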

Evals and Guardrails

  • Evals:
    1. Observability: Ensure comprehensive logging and tracing of agent behavior.
    2. Human Labeling: Use human annotators to label data and iterate on prompts.
    3. Transcription-Based Evals: Employ traditional evals based on transcriptions, including rubric-based assessments of function calling performance.
    4. Audio-Based Evals: Use audio models (e.g., GPT-4o audio) to evaluate tone, pacing, and intonation.
    5. Synthetic Conversations: Simulate conversations between agents and customer personas to generate evaluation data.
  • Guardrails:
    • Run guardrails asynchronously: the real-time API generates text faster than the corresponding audio plays out, so checks can run on the text stream before problematic audio reaches the user.
    • Control the debounce period (e.g., running guardrails every 100 characters).
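The debounce idea, re-running a guardrail check each time another N characters of output have streamed in, can be sketched as follows. The class and names are hypothetical; `check` is any callable that judges the accumulated text.

```python
class DebouncedGuardrail:
    """Run `check` over the accumulated text every `every_n_chars`
    new characters, rather than on every streamed token."""

    def __init__(self, check, every_n_chars: int = 100):
        self.check = check
        self.every_n_chars = every_n_chars
        self.text = ""
        self.next_check_at = every_n_chars
        self.results = []

    def feed(self, chunk: str) -> None:
        """Accumulate streamed text; fire the check at each threshold."""
        self.text += chunk
        while len(self.text) >= self.next_check_at:
            self.results.append(self.check(self.text))
            self.next_check_at += self.every_n_chars

# Example: a toy check with a 10-character debounce period.
guard = DebouncedGuardrail(check=lambda t: "forbidden" not in t,
                           every_n_chars=10)
guard.feed("hello " * 4)  # 24 chars streamed -> checks fire at 10 and 20
```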

Learnings from the Field

  • Lemonade (AI insurance): Focused early on evals, guardrails, and feedback mechanisms, even when they were initially unscalable.
  • Tinder: Prioritized customization and brand realism for their RZ chat experience.

Conclusion

The presentation concludes by reiterating the emergence of the multimodal era and the advancements in real-time speech-to-speech technology. It encourages the audience to explore and build with these technologies, emphasizing that the current state of the models is "good enough" to create scalable production applications. A new snapshot of the real-time API was released, making it an opportune time to build.
