Build Hour: GPT-Realtime-2

By OpenAI

Key Concepts

GPT Realtime 2: The latest multimodal voice model from OpenAI, featuring GPT-5 class reasoning, parallel tool calling, and improved multilingual performance.
Voice-to-Voice (V2V): An architecture that eliminates the need for separate speech-to-text (STT) and text-to-speech (TTS) steps, resulting in lower latency and more natural interactions.
Parallel Tool Calling: The model can request several function calls in a single turn, which the application can then run concurrently rather than one after another, enabling complex agentic workflows.
Context Window: Increased to 128k tokens, allowing for longer, more complex sessions without truncation.
Agentic Behavior: The model’s ability to reason, operate UI elements, and manage state across multiple tools.
VAD (Voice Activity Detection): Technology used to determine when a user has stopped speaking; critical for managing turn-taking in noisy environments.
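
As a rough illustration of the VAD concept above, the following is a minimal energy-threshold sketch in Python. It is a toy, not the detector used by OpenAI or Sierra: the function names, the 0.02 threshold, and the three-frame "hangover" are all assumptions chosen for the example. It declares the turn over once speech has been heard and a run of quiet frames follows.

```python
import math
from typing import Iterable, List, Optional


def frame_rms(frame: List[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_end_of_turn(
    frames: Iterable[List[float]],
    threshold: float = 0.02,   # assumed energy level that counts as speech
    hangover_frames: int = 3,  # assumed number of quiet frames that ends a turn
) -> Optional[int]:
    """Return the index of the frame where the turn is declared over, or None."""
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            # Any loud frame counts as speech and resets the silence counter.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Only count silence after the user has actually started speaking.
            silent_run += 1
            if silent_run >= hangover_frames:
                return i
    return None
```

A fixed energy threshold treats traffic or a nearby voice as speech, which is one reason production systems tend to replace this kind of logic with trained VAD models.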

1. Overview of GPT Realtime 2

OpenAI recently released three new models designed to enhance voice-based applications:

Realtime Translate: Supports 70+ input and 13+ output languages with low-latency streaming.
GPT Realtime Whisper: A streaming model with tunable latency (as low as 200ms) supporting 80 input languages.
GPT Realtime 2: The flagship model bringing GPT-5 class reasoning to voice. It features improved prompt adherence, domain-specific vocabulary (e.g., healthcare, technical terms), and controllable expressiveness (e.g., whispering, excitement).

2. Key Features and Technical Improvements

Reasoning & Intelligence: The model now supports "preambles," short spoken phrases (e.g., "Let me check on that") that fill the pause while it reasons, much as a person would.
Context Management: The 128k token window allows for roughly one hour of continuous interaction.
Dynamic Tone Matching: The model can distinguish between multiple speakers and adapt its tone based on instructions.
Developer Control: Developers can disable VAD on a turn-by-turn basis to prevent interruptions during critical disclosures or system messages.
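
The turn-by-turn VAD control described above could look roughly like the following Python sketch, which builds the events an application might send around a disclosure. The event types and field names (`session.update`, `turn_detection`, `server_vad`, `response.create`) are modeled on the Realtime API's session-update shape and are assumptions here; the exact field path can differ by API version and should be checked against the current reference. The helper names are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def turn_detection_update(enabled: bool, silence_ms: int = 500) -> Dict[str, Any]:
    """Build an event that switches automatic turn detection on or off.

    ASSUMPTION: field names follow the Realtime API's `session.update`
    event; verify the exact path for the model and API version in use.
    """
    turn_detection: Optional[Dict[str, Any]] = (
        {"type": "server_vad", "silence_duration_ms": silence_ms}
        if enabled
        else None  # None is assumed to mean "no automatic turn detection"
    )
    return {"type": "session.update", "session": {"turn_detection": turn_detection}}


def disclosure_sequence(disclosure_text: str) -> List[Dict[str, Any]]:
    """Events for one uninterruptible turn: VAD off, speak, VAD back on."""
    return [
        turn_detection_update(enabled=False),
        {
            "type": "response.create",  # assumed event for requesting a response
            "response": {"instructions": f"Say exactly: {disclosure_text}"},
        },
        turn_detection_update(enabled=True),
    ]


def to_wire(events: List[Dict[str, Any]]) -> List[str]:
    """Serialize events to JSON strings, ready to send over a socket."""
    return [json.dumps(event) for event in events]
```

In a real client the final re-enable event would be sent only after the disclosure has finished playing, not immediately after the request.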

3. Real-World Applications & Demos

E-commerce Shopping Assistant: A demo showed an agent operating a UI via tool use. It could search for products, filter by price/rating, check weather forecasts, and manage a shopping cart.
Product Analytics Dashboard: An agent acted as a "data analyst," filtering complex datasets, identifying root causes for performance drops (e.g., a Safari-specific bug), and summarizing findings for engineering teams.
Customer Service (Sierra Case Study): Sierra uses these models to build enterprise-grade agents. They emphasize that for production, the model is just one component of a larger "agent harness" that handles PCI compliance, redaction, and custom VAD models to filter out background noise (e.g., traffic, children).
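
The tool-driven demos above depend on the application running several requested tool calls at once. A minimal Python sketch of that dispatch step follows, using `asyncio.gather`. The tool functions, their return values, and the call dictionary shape (`call_id`, `name`, `arguments`) are hypothetical stand-ins, not the demo's actual code or the API's wire format.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def search_products(query: str) -> Dict[str, Any]:
    """Hypothetical product search; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    return {"query": query, "results": ["tent", "sleeping bag"]}


async def get_weather(city: str) -> Dict[str, Any]:
    """Hypothetical weather lookup."""
    await asyncio.sleep(0.01)
    return {"city": city, "forecast": "rain"}


# Registry mapping tool names to async implementations (names are assumptions).
TOOLS: Dict[str, Callable[..., Awaitable[Dict[str, Any]]]] = {
    "search_products": search_products,
    "get_weather": get_weather,
}


async def run_tool_calls(calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Run every requested tool call concurrently, keeping the request order."""

    async def run_one(call: Dict[str, Any]) -> Dict[str, Any]:
        fn = TOOLS.get(call["name"])
        if fn is None:
            output: Dict[str, Any] = {"error": f"unknown tool: {call['name']}"}
        else:
            try:
                output = await fn(**call["arguments"])
            except Exception as exc:  # report the failure instead of crashing the turn
                output = {"error": str(exc)}
        return {"call_id": call["call_id"], "output": output}

    # gather() returns results in the same order as its arguments.
    return list(await asyncio.gather(*(run_one(c) for c in calls)))
```

Each result carries its `call_id` so the outputs can be matched back to the calls the model made, and one failing tool does not block the others.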

4. Methodologies for Production

The "Agent Harness": Sierra emphasizes that a raw model alone is insufficient for enterprise use. A production layer must include:
- Custom VAD: To handle messy, real-world audio.
- Tracing & Redaction: For security and compliance.
- Simulation Testing: Replaying realistic customer calls to measure task completion rather than just "sounding human."
Handling Long Sessions: If a session exceeds the one-hour limit, developers should start a new session and "rehydrate" it with the context/state from the previous session.
Escalation Patterns: For highly complex tasks, developers can use an "advisory pattern" in which the realtime model handles the conversation but offloads heavy reasoning to a frontier text model (such as GPT-4o or GPT-5) when necessary.
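
The session "rehydration" idea above can be sketched in Python as follows. This is one possible shape under stated assumptions, not a documented procedure: the `SessionState` fields, the five-minute safety margin, and the instruction wording are hypothetical, and only the one-hour figure comes from the summary. The sketch decides when to roll over, then carries a summary, the cart, and the last few turns into the new session.

```python
from dataclasses import dataclass, field
from typing import Dict, List

SESSION_LIMIT_SECONDS = 60 * 60   # one-hour figure from the summary
ROLLOVER_MARGIN_SECONDS = 5 * 60  # assumed safety margin before the limit


@dataclass
class SessionState:
    """State the application tracks itself, outside the model session."""
    started_at: float
    summary: str = ""
    cart: List[str] = field(default_factory=list)
    recent_turns: List[Dict[str, str]] = field(default_factory=list)


def should_roll_over(state: SessionState, now: float) -> bool:
    """True once the session is within the margin of the time limit."""
    return now - state.started_at >= SESSION_LIMIT_SECONDS - ROLLOVER_MARGIN_SECONDS


def rehydration_instructions(state: SessionState, keep_turns: int = 4) -> str:
    """Render carried-over state as instructions for the new session."""
    lines = ["Continue an ongoing conversation. Prior context follows."]
    if state.summary:
        lines.append(f"Summary so far: {state.summary}")
    if state.cart:
        lines.append("Cart: " + ", ".join(state.cart))
    for turn in state.recent_turns[-keep_turns:]:
        lines.append(f"{turn['role']}: {turn['text']}")
    return "\n".join(lines)


def roll_over(state: SessionState, now: float, keep_turns: int = 4) -> SessionState:
    """Create the state for a fresh session, copying what should persist."""
    return SessionState(
        started_at=now,
        summary=state.summary,
        cart=list(state.cart),
        recent_turns=list(state.recent_turns[-keep_turns:]),
    )
```

Keeping this state in the application, rather than relying on the model's context, is what makes the handoff possible: the new session starts with `rehydration_instructions(state)` as its opening instructions.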

5. Notable Quotes

"The challenge is not just can we make the agent sound natural, it's whether we can build, evaluate, constrain, and operate agents that businesses trust to represent them directly with their customers." — Ken Murphy, Sierra
"Voice is pretty unforgiving... a pause of even half a second can feel awkward or broken." — Ken Murphy, Sierra
"Humans often back-channel with 'mhm' and 'aha'... a lot of our voice models are trained to respond to everything, but humans are much better at being selective." — Soham, Sierra

6. Synthesis and Conclusion

The move from cascaded voice stacks (STT -> LLM -> TTS) to native Voice-to-Voice models like GPT Realtime 2 removes the intermediate transcription and synthesis steps, which lowers latency and makes interactions feel more natural. The primary takeaway for developers is that intelligence is now conversational. By leveraging parallel tool calling and a larger context window, developers can build agents that act as "chiefs of staff" or "analysts in the loop." For production-grade applications, however, developers must still build a robust infrastructure layer (custom VAD, state management, and rigorous simulation-based evaluation) to ensure reliability, safety, and brand consistency.
