Open Source AI Voice Is Finally Good!

Chatterbox Turbo: Local, High-Quality Speech Generation

Key Concepts:

Chatterbox Turbo: A new open-weight text-to-speech (TTS) model from Resemble AI, offering high-quality speech generation and zero-shot voice cloning. Turbo itself is English-only; multilingual generation is handled by a sibling model in the same family.
Zero-Shot Voice Cloning: The ability to replicate a voice using only a short audio sample (reference audio).
Paralinguistic Tags: Special tags used to control the tone, emotion, and realism of generated speech (e.g., laughter, emphasis).
CFG (Classifier-Free Guidance) & Exaggeration Weights: Parameters used to fine-tune the style and expressiveness of the generated speech, particularly in the original Chatterbox model.
Perth Watermarking: A technology to identify AI-generated audio from this system.
TTS (Text-to-Speech): The process of converting text into spoken audio.

Introduction & Current Landscape

Recent advances in open-weight language and code generation models have been significant. Speech generation from text has lagged behind, however, with proprietary models from companies like ElevenLabs, Cartesia, and Deepgram maintaining a lead. Resemble AI’s Chatterbox Turbo aims to close this gap, offering comparable quality with the benefit of open weights that can be run locally. The model is released under the permissive MIT license.

Model Variants & Capabilities

Resemble AI released three Chatterbox models:

Chatterbox Turbo (350M parameters): English-only, featuring all advanced capabilities including zero-shot cloning and paralinguistic tag support. Optimized for low latency on GPUs.
Chatterbox Multilingual: Supports audio generation in multiple languages.
Chatterbox (the original model): Includes built-in exaggeration tuning for stylistic control.

These models can run on both CPU and GPU, with GPU providing significantly faster performance. A key feature is the implementation of Perth watermarking, allowing identification of AI-generated audio from this system.
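As a rough illustration of how the watermark might be checked, here is a minimal sketch. It assumes the `perth` and `librosa` packages are installed and that `perth.PerthImplicitWatermarker().get_watermark(...)` returns a score near 0.0 (no watermark) or 1.0 (watermarked); the 0.5 threshold and both helper names are hypothetical choices for this example, not part of the video.

```python
def looks_watermarked(score: float, threshold: float = 0.5) -> bool:
    """Interpret a detector score. The 0.5 cut-off is an assumption."""
    return float(score) >= threshold


def detect_watermark(audio_path: str) -> bool:
    """Return True if the file appears to carry the Perth watermark."""
    # Imported lazily so looks_watermarked() works without these packages.
    import librosa
    import perth

    # sr=None keeps the file's native sample rate instead of resampling.
    audio, sample_rate = librosa.load(audio_path, sr=None)
    watermarker = perth.PerthImplicitWatermarker()
    score = watermarker.get_watermark(audio, sample_rate=sample_rate)
    return looks_watermarked(score)
```

Calling `detect_watermark("generated.wav")` on a file produced by one of the models would be the typical use; the filename is a placeholder.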

Zero-Shot Voice Cloning: Examples & Process

Chatterbox Turbo excels in zero-shot voice cloning. The process involves providing a short reference audio sample (prompt) alongside the text to be synthesized. The model then generates speech in the style of the reference voice.

Example: A reference audio clip of someone saying "10-minute break for a shower" was used to clone the voice, producing the generated speech: "Hi, it's Jerry. I'm calling to sell you nothing. That's right. Nothing. No features. No." The presenter noted how well both the voice and its expressiveness were preserved. Links to further examples are available.

The presenter recommends at least 10 seconds of high-quality reference audio for optimal cloning results.
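The cloning flow and the 10-second recommendation can be sketched as follows. This is a minimal sketch, not the presenter's notebook: it assumes the `chatterbox-tts` package exposes `ChatterboxTurboTTS` in `chatterbox.tts_turbo` with `from_pretrained(device=...)` and `generate(text, audio_prompt_path=...)`, and that the reference clip is an uncompressed WAV file. The function names and the `MIN_REFERENCE_SECONDS` constant are hypothetical.

```python
import wave

MIN_REFERENCE_SECONDS = 10.0  # the presenter's recommended minimum


def clip_duration_seconds(wav_path: str) -> float:
    """Length of an uncompressed PCM WAV file, using only the standard library."""
    with wave.open(wav_path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def clone_voice(text: str, reference_wav: str, out_path: str, device: str = "cpu") -> None:
    """Synthesize `text` in the voice of `reference_wav` and save it to `out_path`."""
    duration = clip_duration_seconds(reference_wav)
    if duration < MIN_REFERENCE_SECONDS:
        raise ValueError(
            f"reference clip is {duration:.1f}s; at least "
            f"{MIN_REFERENCE_SECONDS:.0f}s is recommended"
        )

    # Heavy imports are deferred until the clip has passed the length check.
    import torchaudio
    from chatterbox.tts_turbo import ChatterboxTurboTTS  # assumed module path

    model = ChatterboxTurboTTS.from_pretrained(device=device)
    wav = model.generate(text, audio_prompt_path=reference_wav)
    torchaudio.save(out_path, wav, model.sr)
```

Rejecting short clips outright is a design choice for this sketch; a warning would be a reasonable alternative, since shorter clips may still produce usable results.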

Paralinguistic Tags & Emotional Control

The models utilize paralinguistic tags to control the tone and realism of the generated audio. These tags represent sound effects and emotional cues.

Example: A demo using paralinguistic tags generated the speech: "Oh, that's hilarious. Anyway, we do have a new model in store. It's the Skynet T800 series and it's got basically everything including AI integration with ChatGPT and all that jazz. Would you like me to get some prices for you?" While the effects aren't always prominent, the voice cloning remains effective.
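To make the tag idea concrete, here is a small sketch of how tagged input text might be handled. It assumes tags are written inline in square brackets, such as `[laugh]`; the tag name and its placement in the demo sentence are illustrative guesses, and the supported set should be checked against the model card. Both helper functions are hypothetical.

```python
import re

# Assumed tag syntax: a lowercase name in square brackets, e.g. [laugh].
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def find_tags(text: str) -> list[str]:
    """List the tag names that appear in the text, in order."""
    return TAG_PATTERN.findall(text)


def strip_tags(text: str) -> str:
    """Remove tags (and the space before them), e.g. for a model without tag support."""
    return re.sub(r"\s*\[[a-z_]+\]", "", text).strip()


# Tag placement here is a guess; the video does not show the tagged input.
script = "Oh, that's hilarious [laugh]. Anyway, we do have a new model in store."
print(find_tags(script))
print(strip_tags(script))
```

The tagged `script` string would be passed to the model as ordinary input text, while `strip_tags` gives a clean fallback for models that would otherwise read the tags aloud.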

Setting Up & Running Chatterbox Turbo Locally

The presenter demonstrated setting up Chatterbox Turbo using a Jupyter Notebook. Due to Python version compatibility issues (requiring Python 3.11, while Google Colab defaults to 3.13), the notebook cannot be run directly in Google Colab. However, the code is designed to run on CPU, making it accessible even without a GPU.

Steps:

Install Chatterbox TTS: pip install chatterbox-tts
Provide Hugging Face Token: Import the login functionality and paste your Hugging Face token.
Import & Load Model: Import the desired model class (e.g., ChatterboxTurboTTS for Turbo) and load it, specifying CPU or MPS (on macOS) execution.
Generate Speech: Provide the text and generate the audio output.
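The steps above can be sketched as a short script. This is an approximation rather than the presenter's notebook: it assumes `chatterbox-tts`, `torchaudio`, and `huggingface_hub` are installed under Python 3.11, that `ChatterboxTurboTTS` lives in `chatterbox.tts_turbo`, and that `generate` can be called without a reference clip to get the default voice, as the video describes. The `pick_device` and `synthesize` helpers are hypothetical names.

```python
def pick_device() -> str:
    """Prefer CUDA, then Apple's MPS backend, then CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def synthesize(text: str, out_path: str = "output.wav") -> str:
    """Log in to Hugging Face, load Turbo, and write speech for `text` to a WAV file."""
    # Imported lazily so pick_device() can be used on its own.
    import torchaudio
    from huggingface_hub import login
    from chatterbox.tts_turbo import ChatterboxTurboTTS  # assumed module path

    login()  # prompts for the Hugging Face token
    model = ChatterboxTurboTTS.from_pretrained(device=pick_device())
    wav = model.generate(text)  # no reference clip, so the default voice is used
    torchaudio.save(out_path, wav, model.sr)
    return out_path
```

A call such as `synthesize("Hello from Chatterbox Turbo.")` would then write `output.wav` in the working directory.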

Without a reference audio, the model uses a default voice. The presenter noted difficulty in controlling the default voice selection.

Fine-Grained Control: CFG & Exaggeration Weights

The original Chatterbox model (not Turbo) offers more granular control through CFG and exaggeration weights.

CFG Weight: Controls how closely the output adheres to the model's inherent style; the presenter describes the relationship as inverse.
Exaggeration Weights: Adjust the intensity of stylistic features.

Example: Generating the same text with different exaggeration levels produced varying results, from a mild and natural tone to a more exaggerated and stylized delivery.
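A comparison like this can be scripted as a small sweep. The sketch below assumes the original model is available as `ChatterboxTTS` in `chatterbox.tts` and that its `generate` method accepts `exaggeration` and `cfg_weight` keyword arguments; the specific values, the filename scheme, and both helper names are illustrative choices, not settings taken from the video.

```python
def sweep_plan(exaggerations, cfg_weight=0.5, stem="sample"):
    """Pair each exaggeration level with a CFG weight and an output filename."""
    return [
        {
            "exaggeration": level,
            "cfg_weight": cfg_weight,
            "path": f"{stem}_exag{level:.2f}.wav",
        }
        for level in exaggerations
    ]


def run_sweep(text, exaggerations=(0.25, 0.5, 1.0), device="cpu"):
    """Render the same text once per exaggeration level for side-by-side listening."""
    import torchaudio
    from chatterbox.tts import ChatterboxTTS  # original model, not Turbo

    model = ChatterboxTTS.from_pretrained(device=device)
    for step in sweep_plan(exaggerations):
        wav = model.generate(
            text,
            exaggeration=step["exaggeration"],
            cfg_weight=step["cfg_weight"],
        )
        torchaudio.save(step["path"], wav, model.sr)
```

Holding the CFG weight fixed while varying only the exaggeration keeps the comparison to one variable at a time; the same plan could be repeated with a different `cfg_weight` to hear how the two interact.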

Cloning Your Own Voice: Demonstration

The presenter demonstrated cloning their own voice using a reference audio clip from a previous video. The process involved generating audio from the system, then using that generated audio as a reference for subsequent speech synthesis.

Example: The reference audio was: "Hello. This is my voice that we will use as a reference for cloning." The cloned output was: "Welcome to the future of AI voice synthesis. This voice was cloned using just a few seconds of reference audio. Pretty incredible, isn't it?"

The presenter noted that paralinguistic tags were not correctly applied when using the original Chatterbox model, but functioned as expected with the Turbo version.

Comparison to Gemini Text-to-Speech API

The presenter contrasted Chatterbox Turbo’s reliance on paralinguistic tags with the Gemini Text-to-Speech API, which allows for natural language instructions to control effects and emotions. A link to the Gemini API video was provided for further exploration.

Conclusion & Future Outlook

Chatterbox Turbo represents a significant step forward in open-weight speech generation, offering high-quality output, zero-shot voice cloning, and local execution. While the paralinguistic tag system has limitations, the overall capabilities are impressive. The presenter anticipates continued development in speech technologies, with a growing focus from both major labs and open-source projects. The presenter encouraged viewers to share their opinions on the voice cloning accuracy.

Notable Quote:

“The speech output definitely sounds a lot more natural compared to what we have been hearing from other open-weight models.” – Presenter, summarizing the quality of Chatterbox Turbo’s output.