This is now the best FREE AI text-to-speech! Voice cloning + emotion control + voice design
By AI Search
Qwen3-TTS: Comprehensive Overview & Installation Guide
Key Concepts:
- Qwen3-TTS: A new, free, open-source text-to-speech (TTS) generator developed by Alibaba.
- Voice Cloning: The ability to replicate a voice using a short audio sample (as little as 3 seconds).
- Voice Design: Creating a completely new voice from scratch using a text prompt describing desired characteristics.
- Custom Voice: Using the pre-built voices that ship with Qwen3-TTS and modifying them with instructions.
- VRAM: Video Random Access Memory – the memory on a graphics card, impacting performance.
- ComfyUI: A graphical user interface (GUI) for running open-source AI models, including Qwen3-TTS, offline.
- X-vector: A numerical representation of a voice's characteristics used in cloning.
- Seed: A value used to initialize the random number generator, allowing for reproducible results.
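The seed concept can be illustrated with a minimal sketch using Python's standard `random` module. This is not the model's own sampling code; the function name `sample_variation` is made up for illustration, and it only shows the general principle that the same seed reproduces the same pseudo-random sequence.

```python
import random


def sample_variation(seed: int, n: int = 3) -> list[float]:
    """Draw n pseudo-random numbers from a generator initialised with `seed`."""
    rng = random.Random(seed)  # independent generator, leaves global state alone
    return [rng.random() for _ in range(n)]


first = sample_variation(seed=42)
second = sample_variation(seed=42)
other = sample_variation(seed=43)

print(first == second)  # same seed gives the same sequence
print(first == other)   # a different seed gives a different sequence
```

In a TTS workflow the same idea applies: keeping the seed fixed while changing one setting makes it easier to tell which change affected the output.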
1. Introduction to Qwen3-TTS
Qwen3-TTS is a text-to-speech generator released by Alibaba, notable for its power, flexibility, and open-source availability. It handles voice cloning, voice design, and nuanced emotions across several languages. It is designed to be efficient: it runs on low-VRAM GPUs and generates audio quickly. In the benchmarks shown in the video, it outperforms models such as ElevenLabs, MiniMax, GPT-4, and Gemini Pro, particularly on multilingual tests.
2. Core Capabilities & Examples
Qwen3-TTS offers three primary functions:
- Voice Cloning: The system can replicate a voice with remarkable accuracy using only a few seconds of audio. An example provided uses an 8-second clip of Steve Jobs, which is then used to generate the sentence, “An ideal harmonious society. Humanism is a lighthouse on this way to guide us in case we are getting lost.” The resulting speech is virtually indistinguishable from the original speaker. It can even clone voices speaking different languages, demonstrated by cloning Donald Trump’s voice and having it speak Japanese.
- Voice Design: Users can create entirely new voices by providing a descriptive text prompt. Examples include:
- “Sarcastic, assertive teenage girl with crisp enunciation, controlled volume” resulting in the line, “blah blah blah. We're all very fascinated, Whitey, but we'd like to get paid.”
- “Middle-aged adult, authoritative, confident, and performative” generating, “Older gentleman, 110, maybe 111 years old. Sort of a surly Elvis thing happening with him. He smiles like this, seen him around.”
- Custom Voice: This allows modification of pre-built voices with specific instructions. For instance, the voice "Ryan" can be instructed to sound "very sad and tearful" when saying, “She said she would be here by noon.” The system accurately reflects the requested emotion.
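The Key Concepts describe an X-vector as a numerical representation of a voice. A common way to compare two such representations is cosine similarity, sketched below. The three vectors are invented four-dimensional toy values (real speaker embeddings have far more dimensions), and nothing here reflects how the model itself computes or uses its embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy embeddings, invented purely for illustration.
reference = [0.12, -0.48, 0.33, 0.80]   # hypothetical original speaker
cloned = [0.10, -0.50, 0.30, 0.82]      # hypothetical cloned output
unrelated = [-0.70, 0.20, 0.65, -0.10]  # hypothetical different speaker

print(cosine_similarity(reference, cloned))     # close to 1: similar voices
print(cosine_similarity(reference, unrelated))  # much lower: different voices
```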
3. Emotional Expression & Control
Qwen3-TTS demonstrates a strong ability to convey emotion and to control the pace and tone of speech. The system can be prompted to:
- Transition between laughter and conversational speech (“Good one. Okay, fine. I'm just going to leave this sock monkey here. Goodbye.”)
- Shift abruptly from neutral acceptance to intense resentment and anger (“Okay. Yeah, I resent you. I love you. I respect you. But you know what? You blew it. and thanks to you.”)
- Vary tone throughout a transcript, starting measured and escalating to forceful.
4. Multilingual Support & Pre-built Voices
Qwen3-TTS supports multiple languages. Currently, it offers nine pre-built voices: three for Chinese, two for English, one for Japanese, one for Korean, and two for different Chinese dialects. A demonstration shows the system accurately pronouncing phrases in Japanese, Spanish, French, Hindi, and German.
5. Advanced Features: Dual Voice & Equation Reading
- Dual Voice: The system can simulate a conversation between two distinct voices defined by separate prompts (e.g., Lucas and Mia discussing calculus homework).
- Complex Transcript Handling: Qwen3-TTS can even attempt to read complex transcripts, such as mathematical equations (the demo reads out the quadratic formula, x = (-b ± √(b² - 4ac)) / 2a), though the results may be imperfect.
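One way to prepare a two-speaker script for a dual-voice setup is to split it into per-speaker segments first. The sketch below assumes a `Speaker: text` line format, which is an assumption and not necessarily what the ComfyUI workflow expects. The dialogue lines and the voice prompts are invented for illustration; only the speaker names Lucas and Mia come from the video.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    text: str


def split_dialogue(script: str) -> list[Segment]:
    """Split 'Speaker: text' lines into ordered segments, skipping blank lines."""
    segments = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        speaker, sep, text = line.partition(":")
        if not sep or not text.strip():
            raise ValueError(f"expected 'Speaker: text', got {line!r}")
        segments.append(Segment(speaker.strip(), text.strip()))
    return segments


# Invented example dialogue and voice descriptions.
script = """
Lucas: Did you finish the calculus homework?
Mia: Almost. The last integral is giving me trouble.
"""
voice_prompts = {
    "Lucas": "young male, relaxed, slightly tired",
    "Mia": "young female, upbeat, quick pace",
}

for seg in split_dialogue(script):
    print(f"[{voice_prompts[seg.speaker]}] {seg.text}")
```

Each segment could then be sent to the generator with its own voice prompt and the resulting clips joined in order.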
6. Technical Specifications & Performance
- Model Sizes: Two variants are available: a 1.7 billion parameter model (higher quality) and a 0.6 billion parameter model (faster performance).
- VRAM Requirements: The 0.6 billion parameter model requires less than 4 GB of VRAM, making it accessible on consumer-grade GPUs. The 1.7 billion parameter model is also relatively lightweight, under 4 GB.
- Benchmark Results: In the benchmarks shown in the video, Qwen3-TTS outperforms other leading TTS models (ElevenLabs, MiniMax, GPT-4, Gemini Pro), particularly in multilingual tasks.
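A rough back-of-the-envelope check of the memory figures above: the weights alone take roughly parameter count times bytes per parameter. The sketch assumes 16-bit weights (2 bytes per parameter), which the video does not state, and it ignores activations, caches, and framework overhead, so it is a lower bound and not a measured requirement.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Estimate memory for model weights alone, in GB (10^9 bytes).

    bytes_per_param=2 assumes 16-bit weights; this is an assumption,
    not a figure taken from the video.
    """
    return params_billions * bytes_per_param


for size in (0.6, 1.7):
    print(f"{size}B parameters: about {weight_memory_gb(size):.1f} GB of weights")
```

Under that assumption both variants stay below the 4 GB figure quoted above, although actual usage during generation will be somewhat higher than the weights alone.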
7. Installation & Setup with ComfyUI
The video provides a step-by-step guide to installing and running Qwen3-TTS locally using ComfyUI:
- Clone the Qwen3-TTS repository from GitHub using ComfyUI’s custom nodes feature.
- Install required dependencies using a Python command (provided in the video description/pinned comment).
- Update ComfyUI to the latest version.
- Load the Qwen3-TTS workflow from the ComfyUI templates.
- Configure the workflow:
- Voice Cloning: Upload a reference audio clip and optionally provide the transcript. Set "X vector only" to true if no transcript is provided.
- Custom Voice: Select a pre-built voice and adjust parameters.
- Voice Design: Enter a descriptive prompt for the desired voice.
- Run the workflow to generate audio. The first run will automatically download necessary models.
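The voice-cloning rule in the configuration step (use "X vector only" when no transcript is supplied) can be written down as a small sketch. The class and field names below are hypothetical and are not the actual ComfyUI node parameters; the code only encodes the decision described in the steps above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloneSettings:
    """Hypothetical container for voice-cloning inputs (not the real node API)."""
    reference_audio: str
    transcript: Optional[str] = None

    @property
    def x_vector_only(self) -> bool:
        # True when no usable transcript of the reference clip is provided.
        return not (self.transcript and self.transcript.strip())


with_text = CloneSettings("reference.wav", transcript="She said she would be here by noon.")
without_text = CloneSettings("reference.wav")

print(with_text.x_vector_only)     # False: a transcript is available
print(without_text.x_vector_only)  # True: fall back to the X-vector alone
```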
8. Abacus.AI ChatLLM Integration (Sponsored Segment)
The video briefly promotes ChatLLM by Abacus.AI, an all-in-one platform for accessing various AI models (text, image, video) and features such as DeepAgent for automated tasks.
9. Notable Quote
“This is definitely one of the best text-to-speech generators out there right now.” – The video creator, summarizing the capabilities of Qwen3-TTS.
10. Conclusion
Qwen3-TTS represents a significant advancement in text-to-speech technology. Its open-source availability, combined with its voice cloning, voice design, and emotional expression capabilities, makes it a compelling option for developers and users alike. The relatively low hardware requirements and straightforward installation process (especially with ComfyUI) make it more accessible still. Its handling of multiple languages and complex transcripts strengthens its position as a leading TTS option.