Qwen-3 Is Here — The Llama-4 We’ve Been Waiting For!
By Prompt Engineering
Qwen 3 Model Release: Summary
Key Concepts:
- Qwen 3: New model family from the Qwen team, featuring a hybrid thinking mode.
- Hybrid Architecture: Ability to enable or disable thinking/reasoning on demand.
- Four-Stage Learning: Innovative post-training process enabling hybrid thinking.
- Mixture of Experts (MoE): Model architecture using multiple specialized sub-networks.
- Dense Models: Traditional model architecture without specialized sub-networks.
- MCP: Model Context Protocol, an open standard for connecting models to external tools (tool calling).
- Context Window: The amount of text a model can consider at once.
- vLLM/SGLang: Inference engines for deploying models in production environments.
- Chain of Thought (CoT): Step-by-step reasoning process used by the model.
- RL: Reinforcement Learning, used for fine-tuning and alignment.
Overview of Qwen 3 Models
The Qwen team has released Qwen 3, a highly anticipated model family that includes both Mixture of Experts (MoE) and dense models. There are eight models in total: two MoE models and six dense models, ranging in size from 0.6 billion to 235 billion parameters. A key innovation is the hybrid architecture, which lets users enable or disable thinking/reasoning on demand. The models demonstrate strong performance in coding and agentic tasks, with native support for MCP tool calling.
Benchmarks and Performance
While users are encouraged to conduct their own tests, initial benchmarks suggest that the larger 235 billion parameter MoE model performs better than OpenAI's o1 and is comparable to Gemini 2.5 Pro on some benchmarks, despite having only 22 billion active parameters. The 32 billion parameter dense model is also competitive with o1 and outperforms DeepSeek R1 on several key benchmarks. The smaller MoE model (30 billion parameters with 3 billion active) outperforms the previous-generation QwQ-32B reasoning model. The models and weights are available on Hugging Face, and a chat interface is available at chat.qwen.ai.
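For anyone who wants to pull the released weights locally rather than use the chat interface, here is a minimal sketch with `huggingface_hub` (the repo id assumes the `Qwen/Qwen3-*` naming used on the Hub and can be swapped for any of the eight checkpoints):

```python
# Sketch: download one of the released Qwen 3 checkpoints from Hugging Face.
from huggingface_hub import snapshot_download

# Repo id is an assumption based on the Qwen/Qwen3-* naming on the Hub.
local_path = snapshot_download("Qwen/Qwen3-0.6B")
print(f"Weights downloaded to: {local_path}")
```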
Model Details and Architecture
All eight models are released under the Apache 2.0 license. The smaller dense models have a context window of 32,000 tokens, while the larger dense models and both MoE models have a context window of 128,000 tokens. The architecture is based on DeepSeek V2. The models were pre-trained at 32,000 tokens and extended to 128,000 tokens through post-training. For production deployments, vLLM or SGLang are recommended over tools like Ollama or LM Studio due to higher throughput.
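As a rough sketch of what a higher-throughput deployment looks like, here is a minimal offline-inference example with vLLM, assuming a vLLM build that already supports Qwen 3 (the checkpoint id is illustrative and can be swapped for any of the released models):

```python
# Minimal vLLM sketch for running a Qwen 3 checkpoint locally (offline batch inference).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-30B-A3B")  # any Qwen 3 checkpoint from the Hub
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

# vLLM batches prompts internally, which is where the throughput advantage comes from.
outputs = llm.generate(["Summarize the idea behind mixture-of-experts routing."], params)
print(outputs[0].outputs[0].text)
```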
Hybrid Thinking Mode
The hybrid thinking mode is a key feature, allowing users to enable or disable thinking on demand. In "thinking mode," the model uses chain-of-thought reasoning, which suits complex tasks. In "non-thinking mode," the model gives quick responses to simpler questions. This is controlled by a single flag. Enabling thinking improves performance on complex problems, as demonstrated on the AIME 24 and 25 benchmarks. However, the effectiveness of longer thinking may depend on the nature of the questions.
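A minimal sketch of how that flag is exposed in practice through the chat template in Transformers; the `enable_thinking` keyword follows Qwen 3's published usage examples, and the model id here is just illustrative:

```python
# Sketch: toggling Qwen 3's thinking mode via the chat-template flag.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "What is 17 * 24?"}]

# Thinking enabled: the rendered prompt allows a <think>...</think> reasoning block.
prompt_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Thinking disabled: the model is steered to answer directly, without the reasoning block.
prompt_direct = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```

The same rendered prompts can then be passed to any backend (Transformers, vLLM, SGLang) for generation.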
Capabilities and Applications
The models are multilingual, supporting 119 languages, and they excel in coding and agentic capabilities, with native support for MCP. A demonstration showed the model extracting markdown content from a GitHub page and creating a bar chart, showcasing its ability to use tools sequentially within a chain of thought.
Training Data and Process
The models were trained on 36 trillion tokens across 119 languages and dialects, using high-quality synthetic data generated by previous-generation Qwen models (Qwen 2.5). The pre-training process consisted of three stages:
- Pre-training on 30 trillion tokens with a 4,000 token context window.
- Improving the dataset with more knowledge-intensive tasks (STEM, coding, reasoning) and pre-training for an additional 5 trillion tokens.
- Using high-quality long context data to extend the context window to 32,000 tokens.
The post-training process consisted of four stages:
- Fine-tuning on long chain-of-thought data, focusing on mathematics, coding, logical reasoning, and STEM problems.
- Scaling up computational resources for RL to enhance exploration and exploitation capabilities.
- Integrating non-thinking capabilities by fine-tuning on a combination of long chain-of-thought data and instruction-tuning data.
- Applying general RL across 20+ domains to strengthen capabilities and correct undesired behavior.
After this process, the MoE models were distilled into the smaller dense models.
Usage and Integration
The models can be used within the Transformers package, with a flag to enable or disable thinking. Support for MCP is included, allowing easy integration of tools by providing configuration files. Several companies already support Qwen models out of the box.
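As one hedged sketch of what configuration-driven MCP integration can look like, the example below uses Qwen-Agent pointed at a locally served Qwen 3 model; the server entries, endpoint, and model name are illustrative assumptions rather than details from the summary above:

```python
# Hedged sketch: wiring MCP servers to a Qwen 3 model through Qwen-Agent.
# The MCP server entries and the local OpenAI-compatible endpoint are assumptions.
from qwen_agent.agents import Assistant

llm_cfg = {
    "model": "Qwen3-30B-A3B",
    "model_server": "http://localhost:8000/v1",  # e.g. a vLLM or SGLang endpoint
    "api_key": "EMPTY",
}

tools = [
    {
        # MCP servers are declared as configuration, not hand-written tool code.
        "mcpServers": {
            "fetch": {"command": "uvx", "args": ["mcp-server-fetch"]},
        }
    },
    "code_interpreter",  # built-in Qwen-Agent tool
]

bot = Assistant(llm=llm_cfg, function_list=tools)
messages = [{"role": "user", "content": "Fetch https://example.com and summarize it."}]

final_responses = None
for final_responses in bot.run(messages=messages):  # streams intermediate tool calls
    pass
print(final_responses[-1]["content"])
```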
Conclusion
Qwen 3 represents a significant advancement in open-weight models, particularly with its hybrid thinking mode and strong performance in coding and agentic tasks. The release sets a high bar for competitors like DeepSeek and Llama. The innovative four-stage post-training process and the use of high-quality synthetic data are key factors in the model's success.