AI thought-to-text, Qwen 3.5, Lyria 3, realtime videos, 4D worlds, realtime TTS: AI NEWS
By AI Search
AI Weekly Update: Key Developments & Breakthroughs
Key Concepts:
- Multimodal Models: AI models capable of processing and understanding multiple data types (text, image, video, audio).
- Large Language Models (LLMs): AI models trained on massive text datasets, enabling natural language processing and generation.
- Real-time Video Generation: Generating video content with minimal latency, allowing for interactive experiences.
- Brain-Computer Interface (BCI): Technology enabling communication between the brain and external devices.
- Parameter Count: A measure of the size and complexity of an AI model; higher parameter counts generally indicate greater capacity.
- Token: A unit of text used by LLMs; context window refers to the maximum number of tokens a model can process at once.
- Open-Source vs. Closed-Source: Refers to the availability of the model's code and weights for public use and modification.
- 4D Scene Generation: Creating dynamic 3D scenes that change over time (time being the fourth dimension) from text prompts.
- Vector Graphics: Representing images using mathematical equations, allowing for scalable and editable visuals.
1. Alibaba’s Qwen 3.5: A Powerful Multimodal Model
Alibaba has released Qwen 3.5, a multimodal model with 397 billion parameters, of which only 17 billion are active during use for efficiency. It excels in reasoning, coding, agentic abilities, and multimodal understanding. A key feature is its 1 million token context window, one of the largest in the industry, which allows for prompts containing over 700,000 words, a medium-sized codebase, or an hour of video. Benchmarks show performance comparable to leading closed-source models like GPT, Claude, and Gemini in areas like instruction following, graduate-level science, agentic tasks, document recognition, and video reasoning. Qwen 3.5 can process images and videos, enabling tasks like video question answering and creating PowerPoint presentations from video content. It can also generate complex 3D racing games and front-end website designs. Integration with OpenClaw allows prompting via platforms like WhatsApp and Telegram. The model demonstrates strong spatial awareness, accurately interpreting complex visual scenes (e.g., determining the position of objects relative to a moving vehicle), and it can solve Sudoku puzzles. Qwen 3.5 is open-source (87GB in size, requiring substantial computing resources) and accessible online via the Qwen Chat platform. (“It’s really good at instruction following, also graduate level science questions as well as agentic abilities.”)
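The figures quoted above can be sanity-checked with a little arithmetic. The sketch below only restates the numbers from the summary (397 billion total parameters, 17 billion active, a 1 million token window, roughly 700,000 words); the function and variable names are made up for illustration, and the words-per-token ratio is whatever those two figures imply, not a measured property of the tokenizer.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters that are active during use."""
    return active_params / total_params


def words_per_token(words: int, tokens: int) -> float:
    """Average number of words represented by one token."""
    return words / tokens


# Figures as quoted in the summary above.
TOTAL_PARAMS = 397e9
ACTIVE_PARAMS = 17e9
CONTEXT_TOKENS = 1_000_000
WORDS_IN_PROMPT = 700_000

frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
ratio = words_per_token(WORDS_IN_PROMPT, CONTEXT_TOKENS)

print(f"Active share of parameters: {frac:.1%}")
print(f"Implied words per token: {ratio:.2f}")
```

In other words, under 5% of the parameters are in use at any moment, and the quoted prompt size corresponds to roughly 0.7 words per token.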
2. AnchorWeave: Interactive 3D World Generation
AnchorWeave, an open-source alternative to Genie 3, generates interactive 3D videos that users navigate with the WASD or arrow keys. It currently produces 81 frames (a few seconds) per iteration, but iterations can be chained to produce longer videos. A key strength is maintaining consistency and realism throughout the generated world, preserving scene memory even when the viewpoint changes. It supports both first-person and third-person perspectives. AnchorWeave uses "local geometric memories" to break scenes down into smaller parts, ensuring coherence, and a "multi-anchor weaving controller" to combine these parts smoothly. Compared to other 3D scene generators, AnchorWeave shows better coherence and detail, closely resembling ground truth imagery. The code is available on GitHub; the current release is based on CogVideoX, and a Wan-based example is planned for a future release.
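To get a feel for what "81 frames per iteration" means for longer clips, here is a rough sketch. The frame rate is not stated in the summary, so the 16 fps used below is purely an assumption, as are the function names; the sketch also ignores any frames that consecutive iterations might share.

```python
import math

FRAMES_PER_ITERATION = 81  # figure quoted in the summary


def seconds_per_iteration(fps: float) -> float:
    """Clip length produced by one iteration at the given frame rate."""
    return FRAMES_PER_ITERATION / fps


def iterations_needed(target_seconds: float, fps: float) -> int:
    """Iterations required to cover a target duration, assuming no overlap."""
    total_frames = math.ceil(target_seconds * fps)
    return math.ceil(total_frames / FRAMES_PER_ITERATION)


ASSUMED_FPS = 16  # hypothetical value, not taken from the source

print(f"One iteration: {seconds_per_iteration(ASSUMED_FPS):.2f} s")
print(f"Iterations for 30 s: {iterations_needed(30, ASSUMED_FPS)}")
```

At the assumed frame rate, one iteration covers about five seconds, which is consistent with the "few seconds" mentioned above.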
3. Tiny Aya: Lightweight Multilingual AI
Cohere Labs’ Tiny Aya is a remarkably small, open-weight model (3.35 billion parameters) capable of translation and response generation in over 70 languages. The Tiny Aya Global model supports 67 languages across five regions. Benchmarks show it outperforms similarly sized models in multilingual generation quality and efficiency. Its small size allows it to run on consumer hardware and even mobile phones. It can be tested on Hugging Face, where it responds quickly. (“This is even small enough to run on consumer hardware or even your mobile phone.”)
4. KittenTTS: Ultra-Lightweight Text-to-Speech
KittenTTS is an exceptionally lightweight text-to-speech generator, with models ranging from 14 million to 80 million parameters and a size under 25MB. This allows it to run on CPUs and mobile phones in real time. Three models are available: 80M, 40M, and 14M. A free online demo is available on Hugging Face, offering various voice options and speed controls. The models and instructions for local installation are available on GitHub. (“This essentially allows you to even run this on just a normal CPU or even just your mobile phone.”)
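As a rough guide to how parameter counts translate into file sizes, the sketch below multiplies each model's parameter count by a storage precision. The precisions (1, 2, and 4 bytes per parameter) are assumptions chosen for illustration; the summary does not say how KittenTTS stores its weights, and real files include some overhead beyond the raw weights.

```python
def weights_size_mb(num_params: int, bytes_per_param: float) -> float:
    """Approximate size of the raw weights in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bytes_per_param / 1_000_000


# Parameter counts as quoted in the summary.
MODEL_PARAMS = {"14M": 14_000_000, "40M": 40_000_000, "80M": 80_000_000}

# Hypothetical storage precisions, in bytes per parameter.
ASSUMED_PRECISIONS = {"int8": 1, "fp16": 2, "fp32": 4}

for name, params in MODEL_PARAMS.items():
    sizes = ", ".join(
        f"{precision}: {weights_size_mb(params, nbytes):.0f} MB"
        for precision, nbytes in ASSUMED_PRECISIONS.items()
    )
    print(f"{name} -> {sizes}")
```

Under these assumptions, the 14M model stays below 25MB at one byte per parameter but not at two, which gives a sense of how aggressively small such a model has to be.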
5. Taalas HC1: Revolutionary AI Chip Performance
Taalas has unveiled the HC1 chip, which achieves 17,000 tokens per second, 40 times faster than Nvidia’s B200. It is also 20 times cheaper to build and consumes 10 times less power. The chip hardcodes the Llama 3.1 model directly onto the hardware, enabling real-time response speeds. This approach merges memory and computation, reducing costs and increasing efficiency. However, the chip is currently model-specific (Llama 3.1) and not compatible with other models. (“You don't even need to wait for the sentence to complete.”)
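The throughput claim is easier to appreciate as response latency. The sketch below derives the baseline speed implied by the "40 times faster" figure and compares how long a response would take at each speed; the 500-token response length is a hypothetical example, and the names are illustrative only.

```python
HC1_TOKENS_PER_SEC = 17_000  # figure quoted in the summary
CLAIMED_SPEEDUP = 40         # "40 times faster" than the B200

# Baseline throughput implied by the two quoted figures.
implied_baseline = HC1_TOKENS_PER_SEC / CLAIMED_SPEEDUP


def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate a response of the given length at a steady rate."""
    return num_tokens / tokens_per_sec


RESPONSE_TOKENS = 500  # hypothetical response length

print(f"Implied baseline: {implied_baseline:.0f} tokens/s")
print(f"HC1: {generation_seconds(RESPONSE_TOKENS, HC1_TOKENS_PER_SEC) * 1000:.0f} ms")
print(f"Baseline: {generation_seconds(RESPONSE_TOKENS, implied_baseline):.2f} s")
```

At these rates a 500-token answer arrives in about 29 milliseconds instead of over a second, which is why the output appears all at once.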
6. Unitree G1: Advanced Humanoid Robot Demonstrations
The Unitree G1 robot showcased impressive acrobatic abilities during the Lunar New Year Spring Festival Gala, including jumps up to 3 meters with flips, single-leg flips, wall running, and intentional “drunk” recovery. These feats require explosive actuator torque, precise balance, and rapid force absorption. Demonstrations also included synchronized nunchuck use and coordinated swarm movements controlled by a centralized system. (“These robots are able to jump like two to 3 m into the air, executing full flips or even multiple sequential flips.”)
7. Louv: High-Resolution Video Generation
Louv generates ultra-high-resolution videos (up to 4K) from text prompts, surpassing other open-source alternatives like LTX-2 and Wan 2.2 in detail. The code is expected to be open-sourced soon. (“Look how crisp and clear everything is. You can even see the ripples of the water if you look closely.”)
8. Zuna Thought to Text: Brain-Computer Interface Advancement
Zuna Thought to Text is a 380 million parameter BCI foundation model for EEG (brainwave) data. The long-term goal is direct thought-to-text conversion, but the model currently focuses on denoising EEG signals, reconstructing missing data, and enhancing signal resolution. The model weights (under 2GB) are available on GitHub. (“This is just the start and the long-term vision for this is to eventually of course just take this raw brainwave data and convert it into text on what the person is thinking.”)
9. AudioX: Unified Audio Generation Model
AudioX is a unified model capable of generating audio and music from text, images, and videos. It can create sound effects, music in various styles, and even repair or extend existing audio clips. It outperforms competitors in versatility and performance. The model (under 6GB) is available on GitHub. (“This is truly a unified audio model.”)
10. VetoPix: Vector-Based Image Editing
VetoPix converts images into vector shapes, allowing for precise editing of individual elements. Users can modify shapes and colors and add new elements. While powerful, it may be overkill compared to text-prompted image editors like Nano Banana Pro and Qwen Image Edit unless precise shape control is required. The code is coming soon.
11. Seed 2: ByteDance’s Advanced LLM
ByteDance released Seed 2, a large language model excelling in multimodal understanding, reasoning, and agentic capabilities. Benchmarks show it is competitive with Claude, GPT 5.2, and Gemini 3. It can autonomously operate CAD platforms and perform complex tasks. Access is currently through the Volcano Engine API. (“It’s not even close.”)
12. Google’s Gemini 3.1 Pro & Lyria 3: Enhanced Capabilities & Free Music Generation
Google released Gemini 3.1 Pro, an incremental upgrade that tops leaderboards and offers competitive pricing. It also launched Lyria 3, a free music generator that creates tracks from text prompts or images in a range of styles and languages. (“This can generate audio and music from a ton of different inputs.”)
Conclusion:
This week brought significant advances across the AI landscape, from powerful multimodal models like Alibaba’s Qwen 3.5 and ByteDance’s Seed 2 to innovative technologies like real-time video generation (Monarch RT) and brain-computer interfaces (Zuna Thought to Text). The trend towards smaller, more efficient models (Tiny Aya, KittenTTS) expands accessibility, while breakthroughs in hardware (Taalas HC1) promise to accelerate AI performance. The release of free tools like Google’s Lyria 3 broadens access to creative AI capabilities. The rapid pace of development underscores the transformative potential of AI across diverse applications.