AI co-scientist, AI for DNA, AI NPCs, open-source robots, new Qwen, new video editors: AI NEWS

By AI Search

Key Concepts

Multimodal Models: AI systems capable of processing and generating multiple types of data (text, image, video, audio).
Pixel Space vs. Latent Space: Generating images/audio directly in pixel/waveform space (higher quality, more compute) versus compressed latent space (more efficient).
Agentic AI: Systems designed for multi-step reasoning, planning, and autonomous task execution.
World Models: AI that simulates physical environments or game worlds, allowing for interactive control.
Surface Light Field Tokenization: The 3D reconstruction method behind LTO, which captures view-dependent appearance (reflections, lighting).
In-Context Learning: The ability of a model to learn or adapt to new tasks without requiring full retraining.
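To make the pixel-space versus latent-space trade-off concrete, the sketch below counts how many values a model must generate for one 4K image in each setting. The 8x spatial downsampling and 4 latent channels are assumptions typical of common image VAEs, not figures from the video, and the helper names are hypothetical.

```python
# Rough size comparison between pixel space and a compressed latent space.
# ASSUMPTION: 8x spatial downsampling and 4 latent channels (common VAE
# settings); the video does not state the actual values for any model.

def element_count(width, height, channels):
    """Number of scalar values in a width x height x channels tensor."""
    return width * height * channels

def latent_shape(width, height, downsample=8, latent_channels=4):
    """Shape of the latent grid for an image under the assumed compression."""
    return width // downsample, height // downsample, latent_channels

# One 4K RGB frame in pixel space.
pixel_elements = element_count(3840, 2160, 3)

# The same frame in the assumed latent space.
lat_w, lat_h, lat_c = latent_shape(3840, 2160)
latent_elements = element_count(lat_w, lat_h, lat_c)

ratio = pixel_elements / latent_elements
print(f"pixel space:  {pixel_elements:,} values")
print(f"latent space: {latent_elements:,} values")
print(f"pixel space is {ratio:.0f}x larger")
```

Under these assumptions a pixel-space model handles roughly 48 times more values per image, which is why pixel-space generation is described as costing more compute.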

1. Multimodal & Video Generation

Lance (ByteDance): A 3-billion-parameter unified model for text-to-video, video editing, and visual understanding. It supports sequential editing (e.g., changing hair, then background, then motion) and can solve visual mazes. It requires 40 GB of VRAM.
Cog Omni Control: Acts as a "ControlNet for video," allowing users to guide video generation using rough sketches, pose skeletons, or line art combined with reference images.
Flash GRPO: An alignment technique for video models that optimizes human preference training. It uses "iso-temporal grouping" and "temporal gradient rectification" to improve realism and motion accuracy while reducing training time.
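Flash GRPO builds on GRPO (Group Relative Policy Optimization), whose core step is scoring each sample relative to the other samples in its group. The sketch below shows only that generic group-normalized advantage; the "iso-temporal grouping" and "temporal gradient rectification" steps are not described in enough detail here to reproduce, so they are not shown. The function name and reward values are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Generic GRPO-style advantage: normalize each reward by the
    mean and standard deviation of its own group.

    This is NOT the Flash GRPO method itself, only the group-relative
    baseline idea that GRPO-family methods share.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical preference scores for four videos sampled from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.2f}")
```

Samples that score above their group's mean get a positive advantage and are reinforced; those below get a negative one, so no separate value network is needed to provide a baseline.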

2. 3D & World Modeling

LTO (Apple): Uses Surface Light Field Tokenization to create 3D models that preserve view-dependent details like reflections and surface textures, outperforming models like Trellis.
Reactive GWM: A "Reactive Game World Model" where NPCs are steered via high-level strategies (e.g., offense/defense) injected through cross-attention, allowing for controllable game simulations.
Pano World: Generates consistent 3D panoramic house tours from floor plans. It uses a 3D shell to maintain spatial consistency across different rooms and viewpoints.
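As an illustration of how a conditioning signal can be injected through cross-attention, as described for Reactive GWM above, here is a minimal single-head sketch in plain Python. It omits the learned projection matrices a real model would use, and the strategy embeddings, value vectors, and NPC queries are all hypothetical toy numbers, not anything taken from the model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query (from the world
    state) attends over keys/values that come from the conditioning."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) * scale for k in keys])
        out = [sum(wi * v[d] for wi, v in zip(w, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# HYPOTHETICAL strategy embeddings used as keys.
strategy_tokens = {"offense": [1.0, 0.0], "defense": [0.0, 1.0]}
keys = list(strategy_tokens.values())
# HYPOTHETICAL value vectors carrying each strategy's signal.
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# HYPOTHETICAL NPC state queries: the first aligns with offense,
# the second with defense.
npc_queries = [[4.0, 0.0], [0.0, 4.0]]

outputs, weights = cross_attention(npc_queries, keys, values)
for name, w in zip(["npc_a", "npc_b"], weights):
    print(name, [round(x, 3) for x in w])
```

Each NPC's output is a weighted mix of the strategy value vectors, so changing the strategy tokens changes what the world model conditions on without retraining the rest of the network.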

3. Image Generation

L2P: A diffusion model that removes the VAE and latent space, generating images directly in pixel space. This allows for 4K/8K resolution and finer detail, and it reportedly outperforms latent-based models like Qwen and Z-Image Turbo.

4. Specialized AI Models

Carbon (DNA Model): A foundation model for biology that processes up to 400,000 DNA base pairs. It predicts genetic sequences and protein structures, and is claimed to be 275x faster than Evo 2.
Mega ASR: A speech recognition tool trained on 2.6 million samples specifically for "messy" audio (noise, echo, reverb). It shows a 30% performance gain over leading models in difficult acoustic environments.
HYMT2 (Tencent): A multilingual translation family (up to 30B parameters) designed for instruction following. It excels at preserving formatting (JSON, subtitles, code) across 33 languages.
Marlin 2B: A lightweight video-language model (based on Qwen 3.5) that extracts structured data (what happened and when) from videos, performing on par with much larger models like Gemini 2.5 Flash.
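The formatting-preservation claim in the HYMT2 entry can be checked mechanically for JSON: after translation, keys, nesting, and non-string values should be unchanged, and only string values should differ. The sketch below is one hypothetical way to test that using the standard library; it is not part of HYMT2, and the sample payloads are invented for illustration.

```python
import json

def same_structure(a, b):
    """True if two parsed JSON values have identical keys, nesting, and
    non-string values. String values may differ (they are the translation)."""
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            same_structure(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            same_structure(x, y) for x, y in zip(a, b))
    if isinstance(a, str) and isinstance(b, str):
        return True
    # Numbers, booleans, and null must match exactly, including type.
    return type(a) is type(b) and a == b

# HYPOTHETICAL source payload and two candidate translations.
source = '{"title": "Hello", "tags": ["new", "sale"], "price": 10}'
good = '{"title": "Bonjour", "tags": ["nouveau", "solde"], "price": 10}'
bad = '{"titre": "Bonjour", "tags": ["nouveau", "solde"], "price": 10}'

good_ok = same_structure(json.loads(source), json.loads(good))
bad_ok = same_structure(json.loads(source), json.loads(bad))
print("good translation keeps structure:", good_ok)
print("bad translation keeps structure:", bad_ok)
```

The second candidate fails because the key itself was translated, which is the kind of error a format-preserving translator is meant to avoid.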

5. Robotics & Avatars

LongCat Video Avatar 1.5 (Meituan): Generates expressive talking avatars from a single reference image and audio. It supports multi-person interaction and various art styles.
Robot Plus+: A heavy-duty, magnetic wall-climbing robot for industrial maintenance (welding, grinding, painting) on ship hulls and chemical tanks, operated via teleoperation.
Hugging Face Humanoid: An open-source, 3D-printed humanoid robot platform (~$2,500) designed to make robotics research and sim-to-real learning accessible.
Unitree G1: A humanoid robot shown moving autonomously in real time under natural-language voice commands, eliminating the need for pre-programming.

6. Scientific & Audio Research

AI Co-Scientist (Google DeepMind): A multi-agent system that simulates a research team. Agents debate, critique, and refine hypotheses to accelerate scientific discovery in fields like drug discovery.
WaveFlow (Meta): Generates raw audio waveforms directly, skipping latent compression. While it struggles with complex instruments like piano, it is highly effective for sound effects.
Stable Audio 3 (Stability AI): An open-source music generation model supporting up to 6-minute tracks. It includes tools for audio inpainting and LoRA fine-tuning.

7. Notable Mentions

Qwen 3.7 Max: Alibaba’s latest agentic model, optimized for multi-step coding and reasoning. It features vision capabilities and can be embedded into robotic systems.
Qwen 3.5 Live Translate: A real-time translation model that uses visual context (e.g., seeing a product or object) to improve translation accuracy.
Fashion Chameleon: A real-time virtual try-on system for video that swaps garments while maintaining character and motion consistency, achieving 24 FPS on a single GPU.

Synthesis

The current landscape of AI is shifting from simple "chat-based" interactions toward specialized, agentic, and physically-aware systems. The trend toward "pixel-space" generation and "raw waveform" processing indicates a push for higher fidelity. The emergence of multi-agent systems (like the AI Co-Scientist) and open-source robotics platforms (Hugging Face) suggests that AI is becoming a collaborative partner in both scientific research and physical-world automation. The focus on "messy" real-world data, whether in audio transcription, industrial robotics, or translation, marks a maturation of these technologies for practical, commercial application.