AI videos with sound, edit 3D models, animate anyone, realtime voices, new best TTS - AI NEWS

By AI Search

AI · Technology

Key Concepts

  • 3D Model Editing: Micro-editing specific parts of 3D objects while preserving the rest.
  • Spatial Arrangement in Image Generation: Improving the placement of objects within AI-generated images.
  • Character and Style Transfer: Transferring characters and styles between images using AI.
  • Text-to-Speech with Expressions: Generating realistic speech with emotions and multiple speakers.
  • AI Video Generation: Creating videos from text or images with increasing realism and control.
  • Autonomous Game Playing: AI agents playing and completing games with minimal steps.
  • Multimodal AI: AI models that understand both images and text.
  • Lip-Sync and Animation: Animating images with realistic lip movements synced to audio.
  • Physical Simulation: Simulating the physical properties and movements of objects in 3D.
  • Robotics: Humanoid robots with advanced sensing and coordination capabilities.
  • Video-to-Audio Generation: Creating sound effects that match the events in a video.
  • Image-and-Audio-to-Video Generation: Creating videos from a reference image and audio input.
  • Realtime Voice AI: AI models designed for low-latency, real-time conversations.

3D Model Editing with VoxHammer

  • Main Idea: VoxHammer is an AI tool for micro-editing 3D models.
  • Process:
    1. User specifies the region to edit using a mask.
    2. User provides a text prompt or reference image describing the desired edit.
    3. The AI changes only the specified area, keeping the rest of the object consistent.
  • Examples:
    • Changing a crab shell to stone using the prompt "stone crab."
    • Adding bread to an empty bowl.
    • Changing apples in a bowl to oranges.
    • Changing swords into roses using a reference image.
  • Technical Details:
    • Uses "part-aware object editing" to segment the model into meaningful parts.
    • Requires an Nvidia GPU with at least 40 GB of VRAM (80 GB recommended).
  • Availability: Code and models are available on GitHub.
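The workflow above — mask a region, then edit only that region — can be sketched in a few lines. This is a toy stand-in using a made-up dictionary voxel representation, not VoxHammer's actual code or data format:

```python
# Toy illustration of mask-constrained editing: only voxels selected by
# the mask are rewritten; every other voxel keeps its original label.
def edit_voxels(grid, mask, new_value):
    """Return a copy of `grid` with only the masked voxels replaced.

    grid: dict mapping (x, y, z) -> material label
    mask: set of (x, y, z) coordinates chosen for editing
    """
    edited = dict(grid)  # copy, so the original model stays intact
    for voxel in mask:
        if voxel in edited:
            edited[voxel] = new_value
    return edited

# "Stone crab": retexture the shell voxels, leave the claw untouched.
crab = {(0, 0, 0): "shell", (1, 0, 0): "shell", (0, 1, 0): "claw"}
shell_mask = {(0, 0, 0), (1, 0, 0)}
stone_crab = edit_voxels(crab, shell_mask, "stone")
# stone_crab[(0, 1, 0)] is still "claw"
```

The point mirrored here is consistency: the edit is confined to the masked region, which is what distinguishes this kind of tool from regenerating the whole model.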

Improving Spatial Arrangement in Image Generation with CoMPaSS

  • Main Idea: CoMPaSS is a fine-tune (LoRA) that improves the spatial arrangement of objects in AI-generated images.
  • Problem: Existing image generators often fail to place objects relative to each other as specified in the prompt.
  • Solution: CoMPaSS helps the AI model understand and implement spatial relationships.
  • Examples:
    • Generating a bird below a skateboard (instead of on a skateboard).
    • Generating a bear to the right of a truck.
    • Generating a laptop above a dog.
  • Integration: Can be added on top of existing image generators like Flux and Stable Diffusion.
  • Performance: Significantly improves benchmark scores (e.g., 98% and 130% improvement on certain benchmarks).
  • Availability: Code and models are available on GitHub for various image generators.
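Relations like "below" or "to the right of" can be verified mechanically from bounding boxes, which is roughly how spatial benchmarks score generations. Here is a toy checker (an illustration, not the benchmark's actual code), using image coordinates where y grows downward:

```python
# Decide whether box_a sits entirely below box_b in image coordinates.
# Boxes are (x, y, width, height); y increases toward the bottom.
def is_below(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return ay >= by + bh  # a's top edge starts past b's bottom edge

bird = (40, 120, 30, 20)        # hypothetical detection boxes
skateboard = (30, 60, 80, 15)
print(is_below(bird, skateboard))   # True: the bird is below the skateboard
print(is_below(skateboard, bird))   # False
```

A benchmark would run a detector over the generated image and apply checks like this for each relation named in the prompt.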

Character and Style Transfer with USO (ByteDance)

  • Main Idea: USO is a free and open-source image generator by ByteDance that excels at character and style transfer.
  • Capabilities:
    • Generating images of a character in different settings based on a reference image and text prompt.
    • Copying the style of an existing design.
    • Mixing the styles of two reference images.
    • Combining a character reference image with a style reference image.
  • Performance: Outperforms existing methods like InfiniteYou in character transfer and OmniStyle/StyleID in style transfer.
  • Demo: Online demo available on Hugging Face.
  • Local Use: Code and models are available for local use, including a Gradio demo.
  • VRAM Requirement: FP8 mode requires only 16 GB of VRAM.

Microsoft VibeVoice: Powerful Text-to-Speech Generator

  • Main Idea: VibeVoice is a text-to-speech generator that supports multiple speakers, long generations, expressions, and emotions.
  • Features:
    • Supports up to four distinct speakers.
    • Generates audio up to 90 minutes long.
    • Automatically handles expressions and emotions based on the transcript.
    • Can spontaneously sing.
    • Supports switching between different languages within the transcript.
    • Can include background music for some voices.
  • Technical Details: Uses a continuous speech tokenizer to process long sequences of text.
  • Demo: Online demo available on Hugging Face.
  • Local Use: Code and models are available on GitHub.
  • Model Variants:
    • 0.5 billion parameter model (small, for real-time streaming - not yet released).
    • 1.5 billion parameter model (efficient, longer context length).
    • 7 billion parameter model (higher quality, shorter generation length).
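Multi-speaker input is typically just a labeled transcript. Below is a minimal parser for a "Speaker N: line" format — the exact label syntax here is an assumption for illustration, not VibeVoice's documented spec:

```python
# Split a labeled transcript into (speaker, line) turns.
def parse_transcript(text):
    turns = []
    for raw in text.strip().splitlines():
        speaker, _, line = raw.partition(":")
        turns.append((speaker.strip(), line.strip()))
    return turns

script = """
Speaker 1: Welcome back to the show.
Speaker 2: Thanks, great to be here.
"""
turns = parse_transcript(script)
# → [('Speaker 1', 'Welcome back to the show.'),
#    ('Speaker 2', 'Thanks, great to be here.')]
```

A TTS engine would then synthesize each turn with the voice assigned to that speaker label, which is how a single transcript can drive up to four distinct voices.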

Waver 1.0 (ByteDance): New AI Video Generator

  • Main Idea: Waver 1.0 is a new AI video generator by ByteDance capable of both text-to-video and image-to-video generation.
  • Capabilities:
    • Generates videos of 5 or 10 seconds.
    • Supports 720p and 1080p resolutions.
    • Generates coherent videos with cinematic camera movements.
    • Accurately simulates physics (e.g., water splashing).
    • Supports specifying different scenes within a generation.
    • Can create split-screen videos.
  • Ranking: Ranked number three for both text-to-video and image-to-video on the Artificial Analysis leaderboard.
  • Availability: Technical report and architecture available on GitHub, but open-source status is unclear. Can be tried for free on their Discord server (Random Research Lab).

GPT-5 Plays Pokémon Crystal

  • Main Idea: GPT-5 autonomously played and completed Pokémon Crystal in a record-breaking 9,500 steps.
  • Performance: Needed far fewer steps than other models (e.g., OpenAI's o3 took over 27,000 steps).
  • Strengths: Planning, spatial reasoning, and decision-making.
  • Strategy: Efficient path optimization and focus on using a single Pokémon in battles.
  • Availability: The entire journey can be viewed on a Twitch live stream.
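"Efficient path optimization" in a tile-based game boils down to shortest-path search. As a generic illustration (not the agent's actual machinery), breadth-first search gives the minimum number of moves on a grid:

```python
from collections import deque

# Fewest 4-directional moves from start to goal; -1 if unreachable.
# grid is a list of equal-length strings where '#' marks a wall.
def min_steps(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        (r, c), steps = frontier.popleft()
        if (r, c) == goal:
            return steps
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append(((nr, nc), steps + 1))
    return -1

maze = ["..#",
        "..#",
        "..."]
print(min_steps(maze, (0, 0), (2, 2)))  # 4
```

An agent that consistently takes near-minimal routes like this accumulates a far lower step count over a whole playthrough than one that wanders.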

MiniCPM-V 4.5: Tiny but Powerful Multimodal Model

  • Main Idea: MiniCPM-V 4.5 is a multimodal model (vision and language) that outperforms larger proprietary models like GPT-4o and Gemini on certain tasks.
  • Capabilities:
    • Image analysis and real-world understanding.
    • Extracting information from images (e.g., handwritten notes, tables, complex formulas).
    • OCR (Optical Character Recognition).
  • Performance: Achieves higher scores on various benchmarks compared to GPT-4o and Qwen2.5-VL.
  • Size: Only 8 billion parameters, making it compact and efficient.
  • Availability: Models are available on Hugging Face, including GGUF and quantized versions. Code and instructions for local use are available on GitHub.

ChatLLM by Abacus.AI

  • Main Idea: ChatLLM is an all-in-one platform for using various AI models, image generators, and video generators.
  • Features:
    • Seamlessly switch between different models.
    • Integrated platform for AI models, image generators, and video generators.
    • Artifacts feature for previewing generations side-by-side.
    • Deep Agent feature for complex tasks like creating PowerPoints, websites, and research reports.
  • Pricing: $10 per month for access to all features.

OmniHuman-1.5 (ByteDance): Realistic Lip-Sync and Animation

  • Main Idea: OmniHuman-1.5 is an AI tool that creates lip-synced videos from an image and an audio track, with dynamic and natural movements.
  • Improvements: Offers more dynamic and natural movements compared to version 1.
  • Control: Allows users to enter an optional text prompt to control the movement of the video further.
  • Capabilities:
    • Realistic lip-syncing.
    • Conveying emotions effectively.
    • Animating animals.
    • Controlling camera movement.
  • Availability: Research paper released, but open-source status is unclear. Version 1 is available on their Dreamina platform.

Pixie: Generating Physically Correct 3D Scenes

  • Main Idea: Pixie generates 3D scenes that are physically correct based on a few images of an object at multiple angles.
  • Capabilities:
    • Simulates how an object would move and interact in real life.
    • Guesses material properties (stiffness, density, Poisson's ratio).
  • Performance: Simulates physical properties more accurately and is a thousand times faster than other methods.
  • Process: Maps visual features to physical properties using a neural network and then simulates movement.
  • Availability: GitHub repo with instructions for installation and training.
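Pixie's core mapping — visual features in, material parameters out — can be caricatured as a single learned linear layer. The weights below are invented for illustration; the real system uses a trained neural network over 3D feature volumes:

```python
# Map a visual feature vector to three material properties via a
# hand-written linear layer (stand-in for a trained network).
def predict_material(features, weights, bias):
    props = [sum(f * w for f, w in zip(features, row)) + b
             for row, b in zip(weights, bias)]
    return {"stiffness": props[0], "density": props[1], "poisson": props[2]}

features = [0.2, 0.8]           # e.g. pooled visual features of an object
weights = [[1.0, 0.5],          # invented parameters, one row per property
           [0.0, 2.0],
           [0.1, 0.1]]
bias = [0.1, 0.5, 0.2]
material = predict_material(features, weights, bias)
# material["stiffness"] ≈ 0.7
```

Once properties like stiffness, density, and Poisson's ratio are predicted per region, an off-the-shelf physics solver can simulate how the object deforms and moves — prediction is cheap, which is where the claimed speedup over optimization-based methods comes from.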

Humanoid Robot News: Alex and Unitree G1

  • Alex (WI Robotics):
    • Humanoid robot with whole-body force sensing.
    • Fingertip repeatability of less than 0.3 mm.
    • Demonstrated the ability to pull out a specific component from a chip.
  • Unitree G1:
    • Humanoid robot that can play ping pong.
    • Maintained a rally for over 100 shots against a human.

HunyuanVideo-Foley (Tencent): Video-to-Audio Generation

  • Main Idea: HunyuanVideo-Foley generates high-quality sound effects that match the events in a video.
  • Process: Takes in a video and a text prompt describing it and outputs a matching audio track.
  • Performance: Outperforms other video-to-audio generators like MMAudio in terms of audio quality.
  • Availability: Free Hugging Face space for online use. GitHub repo with instructions for local use (requires 20 GB VRAM for inference, 24 GB recommended).

Wan-S2V (Alibaba): Image-and-Audio-to-Video Generation

  • Main Idea: Wan-S2V creates a video from an image and an audio input, making the characters in the video move, talk, and express emotions realistically.
  • Control: Allows users to control the movements of the video using a text prompt.
  • Performance: Scores highest on benchmarks for video quality, expression, authenticity, and identity consistency compared to other lip-sync tools.
  • Availability: GitHub repo with instructions for local use (requires 80 GB VRAM for the 14 billion parameter model).

OpenAI GPT Realtime: New Realtime Voice AI

  • Main Idea: GPT Realtime is OpenAI's latest speech-to-speech model, designed for real-time, low-latency voice conversations.
  • Applications: Customer service, support, and sales calls.
  • Claimed Improvements: More natural and expressive than the previous GPT-4o voice.
  • Availability: Available via API and on their playground.

Conclusion

This week in AI has seen significant advancements across various domains, including 3D model editing, image generation, video generation, text-to-speech, robotics, and multimodal AI. Key highlights include ByteDance's new video generator Waver 1.0, Alibaba's Wan-S2V for creating videos from images and audio, and the impressive performance of the small multimodal model MiniCPM-V 4.5. These developments showcase the rapid pace of innovation in AI and its potential to transform a wide range of industries.
