New open Nano Banana, AI plays any video game, new top open source models, long videos: AI NEWS
By AI Search
AI Weekly Roundup: December 2025
Key Concepts:
- Vision-Action Foundation Models: AI models trained to perceive and interact with environments, like video games, mimicking human behavior. (e.g., Nvidia Nitrogen)
- Diffusion Transformers: Core architecture behind many recent video generation models, known for high-quality output.
- LoRA (Low-Rank Adaptation): A technique for fine-tuning large language models with fewer parameters, enabling customization and efficiency.
- VRAM (Video Random Access Memory): The memory on a graphics card, crucial for running AI models, particularly those dealing with large datasets like video.
- Homography: A transformation that maps points from one plane to another, used in AI Cam for accurate camera movement.
- Albedo, Normal, Metallic, Roughness: Properties used to define the visual characteristics of surfaces in 3D rendering.
- Teleoperation: Remote control of a robot, often mimicking the movements of a human operator.
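The LoRA idea above is easiest to see in code. The sketch below is a minimal illustration, not any particular library's implementation: the frozen weight matrix W is augmented with a trainable low-rank product B·A, and all names and shapes are illustrative.

```python
import numpy as np

# LoRA sketch: instead of updating the full weight matrix W (d_out x d_in),
# train two small matrices A (r x d_in) and B (d_out x r) with rank
# r << min(d_out, d_in). Adapted forward pass: y = W x + (alpha / r) * B A x.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 4, 8

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weights
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialised, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)

# Parameter savings: full fine-tune vs. LoRA adapter
full_params = d_out * d_in        # 64 * 64 = 4096
lora_params = r * (d_in + d_out)  # 4 * 128 = 512
print(full_params, lora_params)
```

This is why LoRA adapters are small enough to share and hot-swap: here the adapter trains 512 parameters instead of 4096, and the ratio improves further as the matrices grow.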
1. Nvidia Nitrogen: Autonomous Game Playing AI
Nvidia released Nitrogen, a vision-action foundation model capable of autonomously playing a wide variety of video games, including those it hasn’t been specifically trained on. This is achieved through training on 40,000 hours of gameplay from over 1,000 different games. Nitrogen operates by seeing the game (vision component) and then using a controller to take actions (action component), mirroring human gameplay without directly hacking the game code. The training data encompasses genres like action RPGs, platformers, sports, and action-adventure. The code is available on GitHub, allowing users to download and set up the system.
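The perceive-then-act loop described above can be sketched in a few lines. This is a toy illustration of the vision-action pattern only; `policy` and `emulator_step` are hypothetical stand-ins, not Nitrogen's real API.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Action:
    """A controller input, e.g. buttons pressed this frame."""
    buttons: list = field(default_factory=list)

def policy(frame):
    """Placeholder vision-action model: screen frame in, controller action out.
    A real model would run a neural network over the pixels."""
    return Action(buttons=[random.choice(["A", "B", "left", "right"])])

def play(emulator_step, initial_frame, steps=10):
    """Closed loop: the agent only sees rendered frames and emits controller
    inputs, never reading or modifying game internals."""
    frame, trajectory = initial_frame, []
    for _ in range(steps):
        action = policy(frame)         # vision component: look at the screen
        frame = emulator_step(action)  # action component: press buttons
        trajectory.append(action)
    return trajectory

# Toy emulator that returns a dummy frame regardless of input
traj = play(lambda action: "frame", "frame0", steps=5)
print(len(traj))  # 5 actions taken
```

The key design point, as the article notes, is that the interface is exactly a human's: pixels in, controller presses out, which is what lets one model generalize across many games.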
2. Alibaba’s FlashPortrait: Infinite-Length Portrait Animation
Tongyi Lab at Alibaba introduced FlashPortrait, an AI capable of generating infinitely long portrait-animation videos roughly six times faster than previous methods. The system takes a reference image and a driving video, mapping facial movements from the video onto the new character. Crucially, FlashPortrait stays consistent even in long videos, unlike tools such as LivePortrait, HunyuanPortrait, FantasyPortrait, and Wan Animate, which warp over time. The AI excels at face and head transfer, making it well suited to talking avatars. The code (inference and training) is available on GitHub, with a reported VRAM requirement of approximately 10GB for a 10-second video at 25fps with sequential CPU offload.
3. Generative Refocusing: Post-Capture Focus Adjustment
A new AI model allows for adjusting the focus of a photo after it has been taken. Users can adjust focus forward and backward, or refocus on specific objects or characters within the image. The AI can also adjust the aperture, controlling the depth of field for cinematic effects. The code is available on GitHub, with a relatively small size of 2.6GB, making it accessible on lower-end hardware.
4. Qwen Image Edit 2511: Enhanced Open-Source Image Editor
Alibaba’s Qwen team released version 2511 of Qwen Image Edit, a powerful open-source image editor comparable to Nano Banana. This version boasts improved character consistency and folds popular LoRAs (such as relighting and novel view synthesis) directly into the base model. Users can edit images with natural-language prompts, change poses, and apply different styles. Version 2511 outperforms its predecessor, particularly at maintaining consistency through complex edits. Quantized versions are available for lower-VRAM systems (e.g., 8GB).
5. AI Cam: Dynamic Camera Movement Generation
AI Cam allows users to alter the camera movement and perspective of existing videos, creating new videos of the same scene. The AI preserves character consistency and scene details during these transformations. The system combines a diffusion transformer with a homography-guided self-attention block and a warping module to ensure accurate, smooth camera motion. It outperforms competitors like ReCamMaster and TrajectoryCrafter at recreating scenes from different camera perspectives. The code is available on GitHub but requires over 50GB of VRAM (48GB for UniDepth, 28GB for the Infam pipeline).
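The homography mentioned above (see Key Concepts) is just a 3x3 matrix acting on homogeneous coordinates. The toy example below warps points with a pure-translation homography; AI Cam's homography-guided attention is of course far more involved, and the function here is illustrative.

```python
import numpy as np

def apply_homography(H, pts):
    """Map Nx2 Cartesian points through a 3x3 homography H.
    Points are lifted to homogeneous coords [x, y, 1], multiplied by H,
    then divided by the resulting w to return to the image plane."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    mapped = (H @ pts_h.T).T
    return mapped[:, :2] / mapped[:, 2:3]             # perspective divide

# A translation-only homography: shift every point by (+10, +5)
H = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0,  5.0],
              [0.0, 0.0,  1.0]])

pts = np.array([[0.0, 0.0], [100.0, 50.0]])
print(apply_homography(H, pts))  # maps (0,0)->(10,5) and (100,50)->(110,55)
```

A general H (rotation, scale, perspective terms in the bottom row) maps one camera's view of a plane to another's, which is what makes it a natural guide for re-rendering a scene from a new viewpoint.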
6. Unitree: Advanced Humanoid Robot Teleoperation
A new Unitree demo showcased advanced teleoperation capabilities, allowing a human to remotely control the robot with minimal equipment (no bulky VR headsets). The robot mirrors the human’s full-body movements smoothly while maintaining balance. The demonstration highlighted potential applications in physically demanding jobs and remote operation in hazardous environments.
7. ChatLLM (Sponsored): All-in-One AI Platform
ChatLLM by Abacus AI is an integrated platform offering access to various AI models for text generation, image creation, and video editing. It features a Deep Agent for automating complex tasks like creating presentations and reports. Access to all features is available for $10/month.
8. Storymen: Cinematic Video Creation with Story Consistency
ByteDance released Storymen, an AI capable of creating longer, cinematic videos while maintaining story consistency. The AI generates a video, selects keyframes, stores them in a memory bank, and uses this memory to generate subsequent scenes that align with the established narrative. The code is available on GitHub.
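The keyframe memory bank described above can be sketched as a simple bounded store: generate a clip, keep a few representative frames, and condition the next clip on them. Everything below (class name, stride, capacity) is an illustrative stand-in, not the model's actual design.

```python
from collections import deque

class KeyframeMemory:
    """Bounded memory bank of keyframes used to condition later scenes."""

    def __init__(self, capacity=8):
        self.bank = deque(maxlen=capacity)  # oldest keyframes drop out first

    def add_scene(self, frames, stride=4):
        # Keep every `stride`-th frame of the generated clip as a keyframe
        self.bank.extend(frames[::stride])

    def context(self):
        # Conditioning context handed to the generator for the next scene
        return list(self.bank)

mem = KeyframeMemory(capacity=4)
mem.add_scene([f"scene1_f{i}" for i in range(8)])   # keeps f0, f4
mem.add_scene([f"scene2_f{i}" for i in range(12)])  # keeps f0, f4, f8
print(mem.context())  # the 4 most recent keyframes; scene1_f0 was evicted
```

The bounded capacity is the interesting trade-off: the generator always sees the most recent narrative anchors without the context growing unboundedly with video length.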
9. RICO: Video Micro-Editing with Text Prompts
RICO (Region Constraint in Context Generation) is described as a "Nano Banana for video," enabling micro-editing of existing videos using text prompts. Users can replace objects, add elements, remove items, or change artistic styles. RICO demonstrates superior performance compared to tools like LucyEdit and Ditto. The benchmark and evaluation code are available on GitHub, with the model and inference code planned for release in 2-3 weeks.
10. MiniMax M2.1: State-of-the-Art Open-Source Model
MiniMax M2.1 is a new open-source model excelling at agentic coding, multi-step reasoning, and complex task solving. It rivals top closed models like Claude 4.5, Gemini 3 Pro, and GPT 5.2 on benchmarks such as SWE-bench and competitive math. Demonstrations included creating a 3D racing game, a beehive simulation, and a comprehensive financial report from a spreadsheet. The chatbot is available online, and the weights are open-sourced on Hugging Face (229 billion parameters, roughly 230GB on disk).
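The quoted numbers are worth a sanity check: 229 billion parameters at roughly one byte per parameter (i.e., 8-bit weights) comes out to about 229GB, which matches the ~230GB download size; this suggests (an assumption on my part) that the published checkpoint is stored at around one byte per parameter, while half precision would roughly double it.

```python
def weight_size_gb(params, bytes_per_param):
    """Back-of-envelope checkpoint size: parameter count times bytes each,
    expressed in decimal gigabytes."""
    return params * bytes_per_param / 1e9

params = 229e9  # 229 billion parameters, as reported

print(round(weight_size_gb(params, 1)))  # 229 GB at 8-bit
print(round(weight_size_gb(params, 2)))  # 458 GB at fp16/bf16
```

The same arithmetic explains why quantized releases matter so much for local use: each halving of bytes-per-parameter halves the disk and memory footprint.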
11. GLM 4.7: Powerful Agentic Coding and Tool Use
GLM 4.7, another new open-source model, demonstrates strong capabilities in agentic coding, tool use, and complex reasoning. It rivals top closed models on benchmarks and has been shown to build complex applications such as an Android OS simulation and a fully functional video editor.
12. Spacia: 3D Scene Consistency in Video Generation
Spacia generates videos with consistent scenes by maintaining a 3D memory of the environment. This allows for smooth camera movements and edits within the scene. The code is planned for release.
13. MV Inverse: 3D Scene Reconstruction from Images
MV Inverse reconstructs complete, editable 3D scenes from single or multiple photos, predicting properties like albedo, lighting, and surface orientation. The code is available on GitHub.
14. Dream Montage: Keyframe-Guided Video Creation
Dream Montage allows users to create videos by specifying keyframes at different points in time, with the AI filling in the intermediate frames. The code is not yet released.
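Keyframe-guided generation as described above is, at its simplest, an in-betweening problem. The snippet below shows naive linear blending between two keyframes as a conceptual stand-in for the learned interpolation Dream Montage performs; frames are just arrays here and the function is hypothetical.

```python
import numpy as np

def inbetween(k0, k1, n):
    """Return n intermediate frames blending keyframe k0 into keyframe k1.
    A real model would synthesize plausible motion; this just cross-fades."""
    ts = np.linspace(0.0, 1.0, n + 2)[1:-1]  # exclude the keyframes themselves
    return [(1 - t) * k0 + t * k1 for t in ts]

k0 = np.zeros((2, 2))  # toy "frame": all-black
k1 = np.ones((2, 2))   # toy "frame": all-white
mids = inbetween(k0, k1, 3)
print([m[0, 0] for m in mids])  # [0.25, 0.5, 0.75]
```

The difference between this cross-fade and a generative in-betweener is exactly the selling point of such tools: the model invents coherent intermediate content and motion rather than averaging pixels.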
15. 3D Regen: Single-Image to Editable 3D Scene
3D Regen converts a single photo of an indoor scene into a complete, editable 3D scene. The code is planned for release in January 2026.
16. Carry 4D: 3D Reconstruction and Interaction Tracking
Carry 4D reconstructs 3D models of humans and objects interacting in a scene, tracking their movements over time. This technology is valuable for training humanoid robots. The code is planned for release in January 2026.
17. Animate Any X: 3D Character Animation in Any World
Animate Any X allows users to place any 3D character into any 3D world and control their actions with text prompts. The code is coming soon.
Conclusion:
The AI landscape continues to evolve at a rapid pace, with significant advancements in open-source models rivaling those of proprietary systems. Key areas of progress include video generation, 3D scene reconstruction, and autonomous agent capabilities. The release of code for many of these models is democratizing access to cutting-edge AI technology, fostering innovation and accelerating development. The trend towards more consistent and controllable video generation, coupled with the increasing power of agentic coding models, suggests a future where AI-powered content creation and automation will become increasingly prevalent.