New open-source AI video generator is out! HunyuanVideo 1.5 tutorial

Key Concepts

HunyuanVideo 1.5 (by Tencent): A new open-source AI video generator.
Open-Source: Software whose source code is available for anyone to inspect, modify, and enhance.
VRAM (Video Random Access Memory): The memory used by a graphics processing unit (GPU) to store image data and other information needed for rendering. Low VRAM is a common limitation for running AI models.
Uncensored: The model does not have built-in content restrictions.
Wan 2.2: The current leading open-source video model, used as the comparison baseline.
Image to Video: Generating a video sequence starting from a single input image.
Text to Video: Generating a video sequence from a textual description (prompt).
LoRAs (Low-Rank Adaptation): Small fine-tuned add-on weights that can be applied to a base model to introduce specific styles, characters, or animations without retraining the entire model.
Physics Understanding: The model's ability to generate realistic physical interactions (e.g., crushing a can).
Camera Movements: The model's capability to interpret and generate specific camera actions like panning, zooming, and tilting.
ComfyUI: A popular graphical interface for running open-source AI image, video, and audio generators.
GGUF Models: Compressed versions of AI models designed to run on hardware with less VRAM.
Super Resolution (Upscaler): A process that increases the resolution of an image or video.
VAE (Variational Autoencoder): A type of generative model used in AI, often for image and video generation.
Text Encoder: A component of AI models that converts text prompts into numerical representations that the model can understand.
Diffusion Models: A class of generative models that learn to create data by gradually removing noise from a random signal.

HunyuanVideo 1.5: A New Open-Source AI Video Generator

This video introduces HunyuanVideo 1.5 by Tencent, a new open-source AI video generator presented as a strong contender, especially for users with low VRAM. The model is also noted for being uncensored. The video gives an overview of its capabilities, compares it with the leading open-source model, Wan 2.2, and walks through a step-by-step installation for offline use.

Key Features and Capabilities of HunyuanVideo 1.5

HunyuanVideo 1.5 demonstrates several impressive capabilities:

Smooth Motion Generation: The model excels at producing fluid and realistic movements. Examples include a figure skater's spin and a DJ's actions, where anatomical correctness and smooth motion are highlighted.
Aesthetic Quality: It is capable of generating visually appealing cinematic shots and B-roll, such as leaves with dew drops in sunlight.
Improved Text Rendering: HunyuanVideo 1.5 renders text within videos more reliably, as demonstrated by a neon sign displaying "HunyuanVideo 1.5."
Enhanced Physics Understanding: The model exhibits a better grasp of physics, illustrated by a realistic depiction of a soda can being crushed by a hand. This is a significant improvement over the first version.
Multiple Camera Movement Support: A standout feature is the ability to control camera movements directly through prompts. Examples include a slow pan down to reveal a cat and a gradual pull-back and rise to show a desert landscape. The model's ability to shift focus, mimicking professional camera work, is also noted.
Style Versatility: It supports various artistic styles, including anime, retro, and claymation. Examples of a "cakeman" eating himself and a claymation girl are shown.
Image to Video Functionality: HunyuanVideo 1.5 natively supports generating videos from an input image, producing cinematic results.

Comparison with Wan 2.2

The video presents a direct comparison between HunyuanVideo 1.5 and Wan (specifically version 2.2, as 2.5 was not yet released and its open-source status was uncertain).

Camera Movement and Action: In a scene with a man in a crowded marketplace with explosions, HunyuanVideo 1.5 handled the camera rotation and the character's panicked reactions more effectively than Wan 2.2, which struggled with speed and realistic movement.
Anatomy: For a figure skater scene, HunyuanVideo 1.5 was slightly better in anatomical correctness, although hands and fingers were not always well-defined. Wan 2.2 showed more anatomical flaws, particularly in the arms.
High-Action Scenes (Parkour): In a parkour athlete scene, both models had limitations. With HunyuanVideo 1.5, the athlete did not perform flips and the hands were occasionally wrong. Wan 2.2 was noisier and also lacked proper flips, though it appeared to have more action. The reviewer considered this a tie, with HunyuanVideo 1.5 being slightly more coherent.
Character Recognition: Both models failed to recognize specific characters such as Naruto and One Punch Man and only identified SpongeBob. Wan 2.2 generated this animation slightly better. However, the video emphasizes that because the models are open source, LoRAs can be used to introduce specific characters.
Prompt Understanding (Dual Dimensions): For a prompt involving a child climbing a ladder bridging dual dimensions (city and temple), HunyuanVideo 1.5 was more coherent, blending the dimensions seamlessly and depicting the moonlight correctly. Wan 2.2 simply split-screened the scene and missed the moonlight.
Camera Movement and Text Generation: HunyuanVideo 1.5 correctly executed the camera movements (pushing in, tilting up) in a kissing couple scene, but failed to generate the overlay text. Wan 2.2 failed to execute the camera motion correctly.
Image to Video (High Action): When generating a high-action fight scene from an image, HunyuanVideo 1.5 produced faster movements but distorted the hands and fingers. Wan 2.2 was slower but more coherent.
Jiggle Physics: Both models performed well in generating jiggle physics from an input photo, with HunyuanVideo 1.5 showing more contrast and definition, and Wan 2.2 showing faster movements.
Dancing Videos: For generating dancing videos from an image, HunyuanVideo 1.5 was more cinematic, with camera movement, but had distorted hands and fingers. Wan 2.2 had a static camera and faster dancing, but also noticeable warping and hand errors.
Epic Scenes (Monster Attack): In a scene of a monster destroying a city, HunyuanVideo 1.5 failed to generate a high-action, shaky camera shot and its movements were slow. Wan 2.2 had more action but physically incorrect effects. HunyuanVideo 1.5 was deemed more coherent.
Anime Style: Both models could animate characters from an anime image, including hair blowing and head movements. HunyuanVideo 1.5's characters also spoke and moved their heads, though the mouths were not correct. Wan 2.2 also animated hair and head movements, with some talking. This was considered a close call.
Influencer Videos: Both models could generate videos of an influencer talking about a product. The speaking speed was slow in both, but this could be edited in post-processing. The ability to mass-produce such videos for social media is highlighted.

Overall Comparison Summary: The reviewer found the quality to be similar between the two models, but HunyuanVideo 1.5 excelled in camera control. Benchmark scores indicated that HunyuanVideo 1.5 outperformed Wan 2.2 in instruction following, visual quality, structural stability, and motion effects for text-to-video. For image-to-video, HunyuanVideo 1.5 was better in instruction following, visual quality, and motion effects, while Wan 2.2 was better in structural stability and image consistency. User comparisons also showed a higher win rate for HunyuanVideo 1.5 in both text-to-video and image-to-video. Despite the benchmarks, the reviewer's own initial tests ended in a tie.

Technical Specifications and Performance

Video Length: HunyuanVideo 1.5 can produce videos of 5 to 10 seconds. Quality deteriorates beyond 10 seconds.
Model Size: It is an 8.3 billion parameter model, significantly smaller than Wan 2.2's 14 billion parameters. The smaller size makes it more efficient and runnable on consumer-grade GPUs.
Resolution Variants: Two variants are available: one for 480p and another for 720p video generation.
Upscaling: A super-resolution enhancement can be used to upscale generated videos to 1080p.
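The parameter counts above can be turned into a rough estimate of how large the weights are at a given precision. The sketch below is only back-of-the-envelope arithmetic: the function name is made up for illustration, the 4.5 bits per weight used for a Q4-style quantization is an assumed average rather than a figure from the video, and real VRAM use is higher because activations, the text encoders, and the VAE also need memory. At 16 bits per weight, 8.3 billion parameters come to about 15.5 GiB, which lines up with the approximate download size quoted for the full-precision diffusion model.

```python
# Back-of-the-envelope weight sizes: parameters x bits per weight.
# Parameter counts are the ones quoted in the spec list above.
PARAMS_HUNYUANVIDEO_15 = 8.3e9
PARAMS_WAN_22 = 14e9

# Bits per weight for each precision. The Q4 value is an assumed average
# (quantized formats store some extra scale data), not a measured number.
BITS_PER_WEIGHT = {
    "fp16/bf16": 16.0,
    "fp8": 8.0,
    "q4 (assumed ~4.5 bits)": 4.5,
}


def weight_size_gib(params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB (2**30 bytes)."""
    return params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    for label, bits in BITS_PER_WEIGHT.items():
        hunyuan = weight_size_gib(PARAMS_HUNYUANVIDEO_15, bits)
        wan = weight_size_gib(PARAMS_WAN_22, bits)
        print(f"{label:>24}: HunyuanVideo 1.5 {hunyuan:5.1f} GiB | Wan 2.2 {wan:5.1f} GiB")
```

The numbers only cover the diffusion model weights, so treat them as a lower bound when deciding whether a given GPU is likely to fit a variant.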

Installation and Usage

The video provides detailed instructions for installing and running HunyuanVideo 1.5.

Online Trial

A free account is required to use the online interface on their website.
The interface allows selection between text-to-video and image-to-video, prompt input, aspect ratio selection, and an option to rewrite prompts.

Local Installation with ComfyUI

Prerequisites: ComfyUI must be installed. A video tutorial for ComfyUI installation is recommended.
Updating ComfyUI: ComfyUI must be updated to the latest version to support HunyuanVideo 1.5. This is done via the Manager > Update ComfyUI option.
Downloading Workflows: Workflows for text-to-video and image-to-video can be downloaded as JSON files from the provided link.
Loading Workflows: Drag and drop the downloaded JSON files onto the ComfyUI interface to load the pre-built node graphs.
Model Downloads: Several models need to be downloaded and placed in specific ComfyUI folders:
- Text Encoders:
  - Qwen 2.5 VL text encoder (approx. 9 GB) - goes into ComfyUI/models/text_encoders
  - ByT5 small model file (approx. 400 MB) - goes into ComfyUI/models/text_encoders
- Diffusion Models:
  - HunyuanVideo 1.5 model for text-to-video (720p, approx. 15.5 GB) - goes into ComfyUI/models/diffusion_models
  - Optional 1080p SR file (approx. 15.5 GB) - goes into ComfyUI/models/diffusion_models
- VAE:
  - VAE model (approx. 2.3 GB) - goes into ComfyUI/models/vae
Model Selection in ComfyUI: After downloading, refresh the model list (press 'R') and select the downloaded models in the respective dropdowns within the workflow.
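If a model does not show up in a dropdown after refreshing, the usual cause is a file sitting in the wrong folder. The following is a minimal sketch for checking that, assuming the standard ComfyUI layout with underscore-separated folder names (text_encoders, diffusion_models, vae) under ComfyUI/models. The function name and the "ComfyUI" root path are placeholders for illustration and should be adjusted to the actual install location.

```python
from pathlib import Path

# Folders named in the download steps above, relative to <ComfyUI root>/models.
# The underscore spellings assume the standard ComfyUI folder layout.
MODEL_FOLDERS = {
    "text_encoders": "Qwen 2.5 VL and ByT5 text encoders",
    "diffusion_models": "text-to-video model and optional 1080p SR model",
    "vae": "VAE model",
}


def list_model_files(comfy_root: Path) -> dict[str, list[str]]:
    """Return the file names found in each expected model folder.

    A missing folder is reported as an empty list rather than an error.
    """
    found: dict[str, list[str]] = {}
    for folder in MODEL_FOLDERS:
        path = comfy_root / "models" / folder
        if path.is_dir():
            found[folder] = sorted(p.name for p in path.iterdir() if p.is_file())
        else:
            found[folder] = []
    return found


if __name__ == "__main__":
    # Placeholder path: point this at the real ComfyUI directory.
    for folder, files in list_model_files(Path("ComfyUI")).items():
        status = ", ".join(files) if files else "no files found"
        print(f"{folder} ({MODEL_FOLDERS[folder]}): {status}")
```

An empty list for a folder means either the download is still missing or the file was saved somewhere else.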
Text-to-Video Workflow:
- Enter text prompt and negative prompt.
- Easy Cache: An optional node that speeds up generation at the cost of some quality. Select the node and press Ctrl+B to toggle it (Ctrl+B switches a node between bypassed and active).
- Video Settings: Specify width, height, and length (frames). Frame rate is typically 24 FPS. Length is calculated as seconds * FPS. Batch size determines how many videos are generated simultaneously.
- VAE Tiled/VAE Decode High: Options to speed up generation if it's too slow.
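The length field in the video settings is given in frames, not seconds. As a small worked example of the seconds * FPS rule above, the sketch below converts a target duration into a frame count. The function name is illustrative, and 24 FPS is used as the default because that is the typical frame rate mentioned in the video.

```python
def frames_for(seconds: float, fps: int = 24) -> int:
    """Convert a target duration in seconds to a frame count (seconds * FPS)."""
    if seconds <= 0 or fps <= 0:
        raise ValueError("seconds and fps must be positive")
    return round(seconds * fps)


if __name__ == "__main__":
    # The 5 to 10 second range quoted for the model, at 24 FPS.
    for seconds in (5, 10):
        print(f"{seconds} s at 24 FPS -> {frames_for(seconds)} frames")
```

A 5-second clip at 24 FPS is 120 frames and a 10-second clip is 240 frames, so doubling the duration doubles the number of frames the model has to generate.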
1080p Upscaling Workflow:
- Requires enabling additional nodes (Ctrl+B).
- Download and place the latent upsampler model (approx. 200 MB) into ComfyUI/models/latent_upscale_models.
- Select the downloaded upscaler model in the workflow.
- Configure upscale method, width, height, and FPS.
Image-to-Video Workflow:
- Download the image-to-video workflow JSON.
- Additional Model: Download the CLIP vision model (approx. 836 MB) into ComfyUI/models/clip_vision.
- Download the HunyuanVideo 1.5 image-to-video model (approx. 15.5 GB) into ComfyUI/models/diffusion_models.
- Low VRAM Option: A quantized FP8 version of the image-to-video model is available for lower VRAM systems.
- Upload the input image.
- Specify positive and negative prompts.
- Adjust output dimensions and length.
- The upscaling workflow can also be applied to image-to-video.

Running on Low VRAM (GGUF Models)

For users with less than 14 GB of VRAM, compressed GGUF models are available.
Source: The Hugging Face user jayn7 provides these models.
Installation: Download the desired GGUF model (e.g., Q4 GGUF for 6 GB VRAM) and place it in ComfyUI/models/unet.
Node Replacement: In the ComfyUI workflow, replace the standard diffusion model loader node with the Unet Loader (GGUF) node. This requires installing the ComfyUI-GGUF custom node from the Manager.
Performance: GGUF models are faster but result in noticeably lower quality, as demonstrated by a messed-up hair effect in the example.

Sponsor Segment: Mango

The video includes a sponsored segment for Mango, a video-to-video creation platform.

Functionality: Transforms and stylizes videos, allowing users to edit them as desired.
Features:
- Apply styles from top image editors to videos.
- Change video characters.
- End-to-end solution with an intuitive timeline.
- Control transformations with ControlNets and advanced settings.
- Render videos up to 1 minute long.
- Full transformation directly within the app, from reference images to final scaling.
Access: Currently in closed beta, with instant access via a provided link and a 50% discount on the pro plan using the code "AI search."

Conclusion and Takeaways

HunyuanVideo 1.5 is presented as a significant advancement in open-source AI video generation, particularly for its improved camera control, aesthetic quality, and efficiency on lower-VRAM systems. While it competes closely with Wan 2.2, its strengths in prompt adherence and camera manipulation make it a compelling option. The ComfyUI installation guide lets users run the model locally, and the GGUF models extend access to users with limited hardware. The reviewer concludes that the two models are very close in quality, with HunyuanVideo 1.5 having an edge in camera control, and encourages viewers to share their own experiences and comparisons.