Skip to content

svjack/ComfyUI_RH_Ovi

Β 
Β 

Repository files navigation

ComfyUI Ovi Node

A custom node for ComfyUI that integrates Ovi for synchronized video+audio generation from text or image inputs.

✨ Features

  • 🎬 Joint Video+Audio Generation: Generate synchronized video and audio content simultaneously
  • πŸ“ Text-to-Video+Audio: Create videos from text prompts with speech and sound effects
  • πŸ–ΌοΈ Image-to-Video+Audio: Generate videos from image and text inputs
  • ⏱️ 5-Second Videos: 24 FPS, 720Γ—720 area, multiple aspect ratios (9:16, 16:9, 1:1, etc)
  • βš™οΈ Memory Optimization: FP8 precision + CPU offload for 24GB VRAM GPUs
  • πŸš€ Flexible Control: Advanced parameter control for quality fine-tuning

πŸ”§ Node List

Core Nodes

  • RunningHub Ovi Model Loader: Load and initialize Ovi engine with optimization options
  • RunningHub Ovi Text to Video: Generate video+audio from text prompts
  • RunningHub Ovi Image to Video: Generate video+audio from image and text inputs

πŸš€ Quick Installation

Step 1: Install the Node

# Navigate to ComfyUI custom_nodes directory
cd ComfyUI/custom_nodes/

# Clone the repository
git clone https://github.com/HM-RunningHub/ComfyUI_RH_Ovi.git

cd ComfyUI_RH_Ovi

# Install dependencies
pip install -r requirements.txt

# Install Flash Attention
pip install flash_attn --no-build-isolation

Step 2: Download Required Models (modelscope svjack/Ovi)

# Download models (will download to ComfyUI/models/Ovi by default)
python download_weights.py

# Download fp8 quantized model (for 24GB VRAM mode)
cd ../../models/Ovi
wget -O "model_fp8_e4m3fn.safetensors" \
  "https://huggingface.co/rkfg/Ovi-fp8_quantized/resolve/main/model_fp8_e4m3fn.safetensors"
cd ../../custom_nodes/ComfyUI_RH_Ovi

# Final model structure should look like:
# ComfyUI/models/Ovi/
# β”œβ”€β”€ MMAudio/
# β”‚   └── ext_weights/
# β”‚       β”œβ”€β”€ best_netG.pt
# β”‚       └── v1-16.pth
# β”œβ”€β”€ Ovi/
# β”‚   β”œβ”€β”€ model.safetensors
# β”‚   └── model_fp8_e4m3fn.safetensors
# └── Wan2.2-TI2V-5B/
#     β”œβ”€β”€ google/umt5-xxl/
#     β”œβ”€β”€ models_t5_umt5-xxl-enc-bf16.pth
#     └── Wan2.2_VAE.pth

# Restart ComfyUI

πŸ“– Usage

Basic Workflow

[RunningHub Ovi Model Loader] β†’ [RunningHub Ovi Text to Video] β†’ [Save/Preview Video]

image

Prompt Format

Ovi uses special tags to control speech and audio:

  • Speech: <S>Your speech content here<E> - Text will be converted to speech
  • Audio Description: <AUDCAP>Audio description here<ENDAUDCAP> - Describes audio/sound effects

Example Prompt:

<S>Hello world!<E> <AUDCAP>Soft piano music playing<ENDAUDCAP>

Generation Types

Text-to-Video+Audio

  1. Connect RunningHub Ovi Model Loader to RunningHub Ovi Text to Video
  2. Input text prompt with speech and audio tags
  3. Set video dimensions, seed, and generation parameters
  4. Generate synchronized video+audio

Image-to-Video+Audio

  1. Load an image using ComfyUI's Load Image node
  2. Connect image and ovi_engine to RunningHub Ovi Image to Video
  3. Input text prompt with speech and audio tags
  4. Generate video+audio based on the image

Example Prompts

πŸ› οΈ Technical Requirements

  • GPU: 24GB+ VRAM (with CPU offload + FP8 optimization)
    • 32GB+ VRAM without optimization
  • RAM: 32GB+ recommended
  • Storage: ~30GB for all models
    • Ovi models: ~12GB
    • MMAudio: ~2GB
    • Wan2.2-TI2V-5B: ~13GB
    • FP8 quantized model: ~6GB
  • CUDA: Required for optimal performance

⚠️ Important Notes

  • Model Paths: Models must be placed in ComfyUI/models/Ovi/ directory
  • Default Configuration: Model Loader defaults to CPU offload + FP8 for 24GB VRAM
    • Disable both for 32GB+ VRAM (better quality, faster inference)
  • FP8 Model: Required for 24GB VRAM mode (slight quality degradation)
  • All model files must be downloaded before first use

πŸ“„ License

This project is based on the original Ovi project.

πŸ”— References

⭐ Citation

If you find this project useful, please consider citing the original Ovi paper:

@misc{low2025ovitwinbackbonecrossmodal,
      title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation}, 
      author={Chetwin Low and Weimin Wang and Calder Katyal},
      year={2025},
      eprint={2510.01284},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2510.01284}, 
}

About

ComfyUI custom nodes for Ovi joint video+audio generation

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 96.4%
  • Cuda 2.1%
  • C 1.3%
  • C++ 0.2%