A custom node for ComfyUI that integrates Ovi for synchronized video+audio generation from text or image inputs.
- Joint Video+Audio Generation: Generate synchronized video and audio content simultaneously
- Text-to-Video+Audio: Create videos from text prompts with speech and sound effects
- Image-to-Video+Audio: Generate videos from image and text inputs
- 5-Second Videos: 24 FPS, 720×720 area, multiple aspect ratios (9:16, 16:9, 1:1, etc.)
- Memory Optimization: FP8 precision + CPU offload for 24GB VRAM GPUs
- Flexible Control: Advanced parameter control for quality fine-tuning
- RunningHub Ovi Model Loader: Load and initialize Ovi engine with optimization options
- RunningHub Ovi Text to Video: Generate video+audio from text prompts
- RunningHub Ovi Image to Video: Generate video+audio from image and text inputs
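The "720×720 area" in the feature list can be read as a pixel budget rather than a fixed resolution, so different aspect ratios trade width against height. The sketch below shows that arithmetic only. The function name `dims_for_aspect` and the rounding to multiples of 16 are illustrative assumptions, not values taken from the Ovi code, so check the node's own width and height inputs for what it actually accepts.

```python
import math


def dims_for_aspect(ratio_w, ratio_h, area=720 * 720, multiple=16):
    """Return (width, height) covering roughly `area` pixels at ratio_w:ratio_h.

    `multiple` is an assumed alignment for each side, not a documented Ovi value.
    """
    # Solve width * height = area with width / height = ratio_w / ratio_h.
    height = math.sqrt(area * ratio_h / ratio_w)
    width = height * ratio_w / ratio_h

    def snap(value):
        # Round a side length to the nearest multiple (at least one multiple).
        return max(multiple, int(round(value / multiple)) * multiple)

    return snap(width), snap(height)


if __name__ == "__main__":
    for ratio in [(1, 1), (16, 9), (9, 16)]:
        print(ratio, dims_for_aspect(*ratio))
```

Treat the results as starting values for the width and height inputs, not as guaranteed-supported resolutions.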
# Navigate to ComfyUI custom_nodes directory
cd ComfyUI/custom_nodes/
# Clone the repository
git clone https://github.com/HM-RunningHub/ComfyUI_RH_Ovi.git
cd ComfyUI_RH_Ovi
# Install dependencies
pip install -r requirements.txt
# Install Flash Attention
pip install flash_attn --no-build-isolation
# Download models (downloads to ComfyUI/models/Ovi by default)
python download_weights.py
# Download fp8 quantized model (for 24GB VRAM mode)
cd ../../models/Ovi
wget -O "model_fp8_e4m3fn.safetensors" \
"https://huggingface.co/rkfg/Ovi-fp8_quantized/resolve/main/model_fp8_e4m3fn.safetensors"
cd ../../custom_nodes/ComfyUI_RH_Ovi
# Final model structure should look like:
# ComfyUI/models/Ovi/
# ├── MMAudio/
# │   └── ext_weights/
# │       ├── best_netG.pt
# │       └── v1-16.pth
# ├── Ovi/
# │   ├── model.safetensors
# │   └── model_fp8_e4m3fn.safetensors
# └── Wan2.2-TI2V-5B/
#     ├── google/umt5-xxl/
#     ├── models_t5_umt5-xxl-enc-bf16.pth
#     └── Wan2.2_VAE.pth
# Restart ComfyUI

[RunningHub Ovi Model Loader] → [RunningHub Ovi Text to Video] → [Save/Preview Video]
Ovi uses special tags to control speech and audio:
- Speech: <S>Your speech content here<E> (the enclosed text is converted to speech)
- Audio Description: <AUDCAP>Audio description here<ENDAUDCAP> (describes audio/sound effects)
Example Prompt:
<S>Hello world!<E> <AUDCAP>Soft piano music playing<ENDAUDCAP>
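When prompts are assembled programmatically, a small helper can keep the tags paired. The sketch below builds a prompt in the format shown above and checks that every opening tag has a matching closing tag. The function names `build_prompt` and `tags_balanced` are hypothetical helpers for illustration and are not part of the node package.

```python
def build_prompt(speech, audio_caption=""):
    """Wrap speech in <S>...<E> and an optional audio description in <AUDCAP>...<ENDAUDCAP>."""
    parts = [f"<S>{speech.strip()}<E>"]
    if audio_caption.strip():
        parts.append(f"<AUDCAP>{audio_caption.strip()}<ENDAUDCAP>")
    return " ".join(parts)


def tags_balanced(prompt):
    """Return True if each opening tag appears as often as its closing tag."""
    pairs = [("<S>", "<E>"), ("<AUDCAP>", "<ENDAUDCAP>")]
    return all(prompt.count(opening) == prompt.count(closing) for opening, closing in pairs)


if __name__ == "__main__":
    prompt = build_prompt("Hello world!", "Soft piano music playing")
    print(prompt)
    print(tags_balanced(prompt))
```

The balance check only compares tag counts; it does not catch tags that are nested or in the wrong order.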
- Connect the RunningHub Ovi Model Loader node to the RunningHub Ovi Text to Video node
- Enter a text prompt with speech and audio tags
- Set video dimensions, seed, and generation parameters
- Generate synchronized video+audio
- Load an image using ComfyUI's Load Image node
- Connect the image and ovi_engine to the RunningHub Ovi Image to Video node
- Enter a text prompt with speech and audio tags
- Generate video+audio based on the image
- Text-to-Video: See example_prompts/gpt_examples_t2v.csv
- Image-to-Video: See example_prompts/gpt_examples_i2v.csv
- GPU: 24GB+ VRAM (with CPU offload + FP8 optimization)
  - 32GB+ VRAM without optimization
- RAM: 32GB+ recommended
- Storage: ~30GB for all models
  - Ovi models: ~12GB
  - MMAudio: ~2GB
  - Wan2.2-TI2V-5B: ~13GB
  - FP8 quantized model: ~6GB
- CUDA: Required for optimal performance
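The GPU thresholds above amount to a simple decision rule, sketched here as a function of the card's nominal VRAM. The function name and the dictionary keys `cpu_offload` and `fp8` are hypothetical labels for the two loader options and may not match the node's actual input names.

```python
def recommend_loader_settings(vram_gb):
    """Map nominal GPU VRAM (in GB) to the two optimization toggles.

    The key names are illustrative assumptions, not the node's real input names.
    """
    if vram_gb >= 32:
        # Enough memory to run without either optimization.
        return {"cpu_offload": False, "fp8": False}
    if vram_gb >= 24:
        # 24GB cards need both CPU offload and the FP8 model.
        return {"cpu_offload": True, "fp8": True}
    raise ValueError("Less than 24GB VRAM is below the stated requirement")


if __name__ == "__main__":
    for vram in (24, 32, 48):
        print(vram, recommend_loader_settings(vram))
```

Pass the card's advertised capacity (24, 32, and so on), since the memory reported by the driver is usually slightly lower than the nominal figure.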
- Model Paths: Models must be placed in the ComfyUI/models/Ovi/ directory
- Default Configuration: Model Loader defaults to CPU offload + FP8 for 24GB VRAM
  - Disable both for 32GB+ VRAM (better quality, faster inference)
- FP8 Model: Required for 24GB VRAM mode (slight quality degradation)
- All model files must be downloaded before first use
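A quick pre-flight check can confirm the files are in place before the first run. The sketch below looks for the files named in the model structure listed in the installation steps. The relative paths mirror that listing, so the nesting is an assumption: adjust it if your download placed anything elsewhere (for example, if the FP8 file sits directly in ComfyUI/models/Ovi/). The function name `missing_model_files` is a hypothetical helper, not part of the node package.

```python
from pathlib import Path

# Relative paths copied from the model structure in this README (assumed nesting).
EXPECTED = [
    "MMAudio/ext_weights/best_netG.pt",
    "MMAudio/ext_weights/v1-16.pth",
    "Ovi/model.safetensors",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/Wan2.2_VAE.pth",
]
# Only needed for the 24GB VRAM (FP8) mode.
FP8_FILE = "Ovi/model_fp8_e4m3fn.safetensors"


def missing_model_files(root, need_fp8=True):
    """Return the expected files that are not present under `root`."""
    root = Path(root)
    wanted = EXPECTED + ([FP8_FILE] if need_fp8 else [])
    return [rel for rel in wanted if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("ComfyUI/models/Ovi")
    if missing:
        print("Missing model files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected model files found.")
```

Run it from the directory that contains ComfyUI, or pass an absolute path to the models folder. Set `need_fp8=False` when running without the FP8 optimization.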
This project is based on the original Ovi project.
If you find this project useful, please consider citing the original Ovi paper:
@misc{low2025ovitwinbackbonecrossmodal,
title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation},
author={Chetwin Low and Weimin Wang and Calder Katyal},
year={2025},
eprint={2510.01284},
archivePrefix={arXiv},
primaryClass={cs.MM},
url={https://arxiv.org/abs/2510.01284},
}