ComfyUI_JR_DreamID-V


ComfyUI_JR_DreamID-V is a JR-maintained fork of 👉 HM-RunningHub / ComfyUI_RH_DreamID-V.

This project is based on DreamID-V and provides high-fidelity video face swapping inside ComfyUI. Unlike the upstream repository, which primarily targets high-end GPUs (RTX 4090 / 5090), this JR fork focuses on broader hardware accessibility and stability.


✨ Goals of the JR Fork

Make DreamID-V usable for more people, not only top-tier GPUs.

This fork is designed to:

  • ✅ Support 16GB VRAM GPUs (e.g. RTX 4060 Ti, RTX 4080)
  • ✅ Support dual-GPU setups (e.g. T5 encoder on a second GPU)
  • ✅ Support CPU / GPU mixed offloading
  • ✅ Reduce OOM issues on mid-range hardware
  • ✅ Preserve full compatibility with existing RunningHub workflows
  • ✅ Run a 720×1280, 16 FPS, 5-second face swap on an RTX 5060 Ti (16 GB)

✨ Features

  • 🎭 High-Fidelity Face Swapping: video face swapping powered by a Diffusion Transformer

  • 🎬 Video-Driven Motion: use a video as the motion / pose driver

  • 🖼️ Reference Image Identity: a single face image as the identity reference

  • 🔧 Native ComfyUI Integration: seamlessly integrated into ComfyUI workflows

  • 🧠 Low-VRAM Friendly (JR Fork): flexible device placement for T5 / main model / VAE

  • 🕺 DWPose (ONNX, GPU-Accelerated) Pose Extraction (JR Fork)
    Replaces legacy MediaPipe pose extraction with DWPose (ONNXRuntime).
    Supports CUDA / TensorRT acceleration, significantly improving stability and accuracy on complex motion videos.

  • 🕺 Wan-Faster Version (JR Fork)
    Introduces DreamID-V-Wan-1.3B-DWPose, significantly improving stability and robustness in pose extraction.



Release Notes

JR_LoadVideoPlus Node Updates

  • Added target_long_edge: scales by the longer edge (in pixels) while preserving the original aspect ratio, eliminating manual width/height calculations.

  • Kept VIDEO output unchanged: remains drop-in compatible with the existing sampler pipeline (no downstream changes required).

  • Added 3 extra output ports for easier downstream wiring and dynamic parameter linkage:

    • out_w: final output width
    • out_h: final output height
    • out_fps: final output FPS
  • Normalized out_fps to an integer (int, using round) for improved consistency and downstream compatibility.

Note: When target_long_edge is set (> 0), it takes priority and the shorter edge is computed automatically. scale_w/scale_h apply only when target_long_edge is disabled.
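
For reference, the sizing logic behaves roughly like the Python sketch below (a minimal illustration; the parameter names mirror the node's options, but the internals are simplified, and the real node may additionally snap dimensions to model-friendly multiples):

def resolve_output_size(src_w, src_h, src_fps,
                        target_long_edge=0, scale_w=0, scale_h=0):
    # target_long_edge > 0 takes priority: scale the longer edge to that
    # pixel count and derive the shorter edge from the aspect ratio.
    if target_long_edge > 0:
        if src_w >= src_h:
            out_w = target_long_edge
            out_h = round(src_h * target_long_edge / src_w)
        else:
            out_h = target_long_edge
            out_w = round(src_w * target_long_edge / src_h)
    elif scale_w > 0 and scale_h > 0:
        out_w, out_h = scale_w, scale_h  # simplified scale_w/scale_h handling
    else:
        out_w, out_h = src_w, src_h
    # out_fps is normalized to int via round for downstream consistency
    return out_w, out_h, int(round(src_fps))

# Example: 1920x1080 @ 29.97 fps, target_long_edge=1280 -> (1280, 720, 30)
print(resolve_output_size(1920, 1080, 29.97, target_long_edge=1280))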


📋 Nodes

This plugin provides two sets of nodes, with full backward compatibility.

✅ JR Nodes (Recommended)

| Node Name | Description |
| --- | --- |
| JR_DreamID-V_Loader | Load the DreamID-V pipeline (device-selectable) |
| JR_DreamID-V_Sampler | Run video face swapping |
| JR_DreamID-V_LongVideo_Sampler | Run long-video face swapping via chunking (recommended for long videos) |

πŸ” Legacy Nodes (Compatibility)

Node Name Description
RunningHub_DreamID-V_Loader Legacy loader (for old workflows)
RunningHub_DreamID-V_Sampler Legacy sampler

πŸ’‘ New workflows should use JR nodes. Existing workflows will continue to work without modification.


⚡ DreamID-V Wan-Faster Backend (JR Integrated)

This fork integrates DreamID-V Wan-1.3B-Faster as an optional backend, providing significantly faster inference with reduced sampling steps, while maintaining identity fidelity.

Recommended Settings (Wan-Faster)

| Parameter | Recommended | Notes |
| --- | --- | --- |
| backend | wan_faster | Must be explicitly selected in the Sampler |
| sampling_steps | 12 | The faster model is trained for short schedules |
| fps (LongVideo) | 16 | Strong speed / quality balance |
| sample_solver | unipc | Required (others not supported) |
| frame_num | 81 | Same as standard DreamID-V |
| overlap_frames | 8–12 | Recommended for long videos |

⚠️ Using higher sampling steps (e.g. 20+) with wan_faster provides no quality benefit and only increases runtime.


🎞️ Long Video (Chunked) Sampler (JR Enhancement)

JR_DreamID-V_LongVideo_Sampler is designed for long videos that would otherwise OOM when processed as a single clip. It splits the input into chunks, processes each chunk sequentially, writes intermediate frames to disk, and finally merges them into a single output video.

Key Parameters

  • frame_num: Chunk size in frames.
    Example: total 1620 frames, frame_num=81 β†’ 20 chunks.

💡 Wan-Faster Recommendation:
When using backend=wan_faster, fps=16 is strongly recommended for an optimal speed/quality tradeoff.

  • fps (LongVideo only):

    • -1 (default): follow source video FPS (no resampling)
    • >= 1: time-based FPS resampling before pose/mask and inference
      • Keeps the original duration (no slow-motion)
      • Reduces the number of frames processed by the model
      • Significantly improves speed for high-FPS sources (e.g., 60 β†’ 24)
  • overlap_frames: Warm-up overlap for temporal stability (see the sketch after this list).
    For chunk i>0, overlap_frames frames from the end of the previous chunk are prepended for stability only; the overlapped frames are dropped from the output to avoid duplicates.

  • return_frames_as_images / max_frames_to_return: Optional frame tensor output.
    Recommended to keep disabled for long videos; use frames_dir instead.

  • keep_temp: Keep intermediate chunk files on disk for debugging.
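
Conceptually, the chunk planning works like the sketch below (illustrative only; plan_chunks is a hypothetical helper, not the node's actual code):

def plan_chunks(total_frames, frame_num=81, overlap_frames=8):
    # Each chunk after the first is fed overlap_frames warm-up frames
    # from the previous chunk; those frames are dropped from the output
    # so no duplicates appear in the merged video.
    chunks = []
    start = 0
    while start < total_frames:
        end = min(start + frame_num, total_frames)
        feed_start = max(0, start - overlap_frames) if start > 0 else 0
        chunks.append({"feed": (feed_start, end), "keep": (start, end)})
        start = end
    return chunks

# Example from above: 1620 frames with frame_num=81 -> 20 chunks
print(len(plan_chunks(1620)))  # 20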

Outputs

  • video: Final merged MP4 (audio is muxed from the original input video)
  • frames_dir: Directory containing merged PNG frames (frame_%08d.png) for downstream processing
  • frames: Optional IMAGE batch (guarded by max_frames_to_return)

Requirements

This node uses FFmpeg/FFprobe for probing, cutting and encoding.
Make sure ffmpeg and ffprobe are available in your system PATH.


🧠 Memory & Performance Optimizations (JR Fork Core)

This section explains why the JR fork can run higher resolution / longer videos on mid-range GPUs. Full technical details and implementation notes are available in:

📄 /docs/DreamID-V Memory & Performance Optimizations_01022026.md

Unlike the original DreamID-V implementation, which primarily targets high-end GPUs (RTX 4090 / 5090), the JR fork introduces a series of architecture-aware, inference-time optimizations that significantly reduce peak VRAM usage without sacrificing output quality.

These optimizations are not simple parameter tweaks, but targeted improvements based on how DiT / Transformer-based diffusion models behave at scale.


🔑 Key Optimizations Overview

1️⃣ FFN Chunking (Most Critical)

DreamID-V uses a Diffusion Transformer (DiT) backbone. In DiT models, the Feed-Forward Network (FFN) inside each Transformer block is the largest source of peak VRAM usage.

The JR fork introduces FFN Chunking:

  • Splits the token dimension N into smaller chunks
  • Computes FFN activations chunk-by-chunk
  • Reassembles the output without numerical approximation

✔ Exact same output ✔ Significantly lower peak VRAM ✔ No quality loss

Typical runtime log:

[FFN-Chunk] patched 30 FFN blocks with auto-chunking
[FFN-Chunk] auto select chunks=8 for N=10800
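
The underlying idea is simple and exact. A minimal PyTorch sketch of the technique (not the fork's actual patching code, which wraps the model's FFN modules in place):

import torch
import torch.nn as nn

def ffn_chunked(ffn, x, chunks):
    # Apply the FFN to (B, N, C) input in slices along the token
    # dimension N, then concatenate once. Because the FFN acts on each
    # token independently, the result equals ffn(x) exactly, while peak
    # activation memory drops roughly by the chunk factor.
    return torch.cat([ffn(part) for part in x.chunk(chunks, dim=1)], dim=1)

# Toy check with an MLP-style FFN
ffn = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
x = torch.randn(1, 10800, 64)
assert torch.allclose(ffn(x), ffn_chunked(ffn, x, chunks=8), atol=1e-6)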

2️⃣ Token Count Awareness (What is N?)

In DreamID-V (DiT-based), memory usage scales with the number of tokens (N), not just resolution.

Simplified:

N ≈ T × (H / stride) × (W / stride)

Where:

  • T = number of frames processed in a chunk
  • H, W = spatial resolution
  • stride = model patch / VAE stride (typically 8 or 16)

As resolution and frame count increase, N grows rapidly, directly impacting FFN and Attention memory usage.

The JR fork dynamically selects FFN chunk counts based on N, instead of using a fixed configuration.


3️⃣ Recommended FFN Chunk Heuristics

Based on extensive testing, the following practical rule-of-thumb is used:

recommended_chunks ≈ ceil(N / 4000)
clamped to the range [2, 16]
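
Expressed in code, the stated rule-of-thumb is (a sketch; the fork's auto-selection may differ in detail):

import math

def recommended_ffn_chunks(n_tokens):
    # ceil(N / 4000), clamped to the range [2, 16]
    return max(2, min(16, math.ceil(n_tokens / 4000)))

print(recommended_ffn_chunks(6_000))   # 2  (e.g. ~512p, 16 frames)
print(recommended_ffn_chunks(80_000))  # 16 (upper clamp)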

Typical Examples

| Resolution | Frames per chunk | Typical N | Recommended FFN Chunks |
| --- | --- | --- | --- |
| 512p | 16 | ~6k | 2–4 |
| 720p | 16 | ~9k | 4–6 |
| 1024p | 16 | ~11–14k | 6–8 |
| 1280p | 16 | ~18–22k | 8–12 |
| 1280p | 41 | ~25–28k | 8–16 |

💡 Practical Sweet Spot: for most 1024p–1280p workloads, FFN chunks = 8 offers the best stability/performance balance.


4️⃣ VAE Temporal Micro-Batching (Encode & Decode)

Problem

The original implementation encodes and decodes entire video sequences at once, which often causes:

  • Sudden VRAM spikes
  • CUDA allocator fragmentation
  • OOM in conv / F.pad layers

JR Fork Solution

  • Temporal chunking along the time dimension for VAE encode
  • Automatic fallback to smaller temporal chunks on OOM
  • Optimized decode path that avoids incremental torch.cat (a common VRAM trap)

Result:

✔ Lower peak VRAM ✔ Much higher stability on long videos ✔ Identical decoded output
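
A simplified sketch of the encode path (vae_encode is a hypothetical callable on (B, C, T, H, W) clips; a real video VAE also needs care with temporal receptive fields at slice boundaries, which is omitted here):

import torch

def encode_temporal_chunks(vae_encode, video, t_chunk=16):
    # Encode the time axis in slices; on CUDA OOM, retry the current
    # slice at half length. Latents are collected in a list and
    # concatenated once at the end (no incremental torch.cat).
    latents = []
    t, T = 0, video.shape[2]
    while t < T:
        size = min(t_chunk, T - t)
        while True:
            try:
                latents.append(vae_encode(video[:, :, t:t + size]))
                break
            except torch.cuda.OutOfMemoryError:
                size = max(1, size // 2)  # automatic fallback
        t += size
    return torch.cat(latents, dim=2)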


5️⃣ Warmup Pass (Formalized)

A long-standing community trick:

"Run 1 second first, then run the full video."

The JR fork formalizes this into an explicit warmup pass:

  • Same resolution and backend as the real run
  • Very short duration (e.g. 1 second)
  • Very few steps (e.g. 4)
  • No output is saved
  • No empty_cache() is called

Purpose:

  • Initialize CUDA context
  • Warm up kernels / autotune paths
  • Stabilize the memory allocator

This dramatically improves first-chunk stability and reduces unexplained early OOMs.
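
In pseudocode terms, the warmup amounts to something like this (run_pipeline and its argument names are hypothetical):

def run_with_warmup(run_pipeline, args):
    # Same resolution/backend, ~1 second of frames, very few steps;
    # the warmup output is discarded and empty_cache() is NOT called,
    # so the allocator keeps its warm state for the real run.
    warmup = dict(args, frame_num=min(args["fps"], args["frame_num"]),
                  sampling_steps=4, save_output=False)
    run_pipeline(**warmup)       # discarded
    return run_pipeline(**args)  # the real run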


6️⃣ Cache Management (Critical Rule)

The JR fork follows a strict cache policy:

| Location | Clear Cache? |
| --- | --- |
| Inside FFN / Attention | ❌ |
| During sampling steps | ❌ |
| During VAE encode/decode | ❌ |
| Between video chunks | ✅ (once per chunk) |

This preserves the allocator's "warm state" while preventing long-term memory accumulation.
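
Applied to the long-video loop, the policy looks like this sketch (run_chunk is a hypothetical per-chunk inference call):

import torch

def process_long_video(chunks, run_chunk):
    results = []
    for chunk in chunks:
        results.append(run_chunk(chunk))  # no cache clearing inside a chunk
        torch.cuda.empty_cache()          # exactly once, at the chunk boundary
    return results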


📈 Real-World Results

5-second video, 720×1280 resolution, 16 FPS (Wan-Faster):

| Metric | Before JR Optimizations | After JR Optimizations |
| --- | --- | --- |
| Cold start | OOM | Stable |
| Peak VRAM | ~17–18 GB | ~15–16 GB |
| Inference VRAM | ~99% | ~8–10 GB |
| Runtime | 20+ min, unstable | ~18 min, stable |
| Output quality | – | Identical |

📄 Further Reading

For a deep dive into implementation details, including:

  • FFN chunk patching
  • Token dimension analysis
  • VAE stride behavior
  • Decode memory pitfalls
  • Warmup design rationale

👉 See: 📘 /docs/DreamID-V Memory & Performance Optimizations_01022026.md


🧩 Why This Matters

These optimizations transform DreamID-V from:

❌ "Only usable on flagship GPUs"

into:

✅ "Predictable, tunable, and stable on 16GB-class hardware"

All without modifying the trained model or sacrificing visual fidelity.


⚠️ Important: FlashAttention 2 Requirement (Wan-Faster Backend)

This section is critical. Please read before using wan_faster.

Why FlashAttention 2 is Required

The wan_faster backend in DreamID-V relies on FlashAttention 2 (FA2) for its attention implementation.

  • FA2 is not optional
  • FA2 is not automatically installed
  • Without FA2, wan_faster will crash at runtime

If FlashAttention 2 is missing or incompatible, you will see errors such as:

AssertionError: FLASH_ATTN_2_AVAILABLE

or

CUDA error: no kernel image is available for execution on the device

Supported GPUs (FlashAttention 2)

FlashAttention 2 requires Ampere or newer GPUs.

| GPU | Compute Capability | FA2 Support |
| --- | --- | --- |
| RTX 3060 | SM 8.6 | ✅ Supported |
| RTX 3080 / 3090 | SM 8.6 | ✅ Supported |
| RTX 40xx | SM 8.9 | ✅ Supported |
| RTX 20xx | SM 7.5 | ❌ Not supported |

Check your GPU capability:

python -c "import torch; print(torch.cuda.get_device_name(0)); print(torch.cuda.get_device_capability(0))"

Installing FlashAttention 2 (Windows / Linux)

⚠️ FlashAttention wheels are CUDA & Python version specific

Recommended environment (tested):

  • Python 3.10–3.12
  • CUDA 12.x / 13.0
  • PyTorch CUDA build

Install:

pip install flash-attn==2.8.2 --no-build-isolation

Then verify:

python - <<EOF
import torch
from flash_attn import flash_attn_func

# (batch, seqlen, nheads, headdim) half-precision tensors on the GPU
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
o = flash_attn_func(q, k, v, dropout_p=0.0, causal=False)
print("FlashAttention OK:", o.shape)
EOF

If this test passes, Wan-Faster will work.

You can also use prebuilt wheels from: https://github.com/Goldlionren/AI-windows-whl.git


Backend Behavior Summary

| Backend | FlashAttention 2 | Notes |
| --- | --- | --- |
| wan | ❌ Not required | Slower, more compatible |
| wan_faster | ✅ Required | Faster, fewer steps |

If your GPU does not support FA2, use:

backend = wan

🕺 DWPose Pose Backend (JR Enhancement)

This JR fork integrates DWPose (ONNX-based) as the default pose extraction backend, replacing the legacy MediaPipe FaceMesh pipeline.

Why DWPose?

  • ✅ Much higher robustness on fast / complex motions
  • ✅ Fewer "no pose detected" failures
  • ✅ GPU-accelerated via ONNX Runtime (CUDA / TensorRT)
  • ✅ Fully independent from PyTorch device placement

Backend Behavior

  • Default backend: dwpose
  • Automatic fallback: MediaPipe (if ONNXRuntime or GPU is unavailable)
  • Device independence:
    • T5 can run on CPU
    • DWPose can still run on GPU
    • No cross-interference between PyTorch and ONNXRuntime

Required DWPose Models (ONNX)

Place the following ONNX models under:

ComfyUI/models/DreamID-V/pose/models/
├── dw-ll_ucoco_384.onnx
└── yolox_l.onnx

⚠️ These models are NOT included in the repository.


Automatic Download (Optional)

The JR fork supports automatic download of the DWPose ONNX models.

Enable by setting the environment variable:

DREAMIDV_AUTO_DOWNLOAD_DWPOSE=1

If disabled, a clear error message will indicate which files are missing and where to place them.


ONNXRuntime Acceleration

  • Supported providers:
    • CUDAExecutionProvider
    • TensorrtExecutionProvider (if available)
    • CPUExecutionProvider (fallback)

Runtime log example:

[DWPose] det providers : ['CUDAExecutionProvider', 'CPUExecutionProvider']
[DWPose] pose providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
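
To check which providers your onnxruntime build actually exposes:

import onnxruntime as ort
print(ort.get_available_providers())
# e.g. ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']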

🚀 Usage

  1. Install ComfyUI (Python ≥ 3.10 recommended)
  2. Clone this repository into:
ComfyUI/custom_nodes/ComfyUI_JR_DreamID-V
  3. Install dependencies:
pip install -r requirements.txt
  4. Download the required DreamID-V and DWPose models
  5. Launch ComfyUI

💡 Note:
Selecting cpu for T5 does NOT affect DWPose.
Pose extraction runs via ONNXRuntime and can still use GPU acceleration.


🖥️ System Requirements

  • OS: Windows / Linux
  • GPU: NVIDIA (16GB VRAM recommended)
  • Python: 3.10+
  • PyTorch: CUDA-enabled build
  • ONNX Runtime:
    • onnxruntime (CPU)
    • onnxruntime-gpu (recommended for GPU acceleration)

🔀 JR Fork Highlights

Compared to the original DreamID-V:

  • ✅ DWPose (ONNX) replaces MediaPipe for pose extraction
  • ✅ GPU-accelerated pose detection (CUDA / TensorRT)
  • ✅ Clear separation of T5 / Pose / UNet devices
  • ✅ Improved stability on real-world videos

πŸ› οΈ Installation

Method 1: ComfyUI Manager (Future Support)

  1. Install ComfyUI Manager
  2. Search for ComfyUI_JR_DreamID-V
  3. Install

Method 2: Manual Installation (Recommended)

  1. Navigate to ComfyUI custom_nodes directory:
cd ComfyUI/custom_nodes
  1. Clone the JR fork:
git clone https://github.com/<your-github-username>/ComfyUI_JR_DreamID-V.git
  1. Install dependencies:
cd ComfyUI_JR_DreamID-V
pip install -r requirements.txt

📦 Model Downloads & Setup

Model preparation is identical to the original project.

1. Wan2.1-T2V-1.3B Base Model

Download: 🤗 https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B

Directory layout:

ComfyUI/models/Wan/Wan2.1-T2V-1.3B/
├── models_t5_umt5-xxl-enc-bf16.pth
├── Wan2.1_VAE.pth
└── google/umt5-xxl/

2. DreamID-V Model

Download: 🤗 https://huggingface.co/XuGuo699/DreamID-V

Directory:

ComfyUI/models/DreamID-V/
└── dreamidv.pth

3. DreamID-V Wan-Faster Model (Required for wan_faster backend)

If you plan to use the Wan-Faster backend, an additional model file is required.

Download

Download dreamidv_faster.pth from the official DreamID-V repository:

👉 https://github.com/bytedance/DreamID-V
(See the Wan-1.3B-Faster section in the upstream README)

⚠️ This repository does NOT redistribute dreamidv_faster.pth. Please download it directly from the original authors.

Placement

Directory:

ComfyUI/models/DreamID-V/
├── dreamidv.pth
└── dreamidv_faster.pth

Usage Notes

  • dreamidv_faster.pth is only required when using: backend = wan_faster

  • The standard backend (wan) continues to use dreamidv.pth

  • The loader will automatically select the correct checkpoint based on the selected backend


🚀 Usage (JR Recommended)

  1. Add JR_DreamID-V_Loader

  2. Select T5 device:

    • cuda:1 (recommended for dual-GPU)
    • cuda:0
    • cpu (low-VRAM / fallback)
  3. Add JR_DreamID-V_Sampler

  4. Connect:

    • pipeline
    • video
    • ref_image
  5. Configure parameters and run


💻 System Requirements (JR Fork)

  • GPU:

    • ✅ RTX 4060 Ti / RTX 4080 (16GB tested)
    • ✅ Dual-GPU setups supported
  • Python: 3.8+ (3.10+ recommended)

  • CUDA: 11.7+

  • ComfyUI: Latest version

⚠️ Larger VRAM improves performance, but a 4090 / 5090 is NOT required.


🚀 Using Wan-Faster Backend (Recommended)

  1. Add JR_DreamID-V_Loader
  2. Add JR_DreamID-V_Sampler or JR_DreamID-V_LongVideo_Sampler
  3. In the Sampler node:
    • Set backend = wan_faster
    • Set sampling_steps = 12
    • Ensure sample_solver = unipc
  4. (LongVideo only) Set:
    • fps = 16 (recommended)
  5. Run the workflow

Notes

  • wan_faster does not use pose reference video internally.
  • Reference inputs are limited to:
    • source video
    • face mask video
    • reference image
  • Progress is reported via ComfyUI's native green progress bar.

⚠️ MediaPipe Dependency (Optional)

MediaPipe is NOT required when:

  • Using DWPose
  • Using wan_faster
  • Using Python ≥ 3.12 (MediaPipe is incompatible there)

The JR fork makes MediaPipe optional:

  • If MediaPipe is unavailable → it is automatically skipped
  • No impact on DWPose-based workflows
  • No impact on the Wan-Faster backend

🧠 Key Design Decision (JR Fork)

Pose extraction, text encoding, and diffusion are fully decoupled

  • T5 can run on CPU
  • DWPose runs via ONNXRuntime (GPU)
  • Wan-Faster requires FlashAttention 2
  • No cross-device interference

πŸ“ License & Fork Statement

  • Licensed under Apache License 2.0
  • This repository is a fork of:
HM-RunningHub / ComfyUI_RH_DreamID-V

Original copyright belongs to the original authors. All modifications in this repository are made under the terms of Apache-2.0.


πŸ™ Acknowledgements

  • DreamID-V (ByteDance)
  • Wan Team
  • ComfyUI
  • Original RunningHub project authors

Special thanks to the original DreamID-V authors for introducing the Wan-1.3B-Faster model and inference pipeline, which enables significantly faster generation with reduced sampling steps.


⚠️ Disclaimer

This project is for research and educational purposes only. Please comply with local laws and regulations. Do not use this project for illegal activities or rights-infringing purposes.


If you find this JR fork helpful, please consider giving it a ⭐ Star!
