ComfyUI_JR_DreamID-V is a JR-maintained fork of HM-RunningHub / ComfyUI_RH_DreamID-V.
This project is based on DreamID-V and provides high-fidelity video face swapping inside ComfyUI. Unlike the upstream project, which primarily targets high-end GPUs (RTX 4090 / 5090), this JR fork focuses on broader hardware accessibility and stability.
The goal: make DreamID-V usable for more people, not only owners of top-tier GPUs.
This fork is designed to:
- ✅ Support 16GB VRAM GPUs (e.g. RTX 4060 Ti, RTX 4080)
- ✅ Support dual-GPU setups (e.g. T5 encoder on a second GPU)
- ✅ Support CPU / GPU mixed offloading
- ✅ Reduce OOM issues on mid-range hardware
- ✅ Preserve full compatibility with existing RunningHub workflows
- ✅ Run a 720×1280, 16 FPS, 5-second face swap on an RTX 5060 Ti 16GB
- **High-Fidelity Face Swapping**: video face swapping powered by a Diffusion Transformer
- **Video-Driven Motion**: use a video as the motion / pose driver
- **Reference Image Identity**: a single face image as the identity reference
- **Native ComfyUI Integration**: seamlessly integrated into ComfyUI workflows
- **Low-VRAM Friendly (JR Fork)**: flexible device placement for T5 / main model / VAE
- **DWPose (ONNX, GPU-Accelerated) Pose Extraction (JR Fork)**: replaces legacy MediaPipe pose extraction with DWPose (ONNXRuntime). Supports CUDA / TensorRT acceleration, significantly improving stability and accuracy on complex motion videos.
- **Wan-Faster Version (JR Fork)**: introduces DreamID-V-Wan-1.3B-DWPose, significantly improving stability and robustness in pose extraction.
-
Added:
- `target_long_edge`: scales by the longer edge (in pixels) while preserving the original aspect ratio, eliminating manual width/height calculations.

Kept:
- `VIDEO` output unchanged: remains drop-in compatible with the existing sampler pipeline (no downstream changes required).

Added:
- Three extra output ports for easier downstream wiring and dynamic parameter linkage:
  - `out_w`: final output width
  - `out_h`: final output height
  - `out_fps`: final output FPS

Normalized:
- `out_fps` to an integer (`int`, using `round`) for improved consistency and downstream compatibility.

Note: when `target_long_edge` is set (> 0), it takes priority and the shorter edge is computed automatically. `scale_w` / `scale_h` only apply when `target_long_edge` is disabled.
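As a rough illustration of this sizing rule, here is a minimal Python sketch. The parameter names mirror the node's inputs; the rounding policy (and any snapping to even or multiple-of-8 dimensions the node may apply) is an assumption, not the node's actual code:

```python
def compute_output_size(src_w: int, src_h: int,
                        target_long_edge: int = 0,
                        scale_w: int = 0, scale_h: int = 0) -> tuple[int, int]:
    # target_long_edge takes priority: scale the longer edge to the target
    # while preserving aspect ratio; the shorter edge falls out automatically.
    if target_long_edge > 0:
        s = target_long_edge / max(src_w, src_h)
        return round(src_w * s), round(src_h * s)
    # Otherwise fall back to the explicit width / height inputs.
    return scale_w, scale_h

# 1920x1080 source, long edge 1280 -> (1280, 720)
print(compute_output_size(1920, 1080, target_long_edge=1280))
```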
This plugin provides two sets of nodes, with full backward compatibility.
| Node Name | Description |
|---|---|
| `JR_DreamID-V_Loader` | Load DreamID-V pipeline (device-selectable) |
| `JR_DreamID-V_Sampler` | Run video face swapping |
| `JR_DreamID-V_LongVideo_Sampler` | Run long video face swapping via chunking (recommended for long videos) |
| Node Name | Description |
|---|---|
| `RunningHub_DreamID-V_Loader` | Legacy loader (for old workflows) |
| `RunningHub_DreamID-V_Sampler` | Legacy sampler |
💡 New workflows should use the JR nodes. Existing workflows will continue to work without modification.
This fork integrates DreamID-V Wan-1.3B-Faster as an optional backend, providing significantly faster inference with reduced sampling steps, while maintaining identity fidelity.
| Parameter | Recommended | Notes |
|---|---|---|
| `backend` | `wan_faster` | Must be explicitly selected in the Sampler |
| `sampling_steps` | 12 | Faster model is trained for short schedules |
| `fps` (LongVideo) | 16 | Strong speed / quality balance |
| `sample_solver` | `unipc` | Required (others not supported) |
| `frame_num` | 81 | Same as standard DreamID-V |
| `overlap_frames` | 8–12 | Recommended for long videos |
⚠️ Using higher sampling steps (e.g. 20+) with `wan_faster` provides no quality benefit and only increases runtime.
`JR_DreamID-V_LongVideo_Sampler` is designed for long videos that would otherwise OOM when processed as a single clip.
It splits the input into chunks, processes each chunk sequentially, writes intermediate frames to disk, and finally merges them into a single output video.
- `frame_num`: chunk size in frames.
  Example: 1620 total frames with `frame_num=81` → 20 chunks.

  💡 Wan-Faster recommendation: when using `backend=wan_faster`, `fps=16` is strongly recommended for the best speed/quality tradeoff.

- `fps` (LongVideo only):
  - `-1` (default): follow the source video FPS (no resampling)
  - `>= 1`: time-based FPS resampling before pose/mask extraction and inference
    - Keeps the original duration (no slow motion)
    - Reduces the number of frames processed by the model
    - Significantly improves speed for high-FPS sources (e.g., 60 → 24)
- `overlap_frames`: warm-up overlap for temporal stability.
  For chunk `i > 0`, prepend `overlap_frames` frames from the end of the previous chunk for stability only, then drop the overlapped frames from the output to avoid duplicates (see the sketch after this list).
- `return_frames_as_images` / `max_frames_to_return`: optional frame tensor output.
  Recommended to keep disabled for long videos; use `frames_dir` instead.
- `keep_temp`: keep intermediate chunk files on disk for debugging.
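For intuition, here is a minimal sketch of how the chunk boundaries and warm-up overlap could be computed. `chunk_ranges` is a hypothetical helper for illustration, not the node's actual implementation:

```python
def chunk_ranges(total_frames: int, frame_num: int, overlap_frames: int):
    """Yield (start, end, n_overlap) source-frame ranges for each chunk.

    Chunk i > 0 is prepended with `overlap_frames` warm-up frames taken
    from the end of the previous chunk; those frames are dropped from the
    output, so every source frame appears exactly once in the final video.
    """
    start = 0
    while start < total_frames:
        n_overlap = overlap_frames if start > 0 else 0
        end = min(start + frame_num, total_frames)
        yield max(0, start - n_overlap), end, n_overlap
        start = end

# Example from above: 1620 frames, frame_num=81 -> 20 chunks.
print(sum(1 for _ in chunk_ranges(1620, 81, 8)))  # 20
```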
Outputs:
- `video`: final merged MP4 (audio is muxed from the original input video)
- `frames_dir`: directory containing merged PNG frames (`frame_%08d.png`) for downstream processing
- `frames`: optional IMAGE batch (guarded by `max_frames_to_return`)
This node uses FFmpeg/FFprobe for probing, cutting, and encoding.
Make sure `ffmpeg` and `ffprobe` are available in your system PATH.
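A quick way to confirm both binaries are discoverable is a small standard-library check (the node itself may locate them differently):

```python
import shutil

# Both binaries must resolve on PATH for probing, cutting, and encoding.
for tool in ("ffmpeg", "ffprobe"):
    print(tool, "->", shutil.which(tool) or "NOT FOUND")
```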
This section explains why the JR fork can run higher-resolution / longer videos on mid-range GPUs. Full technical details and implementation notes are available in:
📄 `/docs/DreamID-V Memory & Performance Optimizations_01022026.md`
Unlike the original DreamID-V implementation, which primarily targets high-end GPUs (RTX 4090 / 5090), the JR fork introduces a series of architecture-aware, inference-time optimizations that significantly reduce peak VRAM usage without sacrificing output quality.
These optimizations are not simple parameter tweaks, but targeted improvements based on how DiT / Transformer-based diffusion models behave at scale.
DreamID-V uses a Diffusion Transformer (DiT) backbone. In DiT models, the Feed-Forward Network (FFN) inside each Transformer block is the largest source of peak VRAM usage.
The JR fork introduces FFN Chunking:
- Splits the token dimension N into smaller chunks
- Computes FFN activations chunk-by-chunk
- Reassembles the output without numerical approximation
✅ Exact same output
✅ Significantly lower peak VRAM
✅ No quality loss
Typical runtime log:

```
[FFN-Chunk] patched 30 FFN blocks with auto-chunking
[FFN-Chunk] auto select chunks=8 for N=10800
```
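Conceptually, FFN chunking looks like the following sketch, assuming a standard per-token FFN (Linear → GELU → Linear). Names are illustrative; this is not the fork's exact code:

```python
import torch

def ffn_chunked(ffn: torch.nn.Module, x: torch.Tensor, chunks: int) -> torch.Tensor:
    # x: (B, N, C). The FFN acts independently on each token, so splitting
    # along the token dimension N and concatenating the results is exact:
    # same output as ffn(x), but peak activation memory drops roughly by
    # a factor of `chunks`.
    return torch.cat([ffn(part) for part in x.chunk(chunks, dim=1)], dim=1)

ffn = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.GELU(),
                          torch.nn.Linear(256, 64))
x = torch.randn(1, 10800, 64)
assert torch.allclose(ffn_chunked(ffn, x, 8), ffn(x), atol=1e-6)
```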
In DreamID-V (DiT-based), memory usage scales with the number of tokens (N), not just resolution.
Simplified:

N ≈ T × (H / stride) × (W / stride)

Where:
- `T` = number of frames processed in a chunk
- `H`, `W` = spatial resolution
- `stride` = model patch / VAE stride (typically 8 or 16)
As resolution and frame count increase, N grows rapidly, directly impacting FFN and Attention memory usage.
The JR fork dynamically selects FFN chunk counts based on N, instead of using a fixed configuration.
Based on extensive testing, the following practical rule-of-thumb is used:
recommended_chunks ≈ ceil(N / 4000), clamped to the range [2, 16]
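In code, the rule of thumb amounts to the following sketch. Note that the runtime log above shows the shipped auto-tuner choosing 8 for N=10800, so treat this formula as a floor rather than an exact reproduction of the fork's tuner:

```python
import math

def recommended_ffn_chunks(n_tokens: int) -> int:
    # ceil(N / 4000), clamped to [2, 16]
    return max(2, min(16, math.ceil(n_tokens / 4000)))

print(recommended_ffn_chunks(22000))  # -> 6
```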
| Resolution | Frames per chunk | Typical N | Recommended FFN Chunks |
|---|---|---|---|
| 512p | 16 | ~6k | 2–4 |
| 720p | 16 | ~9k | 4–6 |
| 1024p | 16 | ~11–14k | 6–8 |
| 1280p | 16 | ~18–22k | 8–12 |
| 1280p | 41 | ~25–28k | 8–16 |
💡 Practical sweet spot: for most 1024p–1280p workloads, `FFN chunks = 8` offers the best stability/performance balance.
The original implementation encodes and decodes entire video sequences at once, which often causes:
- Sudden VRAM spikes
- CUDA allocator fragmentation
- OOM in `conv` / `F.pad` layers

The JR fork adds:
- Temporal chunking along the time dimension for VAE encode
- Automatic fallback to smaller temporal chunks on OOM
- An optimized decode path that avoids incremental `torch.cat` (a common VRAM trap)
Result:
✅ Lower peak VRAM
✅ Much higher stability on long videos
✅ Identical decoded output
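A simplified sketch of the encode path with OOM fallback, assuming `vae_encode` maps a batch of frames to latents. The real implementation also has to respect the VAE's temporal receptive field, which this sketch ignores:

```python
import torch

def encode_video_chunked(vae_encode, frames: torch.Tensor, chunk: int = 16) -> torch.Tensor:
    # frames: (T, C, H, W). Encode `chunk` frames at a time, collect results
    # in a list, and concatenate once at the end -- incremental torch.cat
    # inside the loop would repeatedly reallocate ever-larger buffers.
    while chunk >= 1:
        try:
            parts = [vae_encode(frames[i:i + chunk])
                     for i in range(0, frames.shape[0], chunk)]
            return torch.cat(parts, dim=0)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()   # free the failed allocation, then retry
            chunk //= 2                # automatic fallback to smaller chunks
    raise RuntimeError("VAE encode OOM even at chunk size 1")
```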
A long-standing community trick:
"Run 1 second first, then run the full video."
The JR fork formalizes this into an explicit warmup pass:
- Same resolution and backend as the real run
- Very short duration (e.g. 1 second)
- Very few steps (e.g. 4)
- No output is saved
- No `empty_cache()` is called
Purpose:
- Initialize CUDA context
- Warm up kernels / autotune paths
- Stabilize the memory allocator
This dramatically improves first-chunk stability and reduces unexplained early OOMs.
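Schematically, a warmup pass boils down to the sketch below. `pipeline.run` and its parameters are hypothetical placeholders for whichever sampler you use:

```python
def warmup_pass(pipeline, width: int, height: int, fps: int = 16) -> None:
    # Same resolution and backend as the real run, ~1 second of frames,
    # very few steps; the result is discarded.
    pipeline.run(width=width, height=height,
                 frame_num=fps,       # roughly 1 second of video
                 sampling_steps=4)    # just enough to touch every kernel
    # Deliberately NO torch.cuda.empty_cache() here: the allocator's warm
    # state is exactly what the warmup is meant to establish.
```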
The JR fork follows a strict cache policy:
| Location | Clear Cache? |
|---|---|
| Inside FFN / Attention | ❌ |
| During sampling steps | ❌ |
| During VAE encode/decode | ❌ |
| Between video chunks | ✅ (once per chunk) |
This preserves the allocator's "warm state" while preventing long-term memory accumulation.
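In pseudocode form, the policy reduces to a single `empty_cache()` per chunk (a sketch, not the fork's actual loop):

```python
import torch

def process_chunks(chunks, run_chunk):
    outputs = []
    for c in chunks:
        # No empty_cache() inside FFN/attention, sampling steps,
        # or VAE encode/decode -- only here, once per finished chunk.
        outputs.append(run_chunk(c))
        torch.cuda.empty_cache()
    return outputs
```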
5-second video, 720×1280 resolution, 16 FPS (Wan-Faster):
| Metric | Before JR Optimizations | After JR Optimizations |
|---|---|---|
| Cold start | OOM | Stable |
| Peak VRAM | ~17–18 GB | ~15–16 GB |
| Inference VRAM | ~99% | ~8–10 GB |
| Runtime | 20+ min / unstable | ~18 min stable |
| Output quality | Baseline | Identical |
For a deep dive into implementation details, including:
- FFN chunk patching
- Token dimension analysis
- VAE stride behavior
- Decode memory pitfalls
- Warmup design rationale
See: 📄 `/docs/DreamID-V Memory & Performance Optimizations_01022026.md`
These optimizations transform DreamID-V from:

❌ "Only usable on flagship GPUs"

into:

✅ "Predictable, tunable, and stable on 16GB-class hardware"

without modifying the trained model or sacrificing visual fidelity.
This section is critical. Please read it before using `wan_faster`.
The `wan_faster` backend in DreamID-V relies on FlashAttention 2 (FA2) for its attention implementation.
- FA2 is not optional
- FA2 is not automatically installed
- Without FA2, `wan_faster` will crash at runtime
If FlashAttention 2 is missing or incompatible, you will see errors such as:
```
AssertionError: FLASH_ATTN_2_AVAILABLE
```

or

```
CUDA error: no kernel image is available for execution on the device
```
FlashAttention 2 requires Ampere or newer GPUs.
| GPU | Compute Capability | FA2 Support |
|---|---|---|
| RTX 3060 | SM 8.6 | ✅ Supported |
| RTX 3080 / 3090 | SM 8.6 | ✅ Supported |
| RTX 40xx | SM 8.9 | ✅ Supported |
| RTX 20xx | SM 7.5 | ❌ Not supported |
Check your GPU capability:

```bash
python -c "import torch; print(torch.cuda.get_device_name(0)); print(torch.cuda.get_device_capability(0))"
```
⚠️ FlashAttention wheels are CUDA & Python version specific.

Recommended environment (tested):
- Python 3.10–3.12
- CUDA 12.x / 13.0
- PyTorch CUDA build
Install:

```bash
pip install flash-attn==2.8.2 --no-build-isolation
```

Then verify:

```bash
python - <<EOF
import torch
from flash_attn import flash_attn_func
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
o = flash_attn_func(q, k, v, dropout_p=0.0, causal=False)
print("FlashAttention OK:", o.shape)
EOF
```

If this test passes, Wan-Faster will work.
You can also use prebuilt wheels from: https://github.com/Goldlionren/AI-windows-whl.git
| Backend | FlashAttention 2 | Notes |
|---|---|---|
| `wan` | ❌ Not required | Slower, more compatible |
| `wan_faster` | ✅ Required | Faster, fewer steps |
If your GPU does not support FA2, use:

```
backend = wan
```
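If you want to select the backend programmatically, a simple availability probe might look like this (a sketch; note that a successful import does not guarantee your GPU is Ampere or newer):

```python
def pick_backend() -> str:
    try:
        from flash_attn import flash_attn_func  # noqa: F401
        return "wan_faster"   # FA2 importable; GPU must still be Ampere+
    except ImportError:
        return "wan"          # slower but compatible fallback

print("backend =", pick_backend())
```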
This JR fork integrates DWPose (ONNX-based) as the default pose extraction backend, replacing the legacy MediaPipe FaceMesh pipeline.
- ✅ Much higher robustness on fast / complex motions
- ✅ Fewer "no pose detected" failures
- ✅ GPU-accelerated via ONNX Runtime (CUDA / TensorRT)
- ✅ Fully independent from PyTorch device placement
- Default backend: `dwpose`
- Automatic fallback: MediaPipe (if ONNXRuntime or GPU is unavailable)
- Device independence:
  - T5 can run on CPU
  - DWPose can still run on GPU
  - No cross-interference between PyTorch and ONNXRuntime
Place the following ONNX models under:

```
ComfyUI/models/DreamID-V/pose/models/
├── dw-ll_ucoco_384.onnx
└── yolox_l.onnx
```
The JR fork supports automatic download of the DWPose ONNX models.
Enable it by setting the environment variable:

```
DREAMIDV_AUTO_DOWNLOAD_DWPOSE=1
```

If disabled, a clear error message will indicate which files are missing and where to place them.
- Supported providers:
  - `CUDAExecutionProvider`
  - `TensorrtExecutionProvider` (if available)
  - `CPUExecutionProvider` (fallback)
Runtime log example:

```
[DWPose] det providers : ['CUDAExecutionProvider', 'CPUExecutionProvider']
[DWPose] pose providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
```
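To check which providers your ONNX Runtime build actually exposes, you can use the real `onnxruntime` API:

```python
import onnxruntime as ort

# 'CUDAExecutionProvider' must appear here for GPU-accelerated DWPose;
# otherwise sessions silently fall back to CPUExecutionProvider.
print(ort.get_available_providers())
```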
- Install ComfyUI (Python ≥ 3.10 recommended)
- Clone this repository into:

```
ComfyUI/custom_nodes/ComfyUI_JR_DreamID-V
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Download the required DreamID-V and DWPose models
- Launch ComfyUI
💡 Note:
Selecting `cpu` for T5 does NOT affect DWPose.
Pose extraction runs via ONNXRuntime and can still use GPU acceleration.
- OS: Windows / Linux
- GPU: NVIDIA (16GB VRAM recommended)
- Python: 3.10+
- PyTorch: CUDA-enabled build
- ONNX Runtime:
  - `onnxruntime` (CPU)
  - `onnxruntime-gpu` (recommended for GPU acceleration)
Compared to the original DreamID-V:
- ✅ DWPose (ONNX) replaces MediaPipe for pose extraction
- ✅ GPU-accelerated pose detection (CUDA / TensorRT)
- ✅ Clear separation of T5 / Pose / UNet devices
- ✅ Improved stability on real-world videos
- Install ComfyUI Manager
- Search for `ComfyUI_JR_DreamID-V`
- Install
- Navigate to the ComfyUI `custom_nodes` directory:

```bash
cd ComfyUI/custom_nodes
```

- Clone the JR fork:

```bash
git clone https://github.com/<your-github-username>/ComfyUI_JR_DreamID-V.git
```

- Install dependencies:

```bash
cd ComfyUI_JR_DreamID-V
pip install -r requirements.txt
```

Model preparation is identical to the original project.
Download: 🤗 https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B

Directory layout:

```
ComfyUI/models/Wan/Wan2.1-T2V-1.3B/
├── models_t5_umt5-xxl-enc-bf16.pth
├── Wan2.1_VAE.pth
└── google/umt5-xxl/
```
Download: 🤗 https://huggingface.co/XuGuo699/DreamID-V

Directory:

```
ComfyUI/models/DreamID-V/
└── dreamidv.pth
```
If you plan to use the Wan-Faster backend, an additional model file is required.
Download dreamidv_faster.pth from the official DreamID-V repository:
https://github.com/bytedance/DreamID-V
(See the Wan-1.3B-Faster section in the upstream README)
⚠️ This repository does NOT redistribute `dreamidv_faster.pth`. Please download it directly from the original authors.
Directory:

```
ComfyUI/models/DreamID-V/
├── dreamidv.pth
└── dreamidv_faster.pth
```
- `dreamidv_faster.pth` is only required when using `backend = wan_faster`
- The standard backend (`wan`) continues to use `dreamidv.pth`
- The loader automatically selects the correct checkpoint based on the selected backend
- Add `JR_DreamID-V_Loader`
- Select the T5 device:
  - `cuda:1` (recommended for dual-GPU)
  - `cuda:0`
  - `cpu` (low-VRAM / fallback)
- Add `JR_DreamID-V_Sampler`
- Connect:
  - `pipeline`
  - `video`
  - `ref_image`
- Configure parameters and run
- GPU:
  - ✅ RTX 4060 Ti / RTX 4080 (16GB tested)
  - ✅ Dual-GPU setups supported
- Python: 3.8+
- CUDA: 11.7+
- ComfyUI: latest version

⚠️ Larger VRAM improves performance, but a 4090 / 5090 is NOT required.
- Add `JR_DreamID-V_Loader`
- Add `JR_DreamID-V_Sampler` or `JR_DreamID-V_LongVideo_Sampler`
- In the Sampler node:
  - Set `backend` = `wan_faster`
  - Set `sampling_steps` = 12
  - Ensure `sample_solver` = `unipc`
- (LongVideo only) Set `fps = 16` (recommended)
- Run the workflow
- `wan_faster` does not use a pose reference video internally.
- Reference inputs are limited to:
  - source video
  - face mask video
  - reference image
- Progress is reported via ComfyUI's native green progress bar.
MediaPipe is NOT required when:
- Using DWPose
- Using `wan_faster`
- Using Python ≥ 3.12 (MediaPipe is incompatible)
The JR fork makes MediaPipe optional:
- If MediaPipe is unavailable → automatically skipped
- No impact on DWPose-based workflows
- No impact on Wan-Faster backend
Pose extraction, text encoding, and diffusion are fully decoupled:
- T5 can run on CPU
- DWPose runs via ONNXRuntime (GPU)
- Wan-Faster requires FlashAttention 2
- No cross-device interference
- Licensed under Apache License 2.0
- This repository is a fork of:
HM-RunningHub / ComfyUI_RH_DreamID-V
Original copyright belongs to the original authors. All modifications in this repository are made under the terms of Apache-2.0.
- DreamID-V (ByteDance)
- Wan Team
- ComfyUI
- Original RunningHub project authors
Special thanks to the original DreamID-V authors for introducing the Wan-1.3B-Faster model and inference pipeline, which enables significantly faster generation with reduced sampling steps.
This project is for research and educational purposes only. Please comply with local laws and regulations. Do not use this project for illegal activities or rights-infringing purposes.
If you find this JR fork helpful, please consider giving it a ⭐ Star!