ComfyUI_JR_DreamID-V is a JR-maintained fork of HM-RunningHub / ComfyUI_RH_DreamID-V.
This project is based on DreamID-V and provides high-fidelity video face swapping inside ComfyUI. Unlike the upstream project, which primarily targets high-end GPUs (RTX 4090 / 5090), this JR fork focuses on broader hardware accessibility and stability.
The goal: make DreamID-V usable for more people, not only owners of top-tier GPUs.
This fork is designed to:
- ✅ Support 16GB VRAM GPUs (e.g. RTX 4060 Ti, RTX 4080)
- ✅ Support dual-GPU setups (e.g. T5 encoder on a second GPU)
- ✅ Support CPU / GPU mixed offloading
- ✅ Reduce OOM issues on mid-range hardware
- ✅ Preserve full compatibility with existing RunningHub workflows
- ✅ Run a 720×1280, 16 FPS, 5-second face swap on an RTX 5060 Ti 16GB
- **High-Fidelity Face Swapping**: video face swapping powered by a Diffusion Transformer
- **Video-Driven Motion**: use a video as the motion / pose driver
- **Reference Image Identity**: a single face image as the identity reference
- **Native ComfyUI Integration**: seamlessly integrated into ComfyUI workflows
- **Low-VRAM Friendly (JR Fork)**: flexible device placement for T5 / main model / VAE
- **DWPose (ONNX, GPU-Accelerated) Pose Extraction (JR Fork)**: replaces legacy MediaPipe pose extraction with DWPose (ONNXRuntime). Supports CUDA / TensorRT acceleration, significantly improving stability and accuracy on complex motion videos.
- **Wan-Faster Version (JR Fork)**: introduces DreamID-V-Wan-1.3B-DWPose, significantly improving stability and robustness in pose extraction.
-
Added:
- `target_long_edge`: scales by the longer edge (in pixels) while preserving the original aspect ratio, eliminating manual width/height calculations.

Kept:
- `VIDEO` output unchanged: remains drop-in compatible with the existing sampler pipeline (no downstream changes required).

Added:
- Three extra output ports for easier downstream wiring and dynamic parameter linkage:
  - `out_w`: final output width
  - `out_h`: final output height
  - `out_fps`: final output FPS

Normalized:
- `out_fps` to an integer (`int`, using `round`) for improved consistency and downstream compatibility.

Note: when `target_long_edge` is set (> 0), it takes priority and the shorter edge is computed automatically. `scale_w` / `scale_h` only apply when `target_long_edge` is disabled.
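As a rough illustration of this sizing rule, here is a minimal Python sketch. The parameter names mirror the node's inputs; the rounding policy (and any snapping to even or multiple-of-8 dimensions the node may apply) is an assumption, not the node's actual code:

```python
def compute_output_size(src_w: int, src_h: int,
                        target_long_edge: int = 0,
                        scale_w: int = 0, scale_h: int = 0) -> tuple[int, int]:
    # target_long_edge takes priority: scale the longer edge to the target
    # while preserving aspect ratio; the shorter edge falls out automatically.
    if target_long_edge > 0:
        s = target_long_edge / max(src_w, src_h)
        return round(src_w * s), round(src_h * s)
    # Otherwise fall back to the explicit width / height inputs.
    return scale_w, scale_h

# 1920x1080 source, long edge 1280 -> (1280, 720)
print(compute_output_size(1920, 1080, target_long_edge=1280))
```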
This plugin provides two sets of nodes, with full backward compatibility.
| Node Name | Description |
|---|---|
| `JR_DreamID-V_Loader` | Load DreamID-V pipeline (device-selectable) |
| `JR_DreamID-V_Sampler` | Run video face swapping |
| `JR_DreamID-V_LongVideo_Sampler` | Run long video face swapping via chunking (recommended for long videos) |
| Node Name | Description |
|---|---|
| `RunningHub_DreamID-V_Loader` | Legacy loader (for old workflows) |
| `RunningHub_DreamID-V_Sampler` | Legacy sampler |
💡 New workflows should use the JR nodes. Existing workflows will continue to work without modification.
This fork integrates DreamID-V Wan-1.3B-Faster as an optional backend, providing significantly faster inference with reduced sampling steps, while maintaining identity fidelity.
| Parameter | Recommended | Notes |
|---|---|---|
| `backend` | `wan_faster` | Must be explicitly selected in the Sampler |
| `sampling_steps` | 12 | Faster model is trained for short schedules |
| `fps` (LongVideo) | 16 | Strong speed / quality balance |
| `sample_solver` | `unipc` | Required (others not supported) |
| `frame_num` | 81 | Same as standard DreamID-V |
| `overlap_frames` | 8–12 | Recommended for long videos |
⚠️ Using higher sampling steps (e.g. 20+) with `wan_faster` provides no quality benefit and only increases runtime.
`JR_DreamID-V_LongVideo_Sampler` is designed for long videos that would otherwise OOM when processed as a single clip.
It splits the input into chunks, processes each chunk sequentially, writes intermediate frames to disk, and finally merges them into a single output video.
- `frame_num`: chunk size in frames.
  Example: 1620 total frames with `frame_num=81` → 20 chunks.

  💡 Wan-Faster recommendation: when using `backend=wan_faster`, `fps=16` is strongly recommended for the best speed/quality tradeoff.

- `fps` (LongVideo only):
  - `-1` (default): follow the source video FPS (no resampling)
  - `>= 1`: time-based FPS resampling before pose/mask extraction and inference
    - Keeps the original duration (no slow motion)
    - Reduces the number of frames processed by the model
    - Significantly improves speed for high-FPS sources (e.g., 60 → 24)
- `overlap_frames`: warm-up overlap for temporal stability.
  For chunk `i > 0`, prepend `overlap_frames` frames from the end of the previous chunk for stability only, then drop the overlapped frames from the output to avoid duplicates (see the sketch after this list).
- `return_frames_as_images` / `max_frames_to_return`: optional frame tensor output.
  Recommended to keep disabled for long videos; use `frames_dir` instead.
- `keep_temp`: keep intermediate chunk files on disk for debugging.
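For intuition, here is a minimal sketch of how the chunk boundaries and warm-up overlap could be computed. `chunk_ranges` is a hypothetical helper for illustration, not the node's actual implementation:

```python
def chunk_ranges(total_frames: int, frame_num: int, overlap_frames: int):
    """Yield (start, end, n_overlap) source-frame ranges for each chunk.

    Chunk i > 0 is prepended with `overlap_frames` warm-up frames taken
    from the end of the previous chunk; those frames are dropped from the
    output, so every source frame appears exactly once in the final video.
    """
    start = 0
    while start < total_frames:
        n_overlap = overlap_frames if start > 0 else 0
        end = min(start + frame_num, total_frames)
        yield max(0, start - n_overlap), end, n_overlap
        start = end

# Example from above: 1620 frames, frame_num=81 -> 20 chunks.
print(sum(1 for _ in chunk_ranges(1620, 81, 8)))  # 20
```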
Outputs:
- `video`: final merged MP4 (audio is muxed from the original input video)
- `frames_dir`: directory containing merged PNG frames (`frame_%08d.png`) for downstream processing
- `frames`: optional IMAGE batch (guarded by `max_frames_to_return`)
This node uses FFmpeg/FFprobe for probing, cutting, and encoding.
Make sure `ffmpeg` and `ffprobe` are available in your system PATH.
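A quick way to confirm both binaries are discoverable is a small standard-library check (the node itself may locate them differently):

```python
import shutil

# Both binaries must resolve on PATH for probing, cutting, and encoding.
for tool in ("ffmpeg", "ffprobe"):
    print(tool, "->", shutil.which(tool) or "NOT FOUND")
```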
This section explains why the JR fork can run higher-resolution / longer videos on mid-range GPUs. Full technical details and implementation notes are available in:
📄 `/docs/DreamID-V Memory & Performance Optimizations_01022026.md`
Unlike the original DreamID-V implementation, which primarily targets high-end GPUs (RTX 4090 / 5090), the JR fork introduces a series of architecture-aware, inference-time optimizations that significantly reduce peak VRAM usage without sacrificing output quality.
These optimizations are not simple parameter tweaks, but targeted improvements based on how DiT / Transformer-based diffusion models behave at scale.
DreamID-V uses a Diffusion Transformer (DiT) backbone. In DiT models, the Feed-Forward Network (FFN) inside each Transformer block is the largest source of peak VRAM usage.
The JR fork introduces FFN Chunking:
- Splits the token dimension N into smaller chunks
- Computes FFN activations chunk-by-chunk
- Reassembles the output without numerical approximation
✅ Exact same output
✅ Significantly lower peak VRAM
✅ No quality loss
Typical runtime log:

```
[FFN-Chunk] patched 30 FFN blocks with auto-chunking
[FFN-Chunk] auto select chunks=8 for N=10800
```
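Conceptually, FFN chunking looks like the following sketch, assuming a standard per-token FFN (Linear → GELU → Linear). Names are illustrative; this is not the fork's exact code:

```python
import torch

def ffn_chunked(ffn: torch.nn.Module, x: torch.Tensor, chunks: int) -> torch.Tensor:
    # x: (B, N, C). The FFN acts independently on each token, so splitting
    # along the token dimension N and concatenating the results is exact:
    # same output as ffn(x), but peak activation memory drops roughly by
    # a factor of `chunks`.
    return torch.cat([ffn(part) for part in x.chunk(chunks, dim=1)], dim=1)

ffn = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.GELU(),
                          torch.nn.Linear(256, 64))
x = torch.randn(1, 10800, 64)
assert torch.allclose(ffn_chunked(ffn, x, 8), ffn(x), atol=1e-6)
```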
In DreamID-V (DiT-based), memory usage scales with the number of tokens (N), not just resolution.
Simplified:

N ≈ T × (H / stride) × (W / stride)

Where:
- `T` = number of frames processed in a chunk
- `H`, `W` = spatial resolution
- `stride` = model patch / VAE stride (typically 8 or 16)
As resolution and frame count increase, N grows rapidly, directly impacting FFN and Attention memory usage.
The JR fork dynamically selects FFN chunk counts based on N, instead of using a fixed configuration.
Based on extensive testing, the following practical rule-of-thumb is used:
recommended_chunks ≈ ceil(N / 4000), clamped to the range [2, 16]
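In code, the rule of thumb amounts to the following sketch. Note that the runtime log above shows the shipped auto-tuner choosing 8 for N=10800, so treat this formula as a floor rather than an exact reproduction of the fork's tuner:

```python
import math

def recommended_ffn_chunks(n_tokens: int) -> int:
    # ceil(N / 4000), clamped to [2, 16]
    return max(2, min(16, math.ceil(n_tokens / 4000)))

print(recommended_ffn_chunks(22000))  # -> 6
```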
| Resolution | Frames per chunk | Typical N | Recommended FFN Chunks |
|---|---|---|---|
| 512p | 16 | ~6k | 2–4 |
| 720p | 16 | ~9k | 4–6 |
| 1024p | 16 | ~11–14k | 6–8 |
| 1280p | 16 | ~18–22k | 8–12 |
| 1280p | 41 | ~25–28k | 8–16 |
💡 Practical sweet spot: for most 1024p–1280p workloads, `FFN chunks = 8` offers the best stability/performance balance.
The original implementation encodes and decodes entire video sequences at once, which often causes:
- Sudden VRAM spikes
- CUDA allocator fragmentation
- OOM in `conv` / `F.pad` layers

The JR fork adds:
- Temporal chunking along the time dimension for VAE encode
- Automatic fallback to smaller temporal chunks on OOM
- An optimized decode path that avoids incremental `torch.cat` (a common VRAM trap)
Result:
✅ Lower peak VRAM
✅ Much higher stability on long videos
✅ Identical decoded output
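A simplified sketch of the encode path with OOM fallback, assuming `vae_encode` maps a batch of frames to latents. The real implementation also has to respect the VAE's temporal receptive field, which this sketch ignores:

```python
import torch

def encode_video_chunked(vae_encode, frames: torch.Tensor, chunk: int = 16) -> torch.Tensor:
    # frames: (T, C, H, W). Encode `chunk` frames at a time, collect results
    # in a list, and concatenate once at the end -- incremental torch.cat
    # inside the loop would repeatedly reallocate ever-larger buffers.
    while chunk >= 1:
        try:
            parts = [vae_encode(frames[i:i + chunk])
                     for i in range(0, frames.shape[0], chunk)]
            return torch.cat(parts, dim=0)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()   # free the failed allocation, then retry
            chunk //= 2                # automatic fallback to smaller chunks
    raise RuntimeError("VAE encode OOM even at chunk size 1")
```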
A long-standing community trick:
"Run 1 second first, then run the full video."
The JR fork formalizes this into an explicit warmup pass:
- Same resolution and backend as the real run
- Very short duration (e.g. 1 second)
- Very few steps (e.g. 4)
- No output is saved
- No `empty_cache()` is called
Purpose:
- Initialize CUDA context
- Warm up kernels / autotune paths
- Stabilize the memory allocator
This dramatically improves first-chunk stability and reduces unexplained early OOMs.
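Schematically, a warmup pass boils down to the sketch below. `pipeline.run` and its parameters are hypothetical placeholders for whichever sampler you use:

```python
def warmup_pass(pipeline, width: int, height: int, fps: int = 16) -> None:
    # Same resolution and backend as the real run, ~1 second of frames,
    # very few steps; the result is discarded.
    pipeline.run(width=width, height=height,
                 frame_num=fps,       # roughly 1 second of video
                 sampling_steps=4)    # just enough to touch every kernel
    # Deliberately NO torch.cuda.empty_cache() here: the allocator's warm
    # state is exactly what the warmup is meant to establish.
```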
The JR fork follows a strict cache policy:
| Location | Clear Cache? |
|---|---|
| Inside FFN / Attention | ❌ |
| During sampling steps | ❌ |
| During VAE encode/decode | ❌ |
| Between video chunks | ✅ (once per chunk) |
This preserves the allocator's "warm state" while preventing long-term memory accumulation.
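In pseudocode form, the policy reduces to a single `empty_cache()` per chunk (a sketch, not the fork's actual loop):

```python
import torch

def process_chunks(chunks, run_chunk):
    outputs = []
    for c in chunks:
        # No empty_cache() inside FFN/attention, sampling steps,
        # or VAE encode/decode -- only here, once per finished chunk.
        outputs.append(run_chunk(c))
        torch.cuda.empty_cache()
    return outputs
```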
5-second video, 720×1280 resolution, 16 FPS (Wan-Faster):
| Metric | Before JR Optimizations | After JR Optimizations |
|---|---|---|
| Cold start | OOM | Stable |
| Peak VRAM | ~17–18 GB | ~15–16 GB |
| Inference VRAM | ~99% | ~8–10 GB |
| Runtime | 20+ min / unstable | ~18 min stable |
| Output quality | Baseline | Identical |
For a deep dive into implementation details, including:
- FFN chunk patching
- Token dimension analysis
- VAE stride behavior
- Decode memory pitfalls
- Warmup design rationale
See: 📄 `/docs/DreamID-V Memory & Performance Optimizations_01022026.md`
These optimizations transform DreamID-V from:

❌ "Only usable on flagship GPUs"

into:

✅ "Predictable, tunable, and stable on 16GB-class hardware"

without modifying the trained model or sacrificing visual fidelity.
This section is critical. Please read it before using `wan_faster`.
The `wan_faster` backend in DreamID-V relies on FlashAttention 2 (FA2) for its attention implementation.
- FA2 is not optional
- FA2 is not automatically installed
- Without FA2, `wan_faster` will crash at runtime
If FlashAttention 2 is missing or incompatible, you will see errors such as:
```
AssertionError: FLASH_ATTN_2_AVAILABLE
```

or

```
CUDA error: no kernel image is available for execution on the device
```
FlashAttention 2 requires Ampere or newer GPUs.
| GPU | Compute Capability | FA2 Support |
|---|---|---|
| RTX 3060 | SM 8.6 | ✅ Supported |
| RTX 3080 / 3090 | SM 8.6 | ✅ Supported |
| RTX 40xx | SM 8.9 | ✅ Supported |
| RTX 20xx | SM 7.5 | ❌ Not supported |
Check your GPU capability:

```bash
python -c "import torch; print(torch.cuda.get_device_name(0)); print(torch.cuda.get_device_capability(0))"
```
⚠️ FlashAttention wheels are CUDA & Python version specific.

Recommended environment (tested):
- Python 3.10–3.12
- CUDA 12.x / 13.0
- PyTorch CUDA build
Install:

```bash
pip install flash-attn==2.8.2 --no-build-isolation
```

Then verify:

```bash
python - <<EOF
import torch
from flash_attn import flash_attn_func
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
o = flash_attn_func(q, k, v, dropout_p=0.0, causal=False)
print("FlashAttention OK:", o.shape)
EOF
```

If this test passes, Wan-Faster will work.
You can also use prebuilt wheels from: https://github.com/Goldlionren/AI-windows-whl.git
| Backend | FlashAttention 2 | Notes |
|---|---|---|
| `wan` | ❌ Not required | Slower, more compatible |
| `wan_faster` | ✅ Required | Faster, fewer steps |
If your GPU does not support FA2, use:

```
backend = wan
```
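If you want to select the backend programmatically, a simple availability probe might look like this (a sketch; note that a successful import does not guarantee your GPU is Ampere or newer):

```python
def pick_backend() -> str:
    try:
        from flash_attn import flash_attn_func  # noqa: F401
        return "wan_faster"   # FA2 importable; GPU must still be Ampere+
    except ImportError:
        return "wan"          # slower but compatible fallback

print("backend =", pick_backend())
```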
This JR fork integrates DWPose (ONNX-based) as the default pose extraction backend, replacing the legacy MediaPipe FaceMesh pipeline.
- ✅ Much higher robustness on fast / complex motions
- ✅ Fewer "no pose detected" failures
- ✅ GPU-accelerated via ONNX Runtime (CUDA / TensorRT)
- ✅ Fully independent from PyTorch device placement
- Default backend: `dwpose`
- Automatic fallback: MediaPipe (if ONNXRuntime or GPU is unavailable)
- Device independence:
  - T5 can run on CPU
  - DWPose can still run on GPU
  - No cross-interference between PyTorch and ONNXRuntime
Place the following ONNX models under:

```
ComfyUI/models/DreamID-V/pose/models/
├── dw-ll_ucoco_384.onnx
└── yolox_l.onnx
```
The JR fork supports automatic download of the DWPose ONNX models.
Enable it by setting the environment variable:

```
DREAMIDV_AUTO_DOWNLOAD_DWPOSE=1
```

If disabled, a clear error message will indicate which files are missing and where to place them.
- Supported providers:
  - `CUDAExecutionProvider`
  - `TensorrtExecutionProvider` (if available)
  - `CPUExecutionProvider` (fallback)
Runtime log example:

```
[DWPose] det providers : ['CUDAExecutionProvider', 'CPUExecutionProvider']
[DWPose] pose providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
```
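To check which providers your ONNX Runtime build actually exposes, you can use the real `onnxruntime` API:

```python
import onnxruntime as ort

# 'CUDAExecutionProvider' must appear here for GPU-accelerated DWPose;
# otherwise sessions silently fall back to CPUExecutionProvider.
print(ort.get_available_providers())
```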
- Install ComfyUI (Python ≥ 3.10 recommended)
- Clone this repository into:

```
ComfyUI/custom_nodes/ComfyUI_JR_DreamID-V
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Download the required DreamID-V and DWPose models
- Launch ComfyUI
💡 Note:
Selecting `cpu` for T5 does NOT affect DWPose.
Pose extraction runs via ONNXRuntime and can still use GPU acceleration.
- OS: Windows / Linux
- GPU: NVIDIA (16GB VRAM recommended)
- Python: 3.10+
- PyTorch: CUDA-enabled build
- ONNX Runtime:
  - `onnxruntime` (CPU)
  - `onnxruntime-gpu` (recommended for GPU acceleration)
Compared to the original DreamID-V:
- ✅ DWPose (ONNX) replaces MediaPipe for pose extraction
- ✅ GPU-accelerated pose detection (CUDA / TensorRT)
- ✅ Clear separation of T5 / Pose / UNet devices
- ✅ Improved stability on real-world videos
- Install ComfyUI Manager
- Search for `ComfyUI_JR_DreamID-V`
- Install
- Navigate to the ComfyUI `custom_nodes` directory:

```bash
cd ComfyUI/custom_nodes
```

- Clone the JR fork:

```bash
git clone https://github.com/<your-github-username>/ComfyUI_JR_DreamID-V.git
```

- Install dependencies:

```bash
cd ComfyUI_JR_DreamID-V
pip install -r requirements.txt
```

Model preparation is identical to the original project.
Download: 🤗 https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B

Directory layout:

```
ComfyUI/models/Wan/Wan2.1-T2V-1.3B/
├── models_t5_umt5-xxl-enc-bf16.pth
├── Wan2.1_VAE.pth
└── google/umt5-xxl/
```
Download: 🤗 https://huggingface.co/XuGuo699/DreamID-V

Directory:

```
ComfyUI/models/DreamID-V/
└── dreamidv.pth
```
If you plan to use the Wan-Faster backend, an additional model file is required.
Download dreamidv_faster.pth from the official DreamID-V repository:
https://github.com/bytedance/DreamID-V
(See the Wan-1.3B-Faster section in the upstream README)
⚠️ This repository does NOT redistribute `dreamidv_faster.pth`. Please download it directly from the original authors.
Directory:

```
ComfyUI/models/DreamID-V/
├── dreamidv.pth
└── dreamidv_faster.pth
```
- `dreamidv_faster.pth` is only required when using `backend = wan_faster`
- The standard backend (`wan`) continues to use `dreamidv.pth`
- The loader automatically selects the correct checkpoint based on the selected backend
- Add `JR_DreamID-V_Loader`
- Select the T5 device:
  - `cuda:1` (recommended for dual-GPU)
  - `cuda:0`
  - `cpu` (low-VRAM / fallback)
- Add `JR_DreamID-V_Sampler`
- Connect:
  - `pipeline`
  - `video`
  - `ref_image`
- Configure parameters and run
- GPU:
  - ✅ RTX 4060 Ti / RTX 4080 (16GB tested)
  - ✅ Dual-GPU setups supported
- Python: 3.8+
- CUDA: 11.7+
- ComfyUI: latest version

⚠️ Larger VRAM improves performance, but a 4090 / 5090 is NOT required.
- Add `JR_DreamID-V_Loader`
- Add `JR_DreamID-V_Sampler` or `JR_DreamID-V_LongVideo_Sampler`
- In the Sampler node:
  - Set `backend` = `wan_faster`
  - Set `sampling_steps` = 12
  - Ensure `sample_solver` = `unipc`
- (LongVideo only) Set `fps = 16` (recommended)
- Run the workflow
- `wan_faster` does not use a pose reference video internally.
- Reference inputs are limited to:
  - source video
  - face mask video
  - reference image
- Progress is reported via ComfyUI's native green progress bar.
MediaPipe is NOT required when:
- Using DWPose
- Using `wan_faster`
- Using Python ≥ 3.12 (MediaPipe is incompatible)
The JR fork makes MediaPipe optional:
- If MediaPipe is unavailable → automatically skipped
- No impact on DWPose-based workflows
- No impact on Wan-Faster backend
Pose extraction, text encoding, and diffusion are fully decoupled:
- T5 can run on CPU
- DWPose runs via ONNXRuntime (GPU)
- Wan-Faster requires FlashAttention 2
- No cross-device interference
- Licensed under Apache License 2.0
- This repository is a fork of:
HM-RunningHub / ComfyUI_RH_DreamID-V
Original copyright belongs to the original authors. All modifications in this repository are made under the terms of Apache-2.0.
- DreamID-V (ByteDance)
- Wan Team
- ComfyUI
- Original RunningHub project authors
Special thanks to the original DreamID-V authors for introducing the Wan-1.3B-Faster model and inference pipeline, which enables significantly faster generation with reduced sampling steps.
This project is for research and educational purposes only. Please comply with local laws and regulations. Do not use this project for illegal activities or rights-infringing purposes.
If you find this JR fork helpful, please consider giving it a ⭐ Star!