BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
Zhaoyang Li 1,2, Dongjun Qian 2, Kai Su 2*, Qishuai Diao 2, Xiangyang Xia 2, Chang Liu 2, Wenfei Yang 1, Tianzhu Zhang 1*, Zehuan Yuan 2
1University of Science and Technology of China 2ByteDance
*Corresponding Author
- Nov 08, 2025: 🙏 Special thanks to Kijai for adapting ComfyUI for BindWeave and providing an FP8-quantized Hugging Face model! Feel free to try them out.
- Nov 04, 2025: 🔥 The BindWeave-Wan-14B model is now available on Hugging Face.
- Nov 04, 2025: 🔥 Released code for model inference and training.

- [x] Release inference code
- [x] Release checkpoint of BindWeave_Wan_14B
- [x] Release training code of BindWeave
BindWeave is a unified subject-consistent video generation framework for single- and multi-subject prompts, built on an MLLM-DiT architecture that couples a pretrained multimodal large language model with a diffusion transformer. It achieves cross-modal integration via entity grounding and representation alignment, leveraging the MLLM to parse complex prompts and produce subject-aware hidden states that condition the DiT for high-fidelity generation.
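To make the data flow concrete, here is a minimal sketch of the conditioning path. All module and tensor names below (`SubjectConditionedDiTBlock`, `video_tokens`, `subject_states`) are illustrative assumptions, not the released BindWeave API; the sketch only shows the general pattern of a DiT block cross-attending to subject-aware hidden states produced by an MLLM.

```python
# Illustrative sketch only: module names and shapes are assumptions,
# not the released BindWeave implementation.
import torch
import torch.nn as nn

class SubjectConditionedDiTBlock(nn.Module):
    """One DiT block that cross-attends to subject-aware MLLM states."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, subject_states):
        # Self-attention over spatiotemporal video tokens.
        x = video_tokens + self.self_attn(video_tokens, video_tokens, video_tokens)[0]
        # Cross-attention: video tokens attend to the MLLM's subject-aware
        # hidden states (the cross-modal integration step).
        x = x + self.cross_attn(x, subject_states, subject_states)[0]
        return x + self.mlp(x)

# Hypothetical shapes: batch of 2, 256 video tokens, 77 MLLM states, dim 512.
block = SubjectConditionedDiTBlock(dim=512)
video_tokens = torch.randn(2, 256, 512)    # noisy video latents, tokenized
subject_states = torch.randn(2, 77, 512)   # MLLM output for prompt + reference images
print(block(video_tokens, subject_states).shape)  # torch.Size([2, 256, 512])
```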
Before running the inference code, you need to download the original Wan2.1 14B model, since BindWeave depends on its components such as the VAE and text encoder.
- **Download the Pre-trained Model:** First, use the Hugging Face CLI to download the model weights. The commands below will place them in the `./pretrained_model/wanx/` directory.

  ```bash
  pip install "huggingface_hub[cli]"
  huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P-Diffusers --local-dir ./pretrained_model/wanx/Wan2.1-I2V-14B-720P-Diffusers
  ```
- **Update the Configuration File:** After the download is complete, you must update the configuration file at `configs/inference/inference_model_s2v.json`. Ensure that the paths for the following components correctly point to the directories you just downloaded (a sketch of this edit follows the list):
  - vae
  - tokenizer
  - text_encoder
  - image_encoder
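As a minimal sketch of that configuration edit, assuming the component paths are stored under top-level keys named after each component (the real file may nest them differently, so verify against the JSON you downloaded):

```python
# Illustrative only: the key layout of inference_model_s2v.json is an
# assumption; adjust the keys to match the actual file.
import json

CONFIG = "configs/inference/inference_model_s2v.json"
WAN_DIR = "./pretrained_model/wanx/Wan2.1-I2V-14B-720P-Diffusers"

with open(CONFIG) as f:
    cfg = json.load(f)

# Point each component at the corresponding Wan2.1 subdirectory.
for key in ("vae", "tokenizer", "text_encoder", "image_encoder"):
    cfg[key] = f"{WAN_DIR}/{key}"

with open(CONFIG, "w") as f:
    json.dump(cfg, f, indent=2)
```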
Then download the BindWeave model:

```bash
huggingface-cli download ByteDance/BindWeave --local-dir ./BindWeave_14B
```

After downloading the BindWeave model, you need to convert the transformer weights to the MM format. Run the conversion script as follows:

```bash
python convert_ckpt.py \
    --source_path ./BindWeave_14B/ \
    --target_path ./BindWeave_14B/ \
    --mode convert_to_mm
```
Run Subject-to-Video Generation
```bash
bash script/inference_s2v.sh
```

You can modify the corresponding paths in `BindWeave/configs/inference/inference_model_s2v.json`, where:
- BASE_IMG_DIR: root directory of the reference images.
- META_PATH: sample JSON file used during inference (a hypothetical entry is sketched below).
- OUT_DIR: output directory for inference results.
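For orientation, a META_PATH entry might look like the sketch below. The field names (`prompt`, `ref_images`) and structure are purely hypothetical; consult the sample JSON bundled with the repository for the actual schema.

```python
# Purely illustrative: the field names and structure are assumptions,
# not the repository's actual META_PATH schema.
import json

sample = [
    {
        "prompt": "A woman plays guitar on a beach at sunset.",
        "ref_images": ["woman.jpg", "guitar.jpg"],  # paths relative to BASE_IMG_DIR
    }
]

with open("my_meta.json", "w") as f:
    json.dump(sample, f, indent=2)
```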
Using the provided sample cases (i.e., the default path configuration), running bash script/inference_s2v.sh will produce the following generated results:
(Gallery: reference images paired with their generated 720P videos; the GIF previews are compressed.)
BindWeave achieves a total score of 57.61% on the OpenS2V-Eval benchmark, the highest among the systems compared below, demonstrating robust capabilities across multiple evaluation dimensions and competitive performance against leading open-source and commercial systems.
| Model | TotalScore↑ | AestheticScore↑ | MotionSmoothness↑ | MotionAmplitude↑ | FaceSim↑ | GmeScore↑ | NexusScore↑ | NaturalScore↑ |
|---|---|---|---|---|---|---|---|---|
| BindWeave | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% |
| VACE-14B | 57.55% | 47.21% | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% |
| Phantom-14B | 56.77% | 46.39% | 96.31% | 33.42% | 51.46% | 70.65% | 37.43% | 69.35% |
| Kling1.6(20250503) | 56.23% | 44.59% | 86.93% | 41.60% | 40.10% | 66.20% | 45.89% | 74.59% |
| Phantom-1.3B | 54.89% | 46.67% | 93.30% | 14.29% | 48.56% | 69.43% | 42.48% | 62.50% |
| MAGREF-480P | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.90% |
| SkyReels-A2-P14B | 52.25% | 39.41% | 87.93% | 25.60% | 45.95% | 64.54% | 43.75% | 60.32% |
| Vidu2.0(20250503) | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% |
| Pika2.1(20250503) | 51.88% | 46.88% | 87.06% | 24.71% | 30.38% | 69.19% | 45.40% | 63.32% |
| VACE-1.3B | 49.89% | 48.24% | 97.20% | 18.83% | 20.57% | 71.26% | 37.91% | 65.46% |
| VACE-P1.3B | 48.98% | 47.34% | 96.80% | 12.03% | 16.59% | 71.38% | 40.19% | 64.31% |
If you find BindWeave useful, please consider giving our repository a star (⭐) and citing our paper.
```bibtex
@article{li2025bindweave,
  title={BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration},
  author={Li, Zhaoyang and Qian, Dongjun and Su, Kai and Diao, Qishuai and Xia, Xiangyang and Liu, Chang and Yang, Wenfei and Zhang, Tianzhu and Yuan, Zehuan},
  journal={arXiv preprint arXiv:2510.00438},
  year={2025}
}
```