
BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

arXiv | Project Page

Zhaoyang Li 1,2, Dongjun Qian 2, Kai Su 2*, Qishuai Diao 2, Xiangyang Xia 2, Chang Liu 2, Wenfei Yang 1, Tianzhu Zhang 1*, Zehuan Yuan 2

1 University of Science and Technology of China · 2 ByteDance
* Corresponding Authors

🔥 News

  • Nov 08, 2025: 🙏 Special thanks to Kijai for adapting ComfyUI for BindWeave and providing an FP8-quantized Hugging Face model! Feel free to try them out.

  • Nov 04, 2025: 🔥 The BindWeave-Wan-14B model is now available on Hugging Face.

  • Nov 04, 2025: 🔥 Released code for model inference and training.

🗓️ Todo List

  • [x] Release inference code
  • [x] Release checkpoint of BindWeave_Wan_14B
  • [x] Release training code of BindWeave

📖 Overview

BindWeave is a unified subject-consistent video generation framework for single- and multi-subject prompts, built on an MLLM-DiT architecture that couples a pretrained multimodal large language model with a diffusion transformer. It achieves cross-modal integration via entity grounding and representation alignment, leveraging the MLLM to parse complex prompts and produce subject-aware hidden states that condition the DiT for high-fidelity generation.
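To make the conditioning path concrete, here is a minimal, hypothetical sketch of how subject-aware MLLM hidden states could be projected into a DiT's conditioning space. None of the class names or dimensions below come from the BindWeave codebase; they only illustrate the cross-modal integration described above.

```python
# Conceptual sketch only: module names and dimensions are hypothetical,
# not taken from the actual BindWeave implementation.
import torch
import torch.nn as nn

class CrossModalConditioner(nn.Module):
    """Projects subject-aware MLLM hidden states into the DiT conditioning space."""
    def __init__(self, mllm_dim: int, dit_dim: int):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, dit_dim)

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (batch, seq_len, mllm_dim) -> (batch, seq_len, dit_dim)
        return self.proj(mllm_hidden)

# The projected states would then be consumed by the DiT's cross-attention.
conditioner = CrossModalConditioner(mllm_dim=4096, dit_dim=1536)
hidden = torch.randn(1, 77, 4096)   # stand-in for MLLM output over prompt + reference images
cond = conditioner(hidden)          # (1, 77, 1536), fed to the DiT as conditioning
```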

🎥 Demo

BindWeave Video Generation Demo

Before running the inference code, you need to download the original Wan2.1 14B model. This is crucial because BindWeave depends on its components, such as the VAE and text encoder.

  1. Download the Pre-trained Model: First, use the Hugging Face CLI to download the model weights. The commands below place them in the ./pretrained_model/wanx/ directory.

    pip install "huggingface_hub[cli]"
    huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P-Diffusers --local-dir ./pretrained_model/wanx/Wan2.1-I2V-14B-720P-Diffusers
  2. Update the Configuration File: After the download completes, update the configuration file at configs/inference/inference_model_s2v.json so that the paths for the following components point to the directories you just downloaded (a sketch for doing this programmatically follows the list):

    • vae
    • tokenizer
    • text_encoder
    • image_encoder
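As an illustration, the snippet below patches all four paths at once. This is a minimal sketch: it assumes each component lives in a matching subfolder of the downloaded Wan2.1 directory and that the config stores these entries as top-level keys; adjust it to the file's actual layout.

```python
# Sketch: point the inference config at the downloaded Wan2.1 components.
# The JSON layout assumed here (top-level keys, matching subfolders) is an
# assumption; check configs/inference/inference_model_s2v.json before running.
import json

CONFIG = "configs/inference/inference_model_s2v.json"
WAN_DIR = "./pretrained_model/wanx/Wan2.1-I2V-14B-720P-Diffusers"

with open(CONFIG) as f:
    cfg = json.load(f)

for key in ("vae", "tokenizer", "text_encoder", "image_encoder"):
    cfg[key] = f"{WAN_DIR}/{key}"  # assumed: each component in a matching subfolder

with open(CONFIG, "w") as f:
    json.dump(cfg, f, indent=2)
```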

Then download the BindWeave model:

huggingface-cli download ByteDance/BindWeave --local-dir ./BindWeave_14B

Weight Conversion

After downloading the BindWeave model, you need to convert the transformer weights to the MM format. Run the conversion script as follows:

python convert_ckpt.py \
  --source_path ./BindWeave_14B/ \
  --target_path ./BindWeave_14B/ \
  --mode convert_to_mm

Run Subject-to-Video Generation

bash script/inference_s2v.sh

You can modify the corresponding paths in configs/inference/inference_model_s2v.json (a quick pre-flight check is sketched after this list), where:

  • BASE_IMG_DIR: Root directory of the reference images.
  • META_PATH: Sample JSON file used during inference.
  • OUT_DIR: Output directory for inference results.
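Before launching the script, it can help to verify these paths. The sketch below assumes the three entries are stored as top-level keys in the config JSON, mirroring the names above; adapt it if the file is structured differently.

```python
# Pre-flight check (a sketch): verify the paths referenced in
# configs/inference/inference_model_s2v.json before launching inference.
# Assumes BASE_IMG_DIR, META_PATH, and OUT_DIR are top-level JSON keys.
import json
import os
import pathlib

with open("configs/inference/inference_model_s2v.json") as f:
    cfg = json.load(f)

assert os.path.isdir(cfg["BASE_IMG_DIR"]), "reference-image root not found"
assert os.path.isfile(cfg["META_PATH"]), "sample meta JSON not found"
pathlib.Path(cfg["OUT_DIR"]).mkdir(parents=True, exist_ok=True)  # ensure output dir exists
print("Paths look good; now run: bash script/inference_s2v.sh")
```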

Using the provided sample cases (i.e., the default path configuration), running bash script/inference_s2v.sh will produce the following generated results:

[Demo table: reference images paired with their generated 720P videos. The GIF previews shown here are compressed.]

OpenS2V-Eval Performance 🏆

BindWeave scores 57.61 on the OpenS2V-Eval benchmark, the highest total score among the systems listed below, performing competitively across evaluation dimensions against leading open-source and commercial systems.

| Model | TotalScore↑ | AestheticScore↑ | MotionSmoothness↑ | MotionAmplitude↑ | FaceSim↑ | GmeScore↑ | NexusScore↑ | NaturalScore↑ |
|---|---|---|---|---|---|---|---|---|
| BindWeave | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% |
| VACE-14B | 57.55% | 47.21% | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% |
| Phantom-14B | 56.77% | 46.39% | 96.31% | 33.42% | 51.46% | 70.65% | 37.43% | 69.35% |
| Kling1.6(20250503) | 56.23% | 44.59% | 86.93% | 41.60% | 40.10% | 66.20% | 45.89% | 74.59% |
| Phantom-1.3B | 54.89% | 46.67% | 93.30% | 14.29% | 48.56% | 69.43% | 42.48% | 62.50% |
| MAGREF-480P | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.90% |
| SkyReels-A2-P14B | 52.25% | 39.41% | 87.93% | 25.60% | 45.95% | 64.54% | 43.75% | 60.32% |
| Vidu2.0(20250503) | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% |
| Pika2.1(20250503) | 51.88% | 46.88% | 87.06% | 24.71% | 30.38% | 69.19% | 45.40% | 63.32% |
| VACE-1.3B | 49.89% | 48.24% | 97.20% | 18.83% | 20.57% | 71.26% | 37.91% | 65.46% |
| VACE-P1.3B | 48.98% | 47.34% | 96.80% | 12.03% | 16.59% | 71.38% | 40.19% | 64.31% |

⭐ Citation

If you find BindWeave useful, please consider giving our repository a star (⭐) and citing our paper.

BibTeX

@article{li2025bindweave,
  title={BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration},
  author={Li, Zhaoyang and Qian, Dongjun and Su, Kai and Diao, Qishuai and Xia, Xiangyang and Liu, Chang and Yang, Wenfei and Zhang, Tianzhu and Yuan, Zehuan},
  journal={arXiv preprint arXiv:2510.00438},
  year={2025}
}
