
BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

arXiv | Project Page

Zhaoyang Li 1,2, Dongjun Qian 2, Kai Su 2*, Qishuai Diao 2, Xiangyang Xia 2, Chang Liu 2, Wenfei Yang 1, Tianzhu Zhang 1*, Zehuan Yuan 2

1 University of Science and Technology of China · 2 ByteDance
* Corresponding Authors

🔥 News

  • Nov 08, 2025: 🙏 Special thanks to Kijai for adapting ComfyUI for BindWeave and providing an FP8-quantized Hugging Face model! Feel free to try them out.

  • Nov 04, 2025: 🔥 The BindWeave-Wan-14B model is now available on Hugging Face.

  • Nov 04, 2025: 🔥 Released code for model inference and training.

🗓️ Todo List

  • [x] Release inference code
  • [x] Release checkpoint of BindWeave_Wan_14B
  • [x] Release training code of BindWeave

📖 Overview

BindWeave is a unified subject-consistent video generation framework for single- and multi-subject prompts, built on an MLLM-DiT architecture that couples a pretrained multimodal large language model with a diffusion transformer. It achieves cross-modal integration via entity grounding and representation alignment, leveraging the MLLM to parse complex prompts and produce subject-aware hidden states that condition the DiT for high-fidelity generation.
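To make the conditioning path concrete, here is a minimal, hypothetical sketch of how subject-aware MLLM hidden states could be projected into a DiT's conditioning space. None of the class names or dimensions below come from the BindWeave codebase; they only illustrate the cross-modal integration described above.

```python
# Conceptual sketch only: module names and dimensions are hypothetical,
# not taken from the actual BindWeave implementation.
import torch
import torch.nn as nn

class CrossModalConditioner(nn.Module):
    """Projects subject-aware MLLM hidden states into the DiT conditioning space."""
    def __init__(self, mllm_dim: int, dit_dim: int):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, dit_dim)

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (batch, seq_len, mllm_dim) -> (batch, seq_len, dit_dim)
        return self.proj(mllm_hidden)

# The projected states would then be consumed by the DiT's cross-attention.
conditioner = CrossModalConditioner(mllm_dim=4096, dit_dim=1536)
hidden = torch.randn(1, 77, 4096)   # stand-in for MLLM output over prompt + reference images
cond = conditioner(hidden)          # (1, 77, 1536), fed to the DiT as conditioning
```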

🎥 Demo

BindWeave Video Generation Demo

Before running the inference code, you need to download the original Wan2.1 14B model. This is crucial because BindWeave depends on its components, such as the VAE and text encoder.

  1. Download the Pre-trained Model: First, use the Hugging Face CLI to download the model weights. The commands below place them in the ./pretrained_model/wanx/ directory.

    pip install "huggingface_hub[cli]"
    huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P-Diffusers --local-dir ./pretrained_model/wanx/Wan2.1-I2V-14B-720P-Diffusers
  2. Update the Configuration File: After the download completes, update the configuration file at configs/inference/inference_model_s2v.json so that the paths for the following components point to the directories you just downloaded (a sketch for doing this programmatically follows the list):

    • vae
    • tokenizer
    • text_encoder
    • image_encoder
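As an illustration, the snippet below patches all four paths at once. This is a minimal sketch: it assumes each component lives in a matching subfolder of the downloaded Wan2.1 directory and that the config stores these entries as top-level keys; adjust it to the file's actual layout.

```python
# Sketch: point the inference config at the downloaded Wan2.1 components.
# The JSON layout assumed here (top-level keys, matching subfolders) is an
# assumption; check configs/inference/inference_model_s2v.json before running.
import json

CONFIG = "configs/inference/inference_model_s2v.json"
WAN_DIR = "./pretrained_model/wanx/Wan2.1-I2V-14B-720P-Diffusers"

with open(CONFIG) as f:
    cfg = json.load(f)

for key in ("vae", "tokenizer", "text_encoder", "image_encoder"):
    cfg[key] = f"{WAN_DIR}/{key}"  # assumed: each component in a matching subfolder

with open(CONFIG, "w") as f:
    json.dump(cfg, f, indent=2)
```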

Then download the BindWeave model:

huggingface-cli download ByteDance/BindWeave --local-dir ./BindWeave_14B

Weight Conversion

After downloading the BindWeave model, you need to convert the transformer weights to the MM format. Run the conversion script as follows:

python convert_ckpt.py \
  --source_path ./BindWeave_14B/ \
  --target_path ./BindWeave_14B/ \
  --mode convert_to_mm

Run Subject-to-Video Generation

bash script/inference_s2v.sh

You can modify the corresponding paths in configs/inference/inference_model_s2v.json (a quick pre-flight check is sketched after this list), where:

  • BASE_IMG_DIR: Root directory of the reference images.
  • META_PATH: Sample JSON file used during inference.
  • OUT_DIR: Output directory for inference results.
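Before launching the script, it can help to verify these paths. The sketch below assumes the three entries are stored as top-level keys in the config JSON, mirroring the names above; adapt it if the file is structured differently.

```python
# Pre-flight check (a sketch): verify the paths referenced in
# configs/inference/inference_model_s2v.json before launching inference.
# Assumes BASE_IMG_DIR, META_PATH, and OUT_DIR are top-level JSON keys.
import json
import os
import pathlib

with open("configs/inference/inference_model_s2v.json") as f:
    cfg = json.load(f)

assert os.path.isdir(cfg["BASE_IMG_DIR"]), "reference-image root not found"
assert os.path.isfile(cfg["META_PATH"]), "sample meta JSON not found"
pathlib.Path(cfg["OUT_DIR"]).mkdir(parents=True, exist_ok=True)  # ensure output dir exists
print("Paths look good; now run: bash script/inference_s2v.sh")
```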

Using the provided sample cases (i.e., the default path configuration), running bash script/inference_s2v.sh will produce the following generated results:

[Demo table: reference images paired with their generated 720P videos. The GIF previews shown here are compressed.]

OpenS2V-Eval Performance 🏆

BindWeave scores 57.61 on the OpenS2V-Eval benchmark, the highest total score among the systems listed below, performing competitively across evaluation dimensions against leading open-source and commercial systems.

| Model | TotalScore↑ | AestheticScore↑ | MotionSmoothness↑ | MotionAmplitude↑ | FaceSim↑ | GmeScore↑ | NexusScore↑ | NaturalScore↑ |
|---|---|---|---|---|---|---|---|---|
| BindWeave | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% |
| VACE-14B | 57.55% | 47.21% | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% |
| Phantom-14B | 56.77% | 46.39% | 96.31% | 33.42% | 51.46% | 70.65% | 37.43% | 69.35% |
| Kling1.6(20250503) | 56.23% | 44.59% | 86.93% | 41.60% | 40.10% | 66.20% | 45.89% | 74.59% |
| Phantom-1.3B | 54.89% | 46.67% | 93.30% | 14.29% | 48.56% | 69.43% | 42.48% | 62.50% |
| MAGREF-480P | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.90% |
| SkyReels-A2-P14B | 52.25% | 39.41% | 87.93% | 25.60% | 45.95% | 64.54% | 43.75% | 60.32% |
| Vidu2.0(20250503) | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% |
| Pika2.1(20250503) | 51.88% | 46.88% | 87.06% | 24.71% | 30.38% | 69.19% | 45.40% | 63.32% |
| VACE-1.3B | 49.89% | 48.24% | 97.20% | 18.83% | 20.57% | 71.26% | 37.91% | 65.46% |
| VACE-P1.3B | 48.98% | 47.34% | 96.80% | 12.03% | 16.59% | 71.38% | 40.19% | 64.31% |

⭐ Citation

If you find BindWeave useful, please consider giving our repository a star (⭐) and citing our paper.

BibTeX

@article{li2025bindweave,
  title={BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration},
  author={Li, Zhaoyang and Qian, Dongjun and Su, Kai and Diao, Qishuai and Xia, Xiangyang and Liu, Chang and Yang, Wenfei and Zhang, Tianzhu and Yuan, Zehuan},
  journal={arXiv preprint arXiv:2510.00438},
  year={2025}
}
