A full-stack framework and tutorial for newcomers, rather than a specific model.
minWM is our contribution to the world-model community: a full-stack open-source framework that walks you end-to-end through turning a bidirectional T2V foundation model into an action-conditioned video world model — with example data, runnable scripts, Claude skills capturing our hands-on experience, and onboarding knowledge for newcomers. We hope more researchers and developers join us in growing the community together.
v2_web.mp4
- 2026-05-29 🚀 We release the technical report.
- 2026-05-17 🚀 We release minWM — the first full-stack open-source world model framework.
- 🎬 Demo
- 🔥 News
- ✨ Why minWM?
- 🛠️ Installation
- 🧱 Model Checkpoints
- 🚀 Quick Start
- ⚙️ Data & Training & Reproduction
- 📚 Citation
- Contact
- 🙏 Acknowledgements
The complete data → training → inference pipeline is open-sourced; every stage exposes input/output checkpoints so you can stop, swap, or fork anywhere.
1.1 Data. We walk you through how to construct training-ready datasets paired with camera poses, and the full data processing pipeline that turns them into latents.
1.2 Training. Including FSDP + sequence parallelism, single-/multi-node training, and the full distillation pipeline from a bidirectional diffusion model to a 4-step AR student:
Phase 1 Phase 2 — Distillation to Causal Few-Step
───────────────────── ────────────────────────────────────────────
Bidirectional SFT ──▶ Stage 1 Teacher Forcing AR Diffusion
Stage 2a Causal ODE (proposed in [Causal Forcing](https://arxiv.org/abs/2602.02214))
Stage 2b Causal CD (proposed in [Causal Forcing++](https://arxiv.org/abs/2605.15141))
Stage 3 Asymmetric DMD with Self Rollout
▼
4-step real-time
1.3 Inference.
- ✅ 4-step DMD inference for HY Action2V / HY TI2V / Wan Action2V, multi-GPU sequence parallelism, camera-trajectory control via pose strings (
"a*4,w*8,s*7") or JSON files - 🚧 Inference acceleration [TBD]
minWM supports two paths to arriving at an interactive world model.
The HunyuanVideo 1.5 and Wan 2.1 lines walk through the full 4-stage pipeline — starting from a bidirectional T2V foundation model and ending at a 4-step autoregressive world model.
| Backbone | Architecture | Params | Training | Inference |
|---|---|---|---|---|
| Wan 2.1 | Cross-attention + DiT | 1.3 B | ✅ all 4 stages | ✅ 4-step DMD |
| HunyuanVideo 1.5 | MMDiT | 8 B | ✅ all 4 stages | ✅ 4-step DMD |
Both lines share the same trainer / loss / dataset abstractions, so adding a third backbone is structurally a wrapper-and-config exercise.
The forthcoming worldplay-finetune entry will let you start from an already-trained video world model and adapt it to new conditions, scenes, or resolutions — without rerunning the 4-stage pipeline from scratch.
We aim to support both multiple condition types and multiple injection methods, mixable along either axis.
- ✅ Camera pose
- 🚧 Human pose [TBD]
- ✅ ProPE
- 🚧 Latent concat [TBD]
- 🚧 Cross-attention [TBD]
We are packaging our project experience across the CF / CF++ pipeline as Claude skills, so that an LLM assistant can help users debug failures and integrate new models without reverse-engineering the whole repo.
- 🐛
debug-world-model— collected failure modes from the training pipeline (loss NaN, frame-to-frame jitter, camera drift, memory attenuation, distillation collapse, …). Claude diagnoses likely root causes from your symptoms instead of guessing. - 🔌
integrate-new-backbone— step-by-step recipe for plugging a new video DiT into minWM, grounded in the HunyuanVideo and Wan reference integrations — e.g. "look at how HY does teacher forcing here, do the same for your model there".
onboarding-world-model
A third Claude skill aimed at researchers entering the world-model space for the first time. Two parts:
- 🎓 Foundations — the minimal background to follow the pipeline: Teacher Forcing for AR diffusion training and Causal Forcing & Causal Forcing++ for AR diffusion distillation.
- 🪤 Pitfalls — the non-obvious mistakes we hit while building minWM, distilled so you don't repeat them.
Intended audience: graduate students, independent researchers, and junior labs that want to enter the world-model space without spending three months reverse-engineering existing repos.
conda create -n minwm python=3.10 -y
conda activate minwm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
export PYTHONPATH="$PWD/HY15:$PWD/Wan21:$PWD/shared:$PYTHONPATH"🧱 Model Checkpoints (Click to expand)
All weights live under ./ckpts/ after download.
| Checkpoint | Backbone | Stage | Use case | Download |
|---|---|---|---|---|
Wan21/Action2V/{bidirectional,ar_diffusion_tf,causal_ode,causal_cd,dmd} |
Wan 2.1 | Same 4 stages | Wan pipeline | HF |
HunyuanVideo-1.5 (base) |
HY 1.5 | — | Required by both HY pipelines | HF |
Wan2.1-T2V-1.3B (base) |
Wan 2.1 | — | Required by Wan pipeline | HF |
HY15/Action2V/bidirectional |
HY 1.5 | Phase 1 SFT | Starting point for HY Action2V Phase 2 | HF |
HY15/Action2V/ar_diffusion_tf |
HY 1.5 | Phase 2 Stage 1 | Teacher Forcing AR diffusion | HF |
HY15/Action2V/causal_ode |
HY 1.5 | Phase 2 Stage 2a (proposed in Causal Forcing) | DMD initialization | HF |
HY15/Action2V/causal_cd |
HY 1.5 | Phase 2 Stage 2b (proposed in Causal Forcing++) | DMD initialization | HF |
HY15/Action2V/dmd |
HY 1.5 | Phase 2 Stage 3 | 4-step real-time inference | HF |
HY15/TI2V/{bidirectional,ar_diffusion_tf,causal_ode,causal_cd,dmd} |
HY 1.5 | Same 4 stages, TI2V variant | TI2V pipeline | HF |
The fastest path: install → download three DMD checkpoints → run three demo commands. Full reproduction (all 4 training stages × 3 model lines) is in § Data & Training & Reproduction.
# Wan base (T2V-1.3B)
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./ckpts/Wan2.1-T2V-1.3B
# Code hardcodes the load path; create a symlink.
mkdir -p Wan21/wan_models
ln -s "$(realpath ./ckpts/Wan2.1-T2V-1.3B)" Wan21/wan_models/Wan2.1-T2V-1.3B
# HY base + text/vision encoders (required by HY pipelines)
hf download tencent/HunyuanVideo-1.5 --local-dir ./ckpts/HunyuanVideo-1.5 \
--include "vae/*" "scheduler/*" "transformer/480p_i2v/*"
hf download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ./ckpts/HunyuanVideo-1.5/text_encoder/llm
hf download google/byt5-small --local-dir ./ckpts/HunyuanVideo-1.5/text_encoder/byt5-small
modelscope download --model AI-ModelScope/Glyph-SDXL-v2 \
--local_dir ./ckpts/HunyuanVideo-1.5/text_encoder/Glyph-SDXL-v2
hf download black-forest-labs/FLUX.1-Redux-dev \
--local-dir ./ckpts/HunyuanVideo-1.5/vision_encoder/siglip --token <your_hf_token>
# 4-step DMD checkpoints
## Wan Action2V (DMD, 4-step)
hf download MIN-Lab/minWM --local-dir ./ckpts \
--include "Wan21/Action2V/dmd/*"
## HY Action2V (DMD, 4-step, worldplay teacher)
hf download MIN-Lab/minWM --local-dir ./ckpts \
--include "HY15/Action2V/dmd/*"
# HY Action2V (DMD, 4-step, our bidirectional teacher)
# hf download MIN-Lab/minWM --local-dir ./ckpts \
# --include "HY15/Action2V/dmd_ourbi/*"
## HY TI2V (DMD, 4-step)
hf download MIN-Lab/minWM --local-dir ./ckpts \
--include "HY15/TI2V/dmd/*"# 2.1 Wan Action2V (4-step DMD, camera control)
OUTPUT_FOLDER=./outputs/quickstart_wan_action2v \
TRAJECTORY_PATH="Wan21/prompts/trajectories.txt" \
bash Wan21/scripts/inference/run_infer_causal_camera.sh
# 2.2 HY Action2V (4-step DMD, camera control)
TRANSFORMER_DIR=./ckpts/HY15/Action2V/dmd \
OUTPUT_DIR=./outputs/quickstart_hy_action2v \
bash HY15/scripts/inference/run_infer_causal_camera.sh
# 2.3 HY TI2V (4-step DMD)
TRANSFORMER_DIR=./ckpts/HY15/TI2V/dmd \
OUTPUT_DIR=./outputs/quickstart_hy_ti2v \
bash HY15/scripts/inference/run_infer_causal.sh
Camera control. For HY Action2V, trajectories are read per-sample from
assets/example.jsonunder the"trajectory"field. Format:w/s/a/dkeys with*Nrepeats; comma-separated segments — e.g."a*4,w*8,s*7".
Three model lines × two phases × four stages, each documented as (1) Model download → (2) Data preparation → (3) Training script → (4) Validation. Full reproduction guides are split by backbone:
- 📗
training_wan.md- Wan Action2V (Wan 2.1 backbone)
- 📘
training_hunyuan.md— HY Action2V (HY1.5-8B backbone)- HY TI2V (HY1.5-8B backbone)
If minWM helps your research, please cite:
# ICML 2026
@article{zhu2026causal,
title={Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation},
author={Zhu, Hongzhou and Zhao, Min and He, Guande and Su, Hang and Li, Chongxuan and Zhu, Jun},
journal={arXiv preprint arXiv:2602.02214},
year={2026}
}
# Technical Report
@article{zhao2026causal,
title={Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation},
author={Zhao, Min and Zhu, Hongzhou and Zheng, Kaiwen and Zhou, Zihan and Yan, Bokai and Li, Xinyuan and Yang, Xiao and Li, Chongxuan and Zhu, Jun},
journal={arXiv preprint arXiv:2605.15141},
year={2026}
}
# Technical Report
@article{zhao2026minwm,
title={minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models},
author={Zhao, Min and Zhu, Hongzhou and Yan, Bokai and Zhou, Zihan and Chen, Yimin and Sun, Wenqiang and Zheng, Kaiwen and He, Guande and Yang, Xiao and Li, Chongxuan and others},
journal={arXiv preprint arXiv:2605.30263},
year={2026}
}
For questions, suggestions, or collaboration, please open a GitHub issue or contact: gracezhao1997@gmail.com.
minWM stands on the shoulders of giants. We thank the authors and maintainers of HunyuanVideo 1.5, HY-WorldPlay, Wan 2.1, Causal-Forcing, and FastVideo for their open-source contributions, which made this framework possible.