ACWM-Phys

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

Project Page · arXiv · Dataset · Checkpoints

Haotian Xue†, Yipu Chen*, Liqian Ma*, Zelin Zhao, Lama Moukheiber, Yuchen Zhu, Yongxin Chen

Georgia Institute of Technology (†project lead, *equal contribution)

Overview

ACWM-Phys is a benchmark for evaluating action-conditioned video world models under diverse physical dynamics. It spans 8 environments across 4 physics regimes:

Category	Environments
Rigid-Body	Push Cube, Stack Cube
Deformable	Push Rope, Cloth Move
Particle	Push Sand, Pour Water
Kinematics	Robot Arm, Reacher

Each environment provides 1,000 training trajectories plus controlled in-distribution (InD) and out-of-distribution (OoD) test splits. We also provide ACWM-DiT, a latent diffusion transformer baseline trained with flow matching.

Installation

We use uv for fast, reproducible environment management.

git clone https://github.com/xavihart/ACWM-Phys.git
cd ACWM-Phys

# Create and activate a virtual environment
uv venv --python 3.10
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Flash Attention (recommended for speed)
uv pip install flash-attn --no-build-isolation

Dataset

Download the ACWM-Phys dataset from HuggingFace:

huggingface-cli download t1an/ACWM-Phys --repo-type dataset --local-dir ./data

Then set the data root:

export ACWM_DATA_ROOT=./data

Expected structure:

data/
├── rigid_dynamics/
│   ├── push_block/      {ind_train, ind_test, ood_test}/
│   └── stack_cube/
├── deformable/
│   ├── push_rope/
│   └── clothmove/
├── particle/
│   ├── push_sand/
│   └── pour_water/
└── kinematics/
    ├── robot_arm_64/
    └── reacher/
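As a convenience, the layout above can be turned into a small path helper. This is a minimal sketch, not part of the repository: `ENV_DIRS`, `split_dir`, and `missing_split_dirs` are hypothetical names, and the code assumes every environment directory contains the same three split subdirectories that the tree shows for push_block. Note that the keys are the on-disk directory names (e.g. push_block), which can differ from the display names in the overview table.

```python
import os
from pathlib import Path

# On-disk directory names copied from the expected structure above.
ENV_DIRS = {
    "push_block": "rigid_dynamics/push_block",
    "stack_cube": "rigid_dynamics/stack_cube",
    "push_rope": "deformable/push_rope",
    "clothmove": "deformable/clothmove",
    "push_sand": "particle/push_sand",
    "pour_water": "particle/pour_water",
    "robot_arm_64": "kinematics/robot_arm_64",
    "reacher": "kinematics/reacher",
}

# ASSUMPTION: all environments share the splits listed for push_block.
SPLITS = ("ind_train", "ind_test", "ood_test")


def split_dir(env_dir, split, data_root=None):
    """Return the path of one split, e.g. data/rigid_dynamics/push_block/ind_train."""
    if env_dir not in ENV_DIRS:
        raise KeyError(f"unknown environment directory: {env_dir}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split}")
    if data_root is None:
        # Fall back to the ACWM_DATA_ROOT variable set above, then to ./data.
        data_root = os.environ.get("ACWM_DATA_ROOT", "./data")
    return Path(data_root) / ENV_DIRS[env_dir] / split


def missing_split_dirs(data_root=None):
    """List every expected split directory that does not exist on disk."""
    return [
        split_dir(env_dir, split, data_root)
        for env_dir in ENV_DIRS
        for split in SPLITS
        if not split_dir(env_dir, split, data_root).is_dir()
    ]
```

Calling `missing_split_dirs()` after the download gives a quick check that the dataset landed where the rest of the tooling expects it.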

Dataset Format

Each split directory (e.g. push_block/ind_train/) contains:

episode_{i}.mp4 — RGB video at 10 fps, 240×240 (240×400 for Push Sand)
metadata.pt — serialized list of episode dicts (load with torch.load)

Each entry in metadata.pt has:

Field	Type	Description
`video_path`	`str`	Filename relative to the split dir, e.g. `episode_0.mp4`
`actions`	`FloatTensor [T, action_dim]`	Per-step action sequence
`length`	`int`	Number of frames T
`seed`	`int`	Random seed used during simulation
`episode_idx`	`int`	Global episode index (present only in some environments)

Example:

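import torch

# weights_only=False unpickles arbitrary Python objects,
# so only load metadata files from a source you trust.
metadata = torch.load("data/rigid_dynamics/push_block/ind_train/metadata.pt", weights_only=False)
entry = metadata[0]  # dict describing the first episode
# entry["video_path"]  → "episode_0.mp4"
# entry["actions"]     → Tensor of shape [T, 2]
# entry["length"]      → 16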
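The field table above implies a few consistency checks that are cheap to run before training. The following is a minimal sketch under stated assumptions: `validate_entry` is a hypothetical helper, not a repository function, and it takes the table at face value (the first dimension of `actions` equals `length`). It only reads `.shape`, so it works on a real tensor as well as on the stand-in object used in the demo.

```python
from types import SimpleNamespace

# Fields the table lists for every entry; episode_idx is optional.
REQUIRED_FIELDS = ("video_path", "actions", "length", "seed")


def validate_entry(entry, action_dim=None):
    """Return a list of human-readable problems; an empty list means the entry looks fine."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in entry]
    if problems:
        return problems

    shape = tuple(entry["actions"].shape)
    if len(shape) != 2:
        problems.append(f"actions should be 2-D [T, action_dim], got shape {shape}")
    else:
        # ASSUMPTION: one action row per frame, as the shared symbol T suggests.
        if shape[0] != entry["length"]:
            problems.append(f"actions has {shape[0]} steps but length is {entry['length']}")
        if action_dim is not None and shape[1] != action_dim:
            problems.append(f"expected action_dim {action_dim}, got {shape[1]}")

    if not str(entry["video_path"]).endswith(".mp4"):
        problems.append(f"unexpected video file name: {entry['video_path']}")
    return problems


# Demo with a stand-in for a tensor (any object exposing .shape works).
demo_entry = {
    "video_path": "episode_0.mp4",
    "actions": SimpleNamespace(shape=(16, 2)),
    "length": 16,
    "seed": 0,
}
demo_problems = validate_entry(demo_entry, action_dim=2)
```

With the real dataset, each element of the `metadata` list loaded in the example above can be passed to `validate_entry` directly.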

Checkpoints

Download the pretrained DiT-S checkpoints (100k steps) and the Wan 2.1 VAE:

huggingface-cli download t1an/ACWM-Phys-checkpoints --local-dir ./checkpoints

Set the VAE path:

export WAN_VAE_PATH=./checkpoints/Wan2.1_VAE.pth

The env configs in configs/envs/ also reference WAN_VAE_PATH via the vae_config field.

Released Checkpoints

All checkpoints are DiT-S (~200M parameters), trained for 100k steps with flow matching.

Environment	Category	Action Dim	Resolution	Checkpoint
Push Cube	Rigid-Body	2	240×240	link
Stack Cube	Rigid-Body	7	240×240	link
Push Rope	Deformable	2	240×240	link
Cloth Move	Deformable	3	240×240	link
Push Sand	Particle	7	240×400	link
Pour Water	Particle	4	240×240	link
Robot Arm	Kinematics	7	240×240	link
Reacher	Kinematics	2	240×240	link

Evaluation

Evaluate a single environment:

python eval.py --env push_cube --steps 50 --split both --save_videos

Evaluate all 8 environments:

bash scripts/eval_all.sh --save_videos

Results are written to results/results.md. With --save_videos, videos are saved under results/{env}/steps_50/{split}/sample_{i}/video.mp4, with the ground truth (GT) on the left and the prediction on the right.

Key arguments:

Argument	Default	Description
`--env`	required	Environment name
`--steps`	50	Denoising steps
`--split`	both	`ind_test`, `ood_test`, or `both`
`--ckpt`	auto	Override checkpoint path
`--cfg`	auto	Override config path
`--save_videos`	off	Save GT|Pred side-by-side videos
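When evaluation is driven from a script rather than the shell, the arguments in the table can be assembled programmatically. This is a minimal sketch: `build_eval_command` is a hypothetical helper, it uses only the flags documented above, and it does not validate the environment name because the full list of accepted `--env` values is not given here.

```python
import shlex

# Values the table documents for --split.
VALID_SPLITS = ("ind_test", "ood_test", "both")


def build_eval_command(env, steps=50, split="both", ckpt=None, cfg=None, save_videos=False):
    """Build the argv list for eval.py from the documented arguments."""
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
    if steps < 1:
        raise ValueError("steps must be a positive integer")

    cmd = ["python", "eval.py", "--env", env, "--steps", str(steps), "--split", split]
    # --ckpt and --cfg default to "auto", so they are only added when overridden.
    if ckpt is not None:
        cmd += ["--ckpt", str(ckpt)]
    if cfg is not None:
        cmd += ["--cfg", str(cfg)]
    if save_videos:
        cmd.append("--save_videos")
    return cmd


# Reproduces the single-environment command shown earlier in this section.
example_cmd = build_eval_command("push_cube", save_videos=True)
example_line = shlex.join(example_cmd)
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root.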

Training

Train DiT-S on Push Cube (single GPU):

python train.py --config configs/envs/push_cube.yaml

Multi-GPU (4 GPUs):

torchrun --nproc_per_node=4 train.py --config configs/envs/push_cube.yaml

SLURM example:

sbatch scripts/train_slurm.sh push_cube

Training hyperparameters are in configs/envs/{env}.yaml. Model size (S/M/L) is set via the model_type field in the config (e.g. model_type: dit_s).

Model Architecture

ACWM-DiT takes the first video frame and the full action sequence as input and predicts the complete future trajectory:

Causal VAE (Wan 2.1) — encodes video into 16-ch latent tokens at H/8×W/8, 4× temporal compression
DiT with flow matching — denoises the full latent trajectory; supports AdaLN and cross-attention action conditioning
Action conditioning — injected via AdaLN (default) or cross-attention (better for high-dim actions)
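The compression factors in the list above determine the size of the latent tensor the DiT operates on. The sketch below works that arithmetic out; `latent_shape` is a hypothetical helper, and the temporal rule (first frame encoded on its own, remaining frames grouped in fours, rounding up) is an assumption about the causal VAE rather than something stated here.

```python
import math

LATENT_CHANNELS = 16   # "16-ch latent tokens"
SPATIAL_FACTOR = 8     # H/8 x W/8
TEMPORAL_FACTOR = 4    # 4x temporal compression


def latent_shape(num_frames, height, width):
    """Return (channels, latent_frames, latent_height, latent_width)."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if height % SPATIAL_FACTOR or width % SPATIAL_FACTOR:
        raise ValueError("height and width must be divisible by 8")

    # ASSUMPTION: causal convention, first frame alone plus groups of 4 frames.
    latent_frames = 1 + math.ceil((num_frames - 1) / TEMPORAL_FACTOR)
    return (
        LATENT_CHANNELS,
        latent_frames,
        height // SPATIAL_FACTOR,
        width // SPATIAL_FACTOR,
    )


# A 17-frame 240x240 clip under these assumptions.
shape_square = latent_shape(17, 240, 240)
```

Under these assumptions a 17-frame 240×240 clip maps to a 16×5×30×30 latent, so the transformer sees far fewer positions than there are pixels.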

Three model sizes: DiT-S (~200M), DiT-M (~600M), DiT-L (~800M).
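For readers unfamiliar with flow matching, the idea behind the training target and the denoising steps can be illustrated on plain lists of numbers. This is a minimal sketch, not the repository's implementation: it assumes the common linear-interpolation path with t = 0 at noise and t = 1 at data (the actual parameterization may differ), and it replaces the DiT with an oracle velocity function so the example runs without any model.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Regression target for the network on the linear path: data - noise."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(velocity_fn, noise, steps=50):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
data = [0.2, -1.0, 0.7]                     # stand-in for a latent trajectory
noise = [rng.gauss(0.0, 1.0) for _ in data]


def oracle_velocity(x, t):
    # Stand-in for the trained model: returns the exact target velocity.
    return target_velocity(noise, data)


sample = euler_sample(oracle_velocity, noise, steps=50)
```

With a perfect velocity field the sampler lands on the data point regardless of the step count; a learned field is only approximate, which is why the number of denoising steps matters in practice.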

Metrics

Metric	Description
MSE	Mean squared error on pixel values in [0,1]
M-MSE	Motion-weighted MSE (floor 0.01; focuses on moving regions)
PSNR	Peak signal-to-noise ratio (dB)
SSIM	Structural similarity index
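The pixel-level metrics in the table can be written out in a few lines. The sketch below is illustrative only: MSE and PSNR follow their standard definitions for values in [0,1], while `motion_weighted_mse` is one plausible reading of M-MSE (per-pixel weight equal to the ground-truth frame difference, clamped below at the 0.01 floor) and is an assumption, not the benchmark's confirmed formula. SSIM is omitted because it needs windowed statistics. Frames are represented as flat lists of floats to keep the example dependency-free.

```python
import math


def mse(pred, gt):
    """Mean squared error between two equally sized flat lists of pixel values."""
    return sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt)


def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB; infinite when the inputs are identical."""
    err = mse(pred, gt)
    if err == 0:
        return math.inf
    return 10 * math.log10(max_val ** 2 / err)


def motion_weighted_mse(pred_frames, gt_frames, floor=0.01):
    """ASSUMED definition: squared error weighted by ground-truth motion.

    Each pixel's weight is max(|gt_t - gt_{t-1}|, floor), so static regions
    still count a little and moving regions dominate.
    """
    weighted_error = 0.0
    total_weight = 0.0
    for t in range(1, len(gt_frames)):
        for p, g, g_prev in zip(pred_frames[t], gt_frames[t], gt_frames[t - 1]):
            weight = max(abs(g - g_prev), floor)
            weighted_error += weight * (p - g) ** 2
            total_weight += weight
    return weighted_error / total_weight


# Two-pixel, two-frame toy clip: pixel 0 moves, pixel 1 is static.
gt_clip = [[0.0, 0.0], [1.0, 0.0]]
pred_clip = [[0.0, 0.0], [1.0, 0.5]]   # error only on the static pixel
plain = mse(pred_clip[1], gt_clip[1])
weighted = motion_weighted_mse(pred_clip, gt_clip)
```

In the toy clip the error sits on a static pixel, so the motion-weighted score is much smaller than the plain MSE; an error on the moving pixel would be penalized far more heavily.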

Citation

@article{xue2026acwm,
  title={ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models},
  author={Xue, Haotian and Chen, Yipu and Ma, Liqian and Zhao, Zelin and Moukheiber, Lama and Zhu, Yuchen and Chen, Yongxin},
  journal={arXiv preprint arXiv:2605.08567},
  year={2026}
}
