iMac: Translating Actions into Motion and Contact Images for Embodied World Models

Project Page | Data

We present iMaC, an action-conditioned video generation model (world model) for embodied evaluation. We convert action controls into representative motion and contact images to achieve strong action following and dynamics modeling.

Demo

World model as evaluator:

Repository Layout

This repository contains the training, preprocessing, and inference code for three iMaC workflows:

RND-mix stage-one training and evaluation.
RND-mix stage-two training and evaluation with replay and 3D conditions.
WorldArena preprocessing, 3D-condition training, and validation inference.

This README is the entry point for reproducing results with the released code.

iMac/
  configs/       Training configurations
  models/        Wan transformer variants
  pipelines/     Diffusion inference pipelines
  trainer/       RND-mix and WorldArena trainers
  transforms/    Dataset transforms
  model_config.py  Pretrained model paths
scripts/
  launch_train.py
  download_pretrained_models.py
  download_gigabrain_policy.py
  pack_training_data.py
  inference_rnd_mix_stage_one.py
  inference_rnd_mix_stage_two.py
  inference_worldarena_val.py
  inference_worldarena_test.py
  process_depth_da3_worldarena.py
  process_depth_da3_worldarena_testdata.py
  process_3d_condition.py
  process_3d_condition_worldarena_testdata.py
simulator/
  script/run_simulator_server.py
asserts/         Default prompt embedding
third_party/
  giga-train/
  giga-datasets/
  giga-models/

Installation

The code targets Python 3.11 and CUDA-capable PyTorch.

conda create -n giga_torch python=3.11.10
conda activate giga_torch

pip install -e third_party/giga-train
pip install -e third_party/giga-datasets
pip install -e third_party/giga-models

Install the remaining runtime dependencies used by your workflow, including PyTorch, Diffusers, Accelerate, Decord, OpenCV, ImageIO, HDF5, Blosc, SciPy, Trimesh, Robotics Toolbox, and the depth_anything_3 package.
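A quick way to confirm these packages are visible to the interpreter is a small import check. The helper below is a hypothetical sketch, not part of the repository, and the import names it uses (`cv2` for OpenCV, `h5py` for HDF5, `blosc` for Blosc, `roboticstoolbox` for Robotics Toolbox) are assumptions about which distributions you installed.

```python
import importlib.util

# Assumed import names for the dependencies listed above; adjust to your installs.
REQUIRED_MODULES = [
    "torch", "diffusers", "accelerate", "decord", "cv2", "imageio",
    "h5py", "blosc", "scipy", "trimesh", "roboticstoolbox", "depth_anything_3",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules can be located.")
```

`find_spec` only locates a module without importing it, so the check is fast but will not catch packages that fail at import time (for example, a CUDA mismatch in PyTorch).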

For online RND evaluation, install RoboTwin 2.0 separately and configure the simulator assets required by simulator/script/run_simulator_server.py.

Path Configuration

The released configs do not hard-code any user-specific absolute paths. Set all paths with environment variables:

export CVPR2026_WM_DATA_DIR=/path/to/CVPR-2026-WorldModel-Track-Dataset
export CVPR2026_WM_MODEL_CACHE=/path/to/pretrained_models
export CVPR2026_WM_OUTPUT_DIR=/path/to/outputs

export DA3_MODEL_PATH=/path/to/DA3/model
export PIPER_URDF_PATH=/path/to/piper.urdf
export PIPER_GRIPPER_MESH_DIR=/path/to/piper/meshes

Optional variables:

| Variable | Purpose | Default |
| --- | --- | --- |
| `CVPR2026_WM_PROMPT_EMBEDDING` | Prompt embedding used by WorldArena | `asserts/default_prompt_embeds.pth` |
| `RND_MIX_STAGE_ONE_OUTPUT_DIR` | Stage-one experiment directory | `$CVPR2026_WM_OUTPUT_DIR/rnd_mix_stage_one_alltask` |
| `RND_MIX_STAGE_TWO_OUTPUT_DIR` | Stage-two experiment directory | `$CVPR2026_WM_OUTPUT_DIR/rnd_mix_stage_two_alltask` |
| `RND_MIX_STAGE_ONE_CHECKPOINT` | Stage-one transformer used to initialize stage two | None; required for stage two |
| `WORLDARENA_DATA_DIR` | WorldArena root containing `train/` and `val/` | `data/WorldArena` |
| `WORLDARENA_OUTPUT_DIR` | WorldArena experiment directory | `$CVPR2026_WM_OUTPUT_DIR/worldarena_3d` |
| `WORLDARENA_URDF_PATH` | WorldArena robot URDF | Falls back to `PIPER_URDF_PATH` |
| `WORLDARENA_GRIPPER_MESH_DIR` | WorldArena robot meshes | Falls back to `PIPER_GRIPPER_MESH_DIR` |
| `TORCH_MP_SHARING_STRATEGY` | PyTorch multiprocessing sharing strategy | `file_descriptor` |

See .env.example for a complete template. Model paths in iMac/model_config.py are resolved under CVPR2026_WM_MODEL_CACHE.
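The table's defaults and fallbacks can be summarized as a small resolver. This is a sketch under the assumption that the six variables exported above are required and that each optional variable falls back exactly as the table states; the function and constant names are hypothetical, and the released configs may resolve paths differently.

```python
import os
from pathlib import Path

# Assumption: the six variables exported above are treated as required.
REQUIRED_VARS = (
    "CVPR2026_WM_DATA_DIR",
    "CVPR2026_WM_MODEL_CACHE",
    "CVPR2026_WM_OUTPUT_DIR",
    "DA3_MODEL_PATH",
    "PIPER_URDF_PATH",
    "PIPER_GRIPPER_MESH_DIR",
)


def resolve_paths(env=None) -> dict:
    """Apply the table's defaults and fallbacks to an environment mapping."""
    env = dict(os.environ if env is None else env)
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"unset required variables: {', '.join(missing)}")
    out = Path(env["CVPR2026_WM_OUTPUT_DIR"])

    def pick(name, default):
        # Empty strings are treated like unset variables.
        return env.get(name) or default

    return {
        "CVPR2026_WM_PROMPT_EMBEDDING": pick(
            "CVPR2026_WM_PROMPT_EMBEDDING", "asserts/default_prompt_embeds.pth"),
        "RND_MIX_STAGE_ONE_OUTPUT_DIR": pick(
            "RND_MIX_STAGE_ONE_OUTPUT_DIR", str(out / "rnd_mix_stage_one_alltask")),
        "RND_MIX_STAGE_TWO_OUTPUT_DIR": pick(
            "RND_MIX_STAGE_TWO_OUTPUT_DIR", str(out / "rnd_mix_stage_two_alltask")),
        # No default: only stage two needs it.
        "RND_MIX_STAGE_ONE_CHECKPOINT": pick("RND_MIX_STAGE_ONE_CHECKPOINT", None),
        "WORLDARENA_DATA_DIR": pick("WORLDARENA_DATA_DIR", "data/WorldArena"),
        "WORLDARENA_OUTPUT_DIR": pick("WORLDARENA_OUTPUT_DIR", str(out / "worldarena_3d")),
        "WORLDARENA_URDF_PATH": pick("WORLDARENA_URDF_PATH", env["PIPER_URDF_PATH"]),
        "WORLDARENA_GRIPPER_MESH_DIR": pick(
            "WORLDARENA_GRIPPER_MESH_DIR", env["PIPER_GRIPPER_MESH_DIR"]),
        "TORCH_MP_SHARING_STRATEGY": pick("TORCH_MP_SHARING_STRATEGY", "file_descriptor"),
    }
```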

Download the configured pretrained models with:

python scripts/download_pretrained_models.py
python scripts/download_gigabrain_policy.py

Dataset Preparation

CVPR World Model Dataset

Download the dataset and pack the training split:

python scripts/pack_training_data.py \
  --data_dir "$CVPR2026_WM_DATA_DIR" \
  --task all

RND-mix expects three synchronized RGB videos, three DA3 metric-depth videos, simulator replay videos, and qpos (joint positions). DA3 depth videos store depth as uint16 values in millimeters; the transform converts them back to meters before normalization.
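The millimeter storage convention can be illustrated with a pair of helpers. This is a hypothetical NumPy sketch (the repository's transform may read and convert depth differently); note that uint16 millimeters give 1 mm resolution and a maximum representable depth of 65.535 m.

```python
import numpy as np

MM_PER_M = 1000.0


def decode_depth_mm(depth_mm: np.ndarray) -> np.ndarray:
    """Convert a uint16 depth array in millimeters to float32 meters."""
    if depth_mm.dtype != np.uint16:
        raise TypeError(f"expected uint16 depth, got {depth_mm.dtype}")
    return depth_mm.astype(np.float32) / MM_PER_M


def encode_depth_mm(depth_m) -> np.ndarray:
    """Convert metric depth to uint16 millimeters, rounding and clipping to the uint16 range."""
    mm = np.round(np.asarray(depth_m, dtype=np.float64) * MM_PER_M)
    return np.clip(mm, 0, np.iinfo(np.uint16).max).astype(np.uint16)
```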

The RND-mix configs use:

num_frames = sub_frames * rollout + 1 = 8 * 4 + 1 = 33
depth range = [0.08, 1.2] meters
depth RGB encoding = vision_banana
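The frame-count rule and the depth range can be checked before launching a run. The sketch below assumes a flat, hypothetical settings dictionary; the released config files may store these values under different names or nest them.

```python
def rnd_mix_num_frames(sub_frames: int, rollout: int) -> int:
    """Return num_frames = sub_frames * rollout + 1, as in the RND-mix configs."""
    if sub_frames <= 0 or rollout <= 0:
        raise ValueError("sub_frames and rollout must be positive integers")
    return sub_frames * rollout + 1


def check_rnd_mix_settings(settings: dict) -> None:
    """Raise ValueError if the frame count or depth range is inconsistent.

    `settings` is a hypothetical flat dict, not the repository's config object.
    """
    expected = rnd_mix_num_frames(settings["sub_frames"], settings["rollout"])
    if settings["num_frames"] != expected:
        raise ValueError(f"num_frames={settings['num_frames']}, expected {expected}")
    d_min, d_max = settings["depth_range"]
    if not 0.0 < d_min < d_max:
        raise ValueError(f"invalid depth range {settings['depth_range']}")


# Values from the released RND-mix configs listed above.
released = {"sub_frames": 8, "rollout": 4, "num_frames": 33, "depth_range": (0.08, 1.2)}
check_rnd_mix_settings(released)  # passes silently
```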

WorldArena

The packed WorldArena samples used for training and validation must contain:

gt_video_path
condition_video_path
scene_3d_condition_path
replay_3d_condition_path
task_name
episode_name
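A minimal completeness check for one sample, assuming each packed sample can be read as a Python dict keyed by the field names above. The packed file format itself is not described here, and the helper is hypothetical.

```python
from pathlib import Path

WORLDARENA_SAMPLE_KEYS = (
    "gt_video_path",
    "condition_video_path",
    "scene_3d_condition_path",
    "replay_3d_condition_path",
    "task_name",
    "episode_name",
)


def validate_worldarena_sample(sample: dict, check_files: bool = True) -> list[str]:
    """Return a list of problems; an empty list means the sample looks complete."""
    problems = [f"missing key: {key}" for key in WORLDARENA_SAMPLE_KEYS if not sample.get(key)]
    if check_files:
        for key in WORLDARENA_SAMPLE_KEYS:
            value = sample.get(key)
            # Only the *_path fields are expected to point at files on disk.
            if key.endswith("_path") and value and not Path(value).is_file():
                problems.append(f"file not found: {key}={value}")
    return problems
```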

Generate the reference depth:

python scripts/process_depth_da3_worldarena.py \
  --data_root_path "$WORLDARENA_DATA_DIR/train" \
  --device cuda:0 \
  --da3_model_path "$DA3_MODEL_PATH" \
  --urdf_path "$WORLDARENA_URDF_PATH" \
  --gripper_mesh_dir "$WORLDARENA_GRIPPER_MESH_DIR" \
  --only_ref_frame

Generate replay and scene 3D-condition videos:

python scripts/process_3d_condition.py \
  --data_root "$WORLDARENA_DATA_DIR/train" \
  --robot_type agilex \
  --urdf "$WORLDARENA_URDF_PATH" \
  --mesh_dir "$WORLDARENA_GRIPPER_MESH_DIR"

If the validation samples have not been preprocessed yet, repeat both commands with val/ in place of train/.
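The train and val passes can be scripted. The sketch below only assembles the two commands shown above, with the same flags, and runs them in order for each split. It assumes it is executed from the repository root, and the function name is hypothetical.

```python
import os
import subprocess
import sys


def preprocessing_commands(data_dir, split, da3_model_path, urdf_path, mesh_dir, device="cuda:0"):
    """Build the reference-depth and 3D-condition commands for one split."""
    split_root = os.path.join(data_dir, split)
    depth_cmd = [
        sys.executable, "scripts/process_depth_da3_worldarena.py",
        "--data_root_path", split_root,
        "--device", device,
        "--da3_model_path", da3_model_path,
        "--urdf_path", urdf_path,
        "--gripper_mesh_dir", mesh_dir,
        "--only_ref_frame",
    ]
    cond_cmd = [
        sys.executable, "scripts/process_3d_condition.py",
        "--data_root", split_root,
        "--robot_type", "agilex",
        "--urdf", urdf_path,
        "--mesh_dir", mesh_dir,
    ]
    return [depth_cmd, cond_cmd]  # depth first, then 3D conditions, as above


if __name__ == "__main__":
    env = os.environ
    for split in ("train", "val"):
        for cmd in preprocessing_commands(
            env.get("WORLDARENA_DATA_DIR") or "data/WorldArena",
            split,
            env["DA3_MODEL_PATH"],
            env.get("WORLDARENA_URDF_PATH") or env["PIPER_URDF_PATH"],
            env.get("WORLDARENA_GRIPPER_MESH_DIR") or env["PIPER_GRIPPER_MESH_DIR"],
        ):
            subprocess.run(cmd, check=True)  # stop at the first failing step
```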

RND-Mix Stage One

Stage one trains on vertically stacked RGB and metric-depth frames. Replay images are the only condition branch.
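To make the stacked layout concrete, the sketch below places each RGB frame above its depth frame along the height axis. It is an assumption-laden illustration: the vision_banana depth encoding is not documented here, so a plain grayscale-to-three-channel mapping (hypothetical `depth_to_rgb_placeholder`) stands in for it, and the RGB-on-top order is a guess.

```python
import numpy as np


def depth_to_rgb_placeholder(depth_norm):
    """Hypothetical stand-in for vision_banana: replicate normalized depth to 3 channels.

    depth_norm: (T, H, W) floats in [0, 1]. Returns uint8 (T, H, W, 3).
    """
    gray = np.clip(np.asarray(depth_norm, dtype=np.float32), 0.0, 1.0)
    gray = np.round(gray * 255.0).astype(np.uint8)
    return np.repeat(gray[..., None], 3, axis=-1)


def stack_rgb_over_depth(rgb, depth_rgb):
    """Stack RGB frames above depth frames: (T, H, W, 3) x 2 -> (T, 2H, W, 3)."""
    if rgb.shape != depth_rgb.shape:
        raise ValueError(f"shape mismatch: {rgb.shape} vs {depth_rgb.shape}")
    return np.concatenate([rgb, depth_rgb], axis=1)  # axis 1 is height
```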

Train

python scripts/launch_train.py --preset baseline_rnd_mix_stage_one_alltask

The main config is:

iMac/configs/
  baseline_wm_rnd_mix_stage_one_alltask.py

Offline Test

python scripts/inference_rnd_mix_stage_one.py \
  --transformer_model_path /path/to/stage_one/checkpoint/transformer \
  --device_list 0 \
  --mode offline \
  --task task1 \
  --data_dir "$CVPR2026_WM_DATA_DIR" \
  --output_dir "$CVPR2026_WM_OUTPUT_DIR/rnd_mix_stage_one_eval"

The inference defaults match the released training config: vision_banana depth RGB encoding, a depth range of [0.08, 1.2] meters, and square-root depth normalization.
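One plausible form of square-root depth normalization over the [0.08, 1.2] m range is sketched below. The exact formula (including any final rescaling, for example to [-1, 1]) lives in the repository's transforms and may differ; the function names are hypothetical.

```python
import numpy as np

DEPTH_MIN_M, DEPTH_MAX_M = 0.08, 1.2  # released RND-mix depth range


def sqrt_normalize_depth(depth_m, d_min=DEPTH_MIN_M, d_max=DEPTH_MAX_M):
    """Clip metric depth to [d_min, d_max], rescale to [0, 1], then take the square root."""
    clipped = np.clip(np.asarray(depth_m, dtype=np.float32), d_min, d_max)
    linear = np.clip((clipped - d_min) / (d_max - d_min), 0.0, 1.0)  # guard against rounding
    return np.sqrt(linear)


def sqrt_denormalize_depth(norm, d_min=DEPTH_MIN_M, d_max=DEPTH_MAX_M):
    """Invert sqrt_normalize_depth for values inside the clipped range."""
    norm = np.clip(np.asarray(norm, dtype=np.float32), 0.0, 1.0)
    return d_min + np.square(norm) * (d_max - d_min)
```

The square root is steepest near zero, so nearby depths occupy a larger share of the [0, 1] range than distant ones; depths outside [0.08, 1.2] m are lost to clipping and do not round-trip.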

RND-Mix Stage Two

Stage two initializes from stage one and adds replay_3d_condition and scene_3d_condition. The replay image branch remains active.

Train

export RND_MIX_STAGE_ONE_CHECKPOINT=/path/to/stage_one/checkpoint/transformer

python scripts/launch_train.py --preset baseline_rnd_mix_stage_two_alltask

The main config is:

iMac/configs/
  baseline_wm_rnd_mix_stage_two_alltask.py

Stage-two 3D-condition generation needs the Piper URDF and mesh paths even when DA3 loading is disabled, because robot forward kinematics is still used.
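Before launching stage two, it may help to confirm that `RND_MIX_STAGE_ONE_CHECKPOINT` points at a transformer directory. The sketch assumes the checkpoint was written with Diffusers' `save_pretrained`, which places a `config.json` next to the weights; the helper is hypothetical.

```python
import os
from pathlib import Path


def check_transformer_dir(path) -> list[str]:
    """Return problems with a checkpoint path meant for RND_MIX_STAGE_ONE_CHECKPOINT."""
    p = Path(path)
    if not p.is_dir():
        return [f"not a directory: {p}"]
    problems = []
    if p.name != "transformer":
        problems.append(f"expected a directory named 'transformer', got '{p.name}'")
    # Assumption: a Diffusers-style save writes config.json alongside the weights.
    if not (p / "config.json").is_file():
        problems.append(f"no config.json in {p}")
    return problems


if __name__ == "__main__":
    ckpt = os.environ.get("RND_MIX_STAGE_ONE_CHECKPOINT", "")
    issues = check_transformer_dir(ckpt) if ckpt else ["RND_MIX_STAGE_ONE_CHECKPOINT is not set"]
    print("\n".join(issues) or "checkpoint directory looks usable")
```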

Offline Test

python scripts/inference_rnd_mix_stage_two.py \
  --transformer_model_path /path/to/stage_two/checkpoint/transformer \
  --device_list 0 \
  --mode offline \
  --task task1 \
  --data_dir "$CVPR2026_WM_DATA_DIR" \
  --output_dir "$CVPR2026_WM_OUTPUT_DIR/rnd_mix_stage_two_eval" \
  --da3_model_path "$DA3_MODEL_PATH" \
  --da3_urdf_path "$PIPER_URDF_PATH" \
  --da3_gripper_mesh_dir "$PIPER_GRIPPER_MESH_DIR"

Use --max_episodes N for a smoke test. Existing non-empty episode videos are skipped by default; pass --no-skip_existing to regenerate them.
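The interaction of these two options can be sketched as below. The output naming scheme (`<episode_id>.mp4`) is a hypothetical placeholder, and the script may apply `--max_episodes` and the skip check in a different order.

```python
from pathlib import Path


def episodes_to_run(episode_ids, output_dir, max_episodes=None, skip_existing=True, suffix=".mp4"):
    """Pick episodes to generate, mimicking --max_episodes and --skip_existing."""
    selected = list(episode_ids)
    if max_episodes is not None:
        selected = selected[:max_episodes]  # smoke test: keep only the first N
    if not skip_existing:
        return selected  # --no-skip_existing: regenerate everything

    out_dir = Path(output_dir)

    def already_done(episode_id):
        video = out_dir / f"{episode_id}{suffix}"  # hypothetical naming scheme
        return video.is_file() and video.stat().st_size > 0  # empty files are regenerated

    return [ep for ep in selected if not already_done(ep)]
```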

Online Test

Start the simulator server:

python simulator/script/run_simulator_server.py \
  --host_port 9151 \
  --save_tag sim9151

Then run:

python scripts/inference_rnd_mix_stage_two.py \
  --transformer_model_path /path/to/stage_two/checkpoint/transformer \
  --device_list 0 \
  --mode online \
  --task task1 \
  --policy_ckpt_dir /path/to/policy \
  --policy_norm_stats_path /path/to/policy/norm_stat_gigabrain.json \
  --simulator_ip 127.0.0.1 \
  --simulator_port 9151 \
  --output_dir "$CVPR2026_WM_OUTPUT_DIR/rnd_mix_stage_two_online"
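Because the inference script connects to the simulator through `--simulator_ip` and `--simulator_port`, a readiness check on that port can avoid starting inference before the server is up. This sketch assumes the server accepts plain TCP connections on `--host_port`; the helper is hypothetical, and a bare connect-and-close may appear in the server log.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True  # something is listening; the socket closes on exit
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


if __name__ == "__main__":
    ready = wait_for_port("127.0.0.1", 9151)
    print("simulator ready" if ready else "simulator did not come up")
```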

WorldArena 3D

The released configuration is:

iMac/configs/0501_worldarena_3d_r1c120.py

It trains WorldArena3DTrainer with replay-image, scene-3D, and replay-3D conditions.

Train

python scripts/launch_train.py --preset worldarena_3d

Validation Inference

python scripts/inference_worldarena_val.py \
  --transformer_model_path /path/to/worldarena/checkpoint/transformer \
  --config_path iMac.configs.0501_worldarena_3d_r1c120.config \
  --dataset_dir "$WORLDARENA_DATA_DIR/val" \
  --save_dir "$CVPR2026_WM_OUTPUT_DIR/worldarena_val" \
  --gpu_ids 0

Use --episodes 40,41,49 to evaluate selected episodes.

Test-Set Inference

For the official test layout with first_frame/fixed_scene_task/*.png, use the testdata preprocessors:

python scripts/process_depth_da3_worldarena_testdata.py \
  --data_root_path /path/to/worldarena_test \
  --device cuda:0 \
  --da3_model_path "$DA3_MODEL_PATH" \
  --urdf_path "$WORLDARENA_URDF_PATH" \
  --gripper_mesh_dir "$WORLDARENA_GRIPPER_MESH_DIR"

python scripts/process_3d_condition_worldarena_testdata.py \
  --data_root /path/to/worldarena_test \
  --robot_type agilex \
  --urdf "$WORLDARENA_URDF_PATH" \
  --mesh_dir "$WORLDARENA_GRIPPER_MESH_DIR"

Then run:

python scripts/inference_worldarena_test.py \
  --transformer_model_path /path/to/worldarena/checkpoint/transformer \
  --config_path iMac.configs.0501_worldarena_3d_r1c120.config \
  --dataset_paths /path/to/worldarena_test \
  --save_dir "$CVPR2026_WM_OUTPUT_DIR/worldarena_test" \
  --gpu_ids 0

Validation

The repository has no top-level unit-test suite. Before a full training run, run these smoke checks:

python -m compileall \
  iMac \
  scripts \
  scripts/process_depth_da3_worldarena.py \
  scripts/process_3d_condition.py

python scripts/launch_train.py --help
python scripts/inference_rnd_mix_stage_one.py --help
python scripts/inference_rnd_mix_stage_two.py --help
python scripts/inference_worldarena_val.py --help

For config changes, verify that:

num_frames = sub_frames * rollout + 1.
Stage-two depth encoding matches stage one.
RND_MIX_STAGE_ONE_CHECKPOINT resolves to a transformer directory.
WorldArena samples include all three condition paths.
qpos samples stay aligned with video/depth samples.
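The last check can be enforced by slicing every stream with one shared index array. This sketch assumes qpos is recorded once per video frame, which may not hold for every dataset; the function name is hypothetical.

```python
import numpy as np


def sample_aligned_clip(rgb, depth, qpos, start: int, num_frames: int):
    """Slice RGB, depth and qpos with the same frame indices so they stay aligned."""
    lengths = {"rgb": len(rgb), "depth": len(depth), "qpos": len(qpos)}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"stream lengths differ: {lengths}")
    if start < 0 or start + num_frames > lengths["rgb"]:
        raise IndexError(f"clip [{start}, {start + num_frames}) exceeds {lengths['rgb']} frames")
    idx = np.arange(start, start + num_frames)  # one index array shared by all streams
    return rgb[idx], depth[idx], qpos[idx]
```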
