Towards Practical World Model-based Reinforcement Learning for Vision-Language-Action Models
VLA-MBPO is a practical world model-based reinforcement learning framework for fine-tuning vision-language-action (VLA) policies. It trains and serves a unified multimodal world model for pixel-level dynamics and rewards, uses interleaved multi-view decoding for LIBERO-style observations, and updates VLA policies with chunk-level branched rollouts in the learned world model.
This repository contains the official implementation. It is built on OpenPI for VLA policy training and includes the world-model training and serving code used by VLA-MBPO.
- Unified multimodal world model for joint visual dynamics and reward prediction.
- Interleaved head-view and wrist-view decoding for multi-view consistency.
- Chunk-level branched rollout for stable world model-based policy optimization.
- Websocket-based world-model and LIBERO environment servers for distributed RL.
- Training configs for LIBERO Spatial, Object, Goal, and Long-Horizon suites.
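
As a rough illustration of the chunk-level branched rollout idea, the sketch below starts from a batch of observations and alternates one policy call (producing a whole action chunk) with one world-model call (consuming that chunk). It is a toy under stated assumptions, not the repository's implementation: `branched_rollout`, `toy_policy`, `toy_world_model`, and all shapes are hypothetical stand-ins, and the real world model predicts multi-view images rather than small vectors.

```python
import numpy as np


def branched_rollout(start_obs, policy, world_model, num_chunks):
    """Roll out `num_chunks` action chunks inside a learned model.

    Returns the list of imagined transitions and the final observation batch.
    """
    obs = start_obs
    transitions = []
    for _ in range(num_chunks):
        # One policy call yields a whole chunk of actions per batch element.
        action_chunk = policy(obs)
        # One model call consumes the whole chunk and predicts the
        # observation after it plus a single chunk-level reward.
        next_obs, reward = world_model(obs, action_chunk)
        transitions.append((obs, action_chunk, reward, next_obs))
        obs = next_obs
    return transitions, obs


# Toy stand-ins (hypothetical): a linear "policy" and additive "dynamics".
CHUNK_LEN, ACTION_DIM, OBS_DIM = 4, 2, 2


def toy_policy(obs):
    # Repeat a scaled copy of the observation as every action in the chunk.
    return np.repeat(-0.1 * obs[:, None, :], CHUNK_LEN, axis=1)


def toy_world_model(obs, action_chunk):
    # Next observation is the current one plus the summed chunk of actions.
    next_obs = obs + action_chunk.sum(axis=1)
    reward = -np.linalg.norm(next_obs, axis=-1)
    return next_obs, reward


# Branch from a batch of start observations, e.g. sampled from real data.
start = np.ones((3, OBS_DIM))
transitions, final_obs = branched_rollout(
    start, toy_policy, toy_world_model, num_chunks=2
)
```

Keeping `num_chunks` small is the usual motivation for branching: short imagined segments limit how far model error can compound before the rollout is cut off.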
.
|-- src/openpi/ # VLA policy models, data pipelines, and RL configs
|-- scripts/train_libero_rl.py # Main VLA-MBPO policy optimization entrypoint
|-- train_rl.sh # End-to-end launcher template
|-- libero_client/ # Websocket LIBERO environment server/client
|-- world_model/ # Unified multimodal world-model training code
|-- world_model/world_model_client/
| `-- websocket_world_model_server.py
|-- examples/ # OpenPI examples and LIBERO evaluation helpers
`-- packages/openpi-client/ # Policy client package inherited from OpenPI
VLA-MBPO currently uses two Python environments:
- root environment: OpenPI policy training and VLA-MBPO RL
- world_model environment: unified multimodal world-model training and serving
The root project uses Python 3.11 and uv.
git submodule update --init --recursive
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
export PYTHONPATH="${PYTHONPATH:-}:$PWD/third_party/libero"
Install the OpenPI client package when running remote clients or examples:
uv pip install -e packages/openpi-client
The LIBERO server depends on the in-repository LIBERO checkout.
uv pip install -e third_party/libero
Some users prefer a separate LIBERO virtual environment because MuJoCo,
robosuite, and rendering dependencies can be sensitive to CUDA/EGL setup. The
launcher template supports this with LIBERO_VENV=/path/to/venv.
The world-model code uses Python 3.10.
cd world_model
conda create -n uni-plan python=3.10 -y
conda activate uni-plan
pip install -r requirements.txt
If flash_attn cannot be built from source, see world_model/README.md for
wheel-based installation guidance.
The public model and dataset artifacts are not bundled in this repository yet, so we recommend training the models yourself. For data collection and SFT model training, refer to the OpenPI repository. For world-model training, refer to uni-plan. We also provide our training code in this repository. Before running the full training workflow, prepare the following local paths:
| Variable | Description |
|---|---|
| MODEL_CONFIG_PATH | Base UMM/Bagel config directory |
| WORLD_MODEL_CKPT | Fine-tuned world-model checkpoint |
| ACTION_NORM_PATH | Action normalizer JSON for the selected task suite |
| PRETRAINED_POLICY_PATH | Initial VLA policy checkpoint for RL fine-tuning |
| LIBERO LeRobot datasets | Converted offline data used by pi05_libero_* configs |
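
Before launching a long run, it can help to confirm that the four path variables above are set and point at something on disk. The following is a minimal sketch of such a check. The helper name `find_path_problems` is hypothetical and not part of this repository, and the check only tests existence, not whether the contents are valid checkpoints or configs.

```python
import os
from pathlib import Path

# Variable names taken from the table above.
REQUIRED_PATH_VARS = (
    "MODEL_CONFIG_PATH",
    "WORLD_MODEL_CKPT",
    "ACTION_NORM_PATH",
    "PRETRAINED_POLICY_PATH",
)


def find_path_problems(env=None):
    """Return a list of human-readable problems; empty means all paths exist."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_PATH_VARS:
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).exists():
            problems.append(f"{name} points to a missing path: {value}")
    return problems
```

Calling `find_path_problems()` with no arguments reads the current process environment; passing a dictionary makes the check easy to try without exporting anything.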
The active LIBERO RL configs are defined in src/openpi/training/config.py:
- pi05_libero_spatial
- pi05_libero_object
- pi05_libero_goal
- pi05_libero_long
VLA-MBPO policy optimization runs three processes, started in this order:
- Start the world-model websocket server.
- Start the LIBERO websocket rollout server.
- Run scripts/train_libero_rl.py.
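
When starting the three processes by hand, the training script should only be launched once both servers accept connections. One way to enforce that ordering is to poll the ports first. The sketch below is an assumption-laden helper, not code from this repository: `wait_for_port` is a hypothetical name, and a successful TCP connection only shows that a server is listening, not that its model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, poll_s=0.5):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection succeeds before `timeout_s` elapses.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)
```

For example, assuming both servers run on the local machine with the default ports, `wait_for_port("localhost", 8112)` followed by `wait_for_port("localhost", 8113)` would gate the call to scripts/train_libero_rl.py.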
The integrated launcher is:
MODEL_CONFIG_PATH=/path/to/BAGEL-7B-MoT \
WORLD_MODEL_CKPT=/path/to/world_model_checkpoint \
ACTION_NORM_PATH=/path/to/action_normalizer.json \
PRETRAINED_POLICY_PATH=/path/to/pretrained_policy/params \
CONFIG_NAME=pi05_libero_goal \
EXP_NAME=libero_goal_vla_mbpo \
bash train_rl.sh
Useful launcher variables:
| Variable | Default | Description |
|---|---|---|
| CONFIG_NAME | pi05_libero_goal | RL config name |
| EXP_NAME | vla_mbpo_libero | Checkpoint and logging run name |
| LIBERO_TASK_SUITE | libero_goal | LIBERO suite served by libero_client |
| LIBERO_PORT | 8113 | LIBERO websocket server port |
| WORLD_MODEL_PORT | 8112 | World-model websocket server port |
| GPU_IDS | 0 1 2 3 | World-model worker GPU ids |
| NUM_WORKERS | 4 | Number of world-model workers |
| NUM_ENVS | 32 | Parallel LIBERO environments |
| CUDA_VISIBLE_DEVICES | 0,1,2,3 | Devices used by policy training |
| WANDB_MODE | offline | Weights and Biases mode |
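
To preview which values a run would use, the defaults in the table can be mirrored in a small Python helper that overlays the current environment. This is only a sketch: `resolve_launcher_settings` is a hypothetical name, train_rl.sh remains the source of truth for the defaults, and the GPU-count check assumes one world-model worker per listed GPU id, which the table suggests but does not state.

```python
import os

# Defaults copied from the launcher variable table above.
LAUNCHER_DEFAULTS = {
    "CONFIG_NAME": "pi05_libero_goal",
    "EXP_NAME": "vla_mbpo_libero",
    "LIBERO_TASK_SUITE": "libero_goal",
    "LIBERO_PORT": "8113",
    "WORLD_MODEL_PORT": "8112",
    "GPU_IDS": "0 1 2 3",
    "NUM_WORKERS": "4",
    "NUM_ENVS": "32",
    "CUDA_VISIBLE_DEVICES": "0,1,2,3",
    "WANDB_MODE": "offline",
}


def resolve_launcher_settings(env=None):
    """Overlay environment values on the defaults and collect warnings."""
    env = os.environ if env is None else env
    # Treat an empty variable like an unset one and fall back to the default.
    settings = {
        name: env.get(name) or default
        for name, default in LAUNCHER_DEFAULTS.items()
    }
    warnings = []
    # Two servers on the same host cannot listen on the same port.
    if settings["LIBERO_PORT"] == settings["WORLD_MODEL_PORT"]:
        warnings.append("LIBERO_PORT and WORLD_MODEL_PORT are identical")
    # Assumption: one world-model worker per GPU id in GPU_IDS.
    if len(settings["GPU_IDS"].split()) != int(settings["NUM_WORKERS"]):
        warnings.append("NUM_WORKERS does not match the number of GPU_IDS")
    return settings, warnings
```

Calling `resolve_launcher_settings()` with no arguments reads the process environment, so exporting `CONFIG_NAME=pi05_libero_spatial` beforehand would show up in the returned settings.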
You can also run the RL entrypoint directly after starting both servers:
python scripts/train_libero_rl.py pi05_libero_goal \
--algorithm ppo \
--exp_name libero_goal_vla_mbpo \
--pretrained_path /path/to/pretrained_policy/params \
--port 8113 \
--world_model_port 8112 \
--overwrite
World-model training scripts are under world_model/scripts/:
cd world_model
bash scripts/train_libero.sh
bash scripts/train_robotwin.sh
bash scripts/train_aloha.sh
Set dataset, checkpoint, output, and logging paths through environment
variables before launching the scripts. See world_model/README.md for the
underlying UniPlan/Bagel setup and training details.
For standalone LIBERO policy evaluation, see examples/libero/README.md.
For VLA-MBPO training, the rollout server is:
python libero_client/libero_websocket_server.py \
--task-suite-name libero_goal \
--num-envs 32 \
--port 8113 \
--use-rel-reward
The server exposes batched reset, step, and chunk_step websocket methods
used by scripts/train_libero_rl.py.
If you find this code useful, please cite:
@article{zhang2026vlambpo,
title={Towards Practical World Model-based Reinforcement Learning for Vision-Language-Action Models},
author={Zhang, Zhilong and Ren, Haoxiang and Sun, Yihao and Sheng, Yifei and Wang, Haonan and Lin, Haoxin and Wu, Zhichao and Bacon, Pierre-Luc and Yu, Yang},
journal={arXiv preprint arXiv:2603.20607},
year={2026}
}
This codebase builds on OpenPI, LIBERO, LeRobot, and the Bagel/UniPlan world model stack. We thank the authors and maintainers of these projects.
This repository is released under the Apache-2.0 license. See LICENSE for
details.