WorldArena 2.0

Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

Overview

WorldArena 2.0 is a standardized benchmark for embodied world models that extends the original WorldArena along three coordinated axes: modality, functionality, and platform. It introduces visuo-tactile evaluation, treats world models as online RL environments, and evaluates across RoboTwin 2.0, LIBERO, and the AgileX split-type ALOHA robot. The goal is not only to measure perceptual quality, but also to verify whether world models remain useful when they must support manipulation, policy improvement, and sim-to-real deployment.

From WorldArena to WorldArena 2.0

[Figure: WorldArena 2.0 overview diagram]
Modality: Vision-only → Visuo-Tactile

  • Vision-only: Single-modality prediction and perception.
  • Visuo-Tactile: Contact-aware evaluation with tactile signals.

Functionality: Offline Evaluation → Online RL

  • Offline Evaluation: Static scoring of video and tasks.
  • Online RL: Interactive rollouts for policy improvement.

Platform: Simulator-only → Real Robot

  • Simulator-only: Limited to virtual environments.
  • Real Robot: Cross-embodiment sim-to-real evaluation.

Visuo-Tactile Evaluation

Modality Extension

Vision → Visuo-Tactile

WorldArena 2.0 introduces tactile-aware evaluation so that world models must reason about contact, force, slip, and material interaction.

  • Multimodal perception and prediction
  • Tactile injection via a standardized pipeline
  • Better coverage of contact-rich manipulation

[Figure: Modality illustration. Panels: Insert HDMI, Lift Bottle]

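| Model | Tactile PSNR ↑ | Tactile SSIM ↑ | Insert HDMI success (%) | Lift Bottle success (%) | Avg. success (%) |
|---|---|---|---|---|---|
| ACT (Baseline) | -- | -- | 20 | 80 | 50 |
| Vidar | 13.97 | 0.278 | 70 | 0 | 35 |
| Genie Envisioner | 13.36 | 0.456 | 0 | 0 | 0 |
| Wan2.2 | 21.26 | 0.746 | 100 | 0 | 50 |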
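
As a reading aid for the tactile prediction columns: PSNR is conventionally defined as 10·log10(MAX² / MSE) between a predicted and a ground-truth signal. The sketch below is a minimal pure-Python illustration, not the benchmark's evaluation code. It assumes tactile frames are flattened sequences of intensities in [0, max_val]; the function names `psnr` and `mean_success` are hypothetical. The Avg. column appears to be the plain mean of the two per-task success rates, which the second helper reproduces for the Vidar row.

```python
import math


def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat sequences.

    Assumption: tactile frames are flattened intensity values in [0, max_val].
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and have the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)


def mean_success(rates):
    """Average of per-task success rates (in percent)."""
    return sum(rates) / len(rates)


# Toy frames, not benchmark data.
predicted = [0, 128, 255, 64]
reference = [0, 130, 250, 64]
print(f"PSNR: {psnr(predicted, reference):.2f} dB")

# Vidar row: Insert HDMI = 70, Lift Bottle = 0.
print(mean_success([70, 0]))  # 35.0
```

Higher PSNR means the predicted tactile frame is closer to the reference, so the table's upward arrow reads as "higher is better".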

Evaluation as RL Environments

Functionality Extension

Offline Evaluation → Online RL

The benchmark evaluates whether world models can serve as interactive environments that help train and improve embodied policies, not just predict future frames.

  • Closed-loop rollouts with policy updates
  • Reward-aware state transitions
  • Long-horizon robustness under compounding errors
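
The idea of using a world model as an online environment can be sketched as a reset/step wrapper around the model's next-state prediction. Everything below is a toy under stated assumptions: `ToyWorldModel`, `WorldModelEnv`, the 1-D state, the distance-based reward, and the hill-climbing update are hypothetical stand-ins chosen for brevity. They are not the benchmark's models, reward functions, or RL algorithm.

```python
from dataclasses import dataclass


class ToyWorldModel:
    """Hypothetical stand-in for a learned world model with a 1-D state."""

    def predict(self, state: float, action: float) -> float:
        # A real model would predict the next observation (e.g. video frames).
        return state + action


@dataclass
class WorldModelEnv:
    """Exposes a world model through a reset/step interface."""

    model: ToyWorldModel
    goal: float = 5.0
    tolerance: float = 0.5
    horizon: int = 20
    state: float = 0.0
    steps: int = 0

    def reset(self) -> float:
        self.state, self.steps = 0.0, 0
        return self.state

    def step(self, action: float):
        # The state transition comes from the model, not from a simulator.
        self.state = self.model.predict(self.state, action)
        self.steps += 1
        distance = abs(self.goal - self.state)
        reward = -distance  # hypothetical proxy reward
        done = distance < self.tolerance or self.steps >= self.horizon
        return self.state, reward, done


def rollout(env: WorldModelEnv, gain: float) -> float:
    """Run one closed-loop episode with a proportional policy; return total reward."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        action = gain * (env.goal - state)  # policy acts on the predicted state
        state, reward, done = env.step(action)
        total += reward
    return total


def improve_policy(env: WorldModelEnv, gain: float, delta: float = 0.1, iters: int = 30) -> float:
    """Hill-climb the policy gain using only world-model rollouts."""
    best = rollout(env, gain)
    for _ in range(iters):
        candidates = [gain + delta, gain - delta]
        scored = [(rollout(env, c), c) for c in candidates]
        top_score, top_gain = max(scored)
        if top_score <= best:
            break  # neither neighbouring gain improves the return
        gain, best = top_gain, top_score
    return gain


env = WorldModelEnv(ToyWorldModel())
before = rollout(env, 0.1)
tuned = improve_policy(env, 0.1)
after = rollout(env, tuned)
print(f"return before={before:.2f}, after={after:.2f}, gain={tuned:.2f}")
```

Because every transition is produced by the model, prediction errors feed back into later steps of the same episode, which is why long-horizon robustness matters in this setting.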

[Figure: RL illustration. Panels: Click Bell, Adjust Bottle]

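| Method | Proxy-based: Click Bell | Proxy-based: Adjust Bottle | VLM-based: Click Bell | VLM-based: Adjust Bottle | Similarity-based: Click Bell | Similarity-based: Adjust Bottle |
|---|---|---|---|---|---|---|
| SFT | 43.75 | 55.08 | 43.75 | 55.08 | 43.75 | 55.08 |
| Simulator-based RL | 87.30 | 78.90 | 87.45 | 78.90 | 87.45 | 78.90 |
| OpenSora | 56.25 | 60.16 | 55.27 | 57.03 | 53.13 | 58.00 |
| IRASim | 53.13 | 61.33 | 53.52 | 58.98 | 50.78 | 59.38 |
| iVideoGPT | 52.53 | 56.25 | 48.44 | 58.59 | 52.15 | 60.93 |
| Cosmos-Predict-2.5(action) | 67.38 | 63.48 | 54.10 | 58.40 | 63.09 | 61.13 |
| RoboScape | 68.75 | 60.74 | 55.46 | 59.38 | 63.48 | 59.18 |
| Ctrl-World | 69.53 | 70.70 | 66.80 | 65.04 | 69.92 | 66.02 |
| WoVR | 75.00 | 67.19 | 69.38 | 64.45 | 72.07 | 61.35 |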
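
One possible way to read these numbers is to place each world model between the SFT starting point and the simulator-based RL result. The snippet below does this for the proxy-based Click Bell column only, using values copied from the table. The helper names and the "gap recovered" ratio are hypothetical reading aids, not metrics defined by the benchmark, and the scores are assumed to be success rates in percent.

```python
# Proxy-based reward, Click Bell column, copied from the table.
SFT = 43.75      # policy before any RL
SIM_RL = 87.30   # RL in the ground-truth simulator

world_models = {
    "OpenSora": 56.25,
    "IRASim": 53.13,
    "iVideoGPT": 52.53,
    "Cosmos-Predict-2.5(action)": 67.38,
    "RoboScape": 68.75,
    "Ctrl-World": 69.53,
    "WoVR": 75.00,
}


def gain_over_sft(score, sft=SFT):
    """Absolute improvement over the SFT policy, in points."""
    return score - sft


def gap_recovered(score, sft=SFT, upper=SIM_RL):
    """Fraction of the SFT-to-simulator-RL gap closed (hypothetical ratio)."""
    return (score - sft) / (upper - sft)


ranked = sorted(world_models.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name:28s} +{gain_over_sft(score):5.2f} pts  "
          f"{gap_recovered(score):.0%} of gap")
```

In this column every world model improves on SFT, while none reaches the simulator-based RL score.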

Analysis of world model and policy interaction steps

When evaluating world models as RL environments, we also analyzed how policy success rates change as the number of policy-environment interaction steps increases; the results are shown in the figure below. Nearly all models can guide policy updates to varying degrees as the number of interaction steps grows.

[Figure: Policy success rate vs. interaction steps]

Sim-to-Real Evaluation

Platform Extension

Simulator → Real Robot

WorldArena 2.0 evaluates cross-embodiment generalization from simulation to a real robotic platform, exposing the sim-to-real gap directly.

  • RoboTwin 2.0 and LIBERO simulators
  • AgileX split-type ALOHA physical robot
  • Deployment-oriented performance signal

Simulator: RoboTwin 2.0

Large-scale bimanual manipulation environment used to test robustness under domain randomization.

Tasks: Adjust Bottle, Click Bell

Simulator: LIBERO

Language-conditioned manipulation benchmark to diagnose knowledge transfer under structured tasks.

Task: Turn On Stove

Real World: AgileX ALOHA

Physical robot evaluation platform built around an AgileX split-type teleoperation system.

Tasks: Pour Water, Wipe Table

Columns: Model | RoboTwin 2.0: Data Engine (Task1, Task2), Action Planner (Task1, Task2) | LIBERO: Data Engine (Task1), Action Planner (Task1) | Real‑World: Data Engine (Task1, Task2), Action Planner (Task1, Task2)
GigaWorld213619000000
Genie Envisioner72110202600020
TesserAct135135343800030
Vidar135321922144003010
Wan2.2154112201024100100
CogVideoX328816021010050
Task success rates (%) of world models as embodied data engines and action planners across three platforms.

Real World Demos

Pour Water

Success
Failure Case 1
Failure Case 2

Wipe Table

Success
Failure Case 1
Failure Case 2