WorldArena 2.0

Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

Overview

WorldArena 2.0 is a standardized benchmark for embodied world models that extends the original WorldArena along three coordinated axes: modality, functionality, and platform. It introduces visuo-tactile evaluation, treats world models as online RL environments, and evaluates across RoboTwin 2.0, LIBERO, and the AgileX split-type ALOHA robot. The goal is not only to measure perceptual quality, but also to verify whether world models remain useful when they must support manipulation, policy improvement, and sim-to-real deployment.

From WorldArena to WorldArena 2.0

Modality: Vision-only

Single-modality prediction and perception.

Modality: Visuo-Tactile

Contact-aware evaluation with tactile signals.

Functionality: Offline Evaluation

Static scoring of video and tasks.

Functionality: Online RL

Interactive rollouts for policy improvement.

Platform: Simulator-only

Limited to virtual environments.

Platform: Real Robot

Cross-embodiment sim-to-real evaluation.

Visuo-Tactile Evaluation

Modality Extension

Vision → Visuo-Tactile

WorldArena 2.0 introduces tactile-aware evaluation so that world models must reason about contact, force, slip, and material interaction.

Multimodal perception and prediction
Tactile injection via a standardized pipeline
Better coverage of contact-rich manipulation

Tasks shown: Insert HDMI, Lift Bottle

Model	Tactile PSNR ↑	Tactile SSIM ↑	Insert HDMI Success (%)	Lift Bottle Success (%)	Avg. Success (%)
ACT (Baseline)	--	--	20	80	50
Vidar	13.97	0.278	70	0	35
Genie Envisioner	13.36	0.456	0	0	0
Wan2.2	21.26	0.746	100	0	50
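
For readers unfamiliar with the PSNR column, the following is a minimal sketch of the standard peak signal-to-noise ratio computation. It assumes 8-bit intensities and flattened frames; the benchmark's actual tactile image format, resolution, and averaging protocol are not described on this page, and the sample frames are hypothetical. SSIM is omitted because it requires windowed statistics.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are flat sequences of pixel intensities in [0, max_value].
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 tactile frames, flattened row by row (illustration only).
target = [0, 64, 128, 255]
pred = [10, 54, 138, 245]
score = psnr(pred, target)
print(f"PSNR: {score:.2f} dB")
```

Higher values mean the predicted tactile frame is closer to the ground truth, which is why the column is marked with ↑.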

Evaluation as RL Environments

Functionality Extension

Offline Evaluation → Online RL

The benchmark evaluates whether world models can serve as interactive environments that help train and improve embodied policies, not just predict future frames.

Closed-loop rollouts with policy updates
Reward-aware state transitions
Long-horizon robustness under compounding errors
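
The closed-loop idea in the list above can be sketched roughly as follows. This is a toy illustration, not the benchmark's implementation: `ToyWorldModelEnv`, its 1-D dynamics, the distance-based reward, and the hill-climbing update are all hypothetical stand-ins for a learned world model, a reward model, and an RL algorithm.

```python
import random


class ToyWorldModelEnv:
    """Hypothetical stand-in for a world model wrapped as an RL environment."""

    def __init__(self, goal=5.0, horizon=10, noise=0.0, seed=0):
        self.goal = goal
        self.horizon = horizon
        self.noise = noise  # per-step prediction error of the "world model"
        self.rng = random.Random(seed)
        self.state = 0.0
        self.t = 0

    def reset(self):
        self.state = 0.0
        self.t = 0
        return self.state

    def step(self, action):
        # Predicted next state; the noise term compounds over long horizons.
        self.state += action + self.rng.gauss(0.0, self.noise)
        self.t += 1
        reward = -abs(self.goal - self.state)  # reward-aware transition
        done = self.t >= self.horizon
        return self.state, reward, done


def rollout(env, gain):
    """Run one closed-loop episode with a proportional policy."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = gain * (env.goal - state)
        state, reward, done = env.step(action)
        total += reward
    return total


def improve_policy(env, gain=0.1, iterations=50, step_size=0.05, seed=0):
    """Hill-climb the policy gain using returns from world-model rollouts."""
    rng = random.Random(seed)
    best = rollout(env, gain)
    for _ in range(iterations):
        candidate = gain + rng.uniform(-step_size, step_size)
        ret = rollout(env, candidate)
        if ret > best:  # keep the update only if the rollout return improves
            gain, best = candidate, ret
    return gain, best


env = ToyWorldModelEnv()
initial_return = rollout(env, gain=0.1)
tuned_gain, tuned_return = improve_policy(env)
print(initial_return, tuned_return)
```

Setting `noise` above zero in this toy makes the effect of compounding prediction errors on the policy update easy to observe.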

Tasks shown: Click Bell, Adjust Bottle

Method	Proxy-based: Click Bell	Proxy-based: Adjust Bottle	VLM-based: Click Bell	VLM-based: Adjust Bottle	Similarity-based: Click Bell	Similarity-based: Adjust Bottle
SFT	43.75	55.08	43.75	55.08	43.75	55.08
Simulator-based RL	87.30	78.90	87.45	78.90	87.45	78.90
OpenSora	56.25	60.16	55.27	57.03	53.13	58.00
IRASim	53.13	61.33	53.52	58.98	50.78	59.38
iVideoGPT	52.53	56.25	48.44	58.59	52.15	60.93
Cosmos-Predict-2.5(action)	67.38	63.48	54.10	58.40	63.09	61.13
RoboScape	68.75	60.74	55.46	59.38	63.48	59.18
Ctrl-World	69.53	70.70	66.80	65.04	69.92	66.02
WoVR	75.00	67.19	69.38	64.45	72.07	61.35

Analysis of World Model and Policy Interaction Steps

When evaluating world models as RL environments, we also analyzed how policy success rates change as the number of policy-environment interaction steps increases; the figure below shows the results. Nearly all models guide policy updates to some degree as the number of interaction steps grows.

Figure: Policy success rate vs. number of interaction steps

Sim-to-Real Evaluation

Platform Extension

Simulator → Real Robot

WorldArena 2.0 evaluates cross-embodiment generalization from simulation to a real robotic platform, exposing the sim-to-real gap directly.

RoboTwin 2.0 and LIBERO simulators
AgileX split-type ALOHA physical robot
Deployment-oriented performance signal

Simulator

RoboTwin 2.0

Large-scale bimanual manipulation environment used to test robustness under domain randomization.

Tasks: Adjust Bottle, Click Bell

Simulator

LIBERO

Language-conditioned manipulation benchmark to diagnose knowledge transfer under structured tasks.

Turn On Stove

Real World

AgileX ALOHA

Physical robot evaluation platform built around an AgileX split-type teleoperation system.

Tasks: Pour Water, Wipe Table

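Model	RoboTwin 2.0 / Data Engine / Task1	RoboTwin 2.0 / Data Engine / Task2	RoboTwin 2.0 / Action Planner / Task1	RoboTwin 2.0 / Action Planner / Task2	LIBERO / Data Engine / Task1	LIBERO / Action Planner / Task1	Real-World / Data Engine / Task1	Real-World / Data Engine / Task2	Real-World / Action Planner / Task1	Real-World / Action Planner / Task2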
GigaWorld	2	13	6	19	0	0	0	0	0	0
Genie Envisioner	7	21	10	20	2	6	0	0	0	20
TesserAct	1	35	1	35	34	38	0	0	0	30
Vidar	13	53	2	19	22	14	40	0	30	10
Wan2.2	15	41	12	20	10	24	10	0	10	0
CogVideoX	3	28	8	16	0	2	10	10	0	50

Task success rates (%) of world models as embodied data engines and action planners across three platforms.
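
To compare platforms at a glance, the per-cell success rates above can be averaged per platform. The sketch below is illustrative only: the unweighted mean across roles and tasks is an assumption made here, not a score defined by the benchmark, and `RESULTS` simply copies the table values in column order.

```python
# Success rates (%) copied from the table, in column order:
# RoboTwin 2.0 (Data Engine Task1, Task2; Action Planner Task1, Task2),
# LIBERO (Data Engine Task1; Action Planner Task1),
# Real-World (Data Engine Task1, Task2; Action Planner Task1, Task2).
RESULTS = {
    "GigaWorld": [2, 13, 6, 19, 0, 0, 0, 0, 0, 0],
    "Genie Envisioner": [7, 21, 10, 20, 2, 6, 0, 0, 0, 20],
    "TesserAct": [1, 35, 1, 35, 34, 38, 0, 0, 0, 30],
    "Vidar": [13, 53, 2, 19, 22, 14, 40, 0, 30, 10],
    "Wan2.2": [15, 41, 12, 20, 10, 24, 10, 0, 10, 0],
    "CogVideoX": [3, 28, 8, 16, 0, 2, 10, 10, 0, 50],
}

# Which columns belong to which platform.
PLATFORM_COLUMNS = {
    "RoboTwin 2.0": slice(0, 4),
    "LIBERO": slice(4, 6),
    "Real-World": slice(6, 10),
}


def platform_means(results):
    """Unweighted mean success rate per model and platform (an assumption)."""
    means = {}
    for model, values in results.items():
        means[model] = {
            platform: sum(values[cols]) / len(values[cols])
            for platform, cols in PLATFORM_COLUMNS.items()
        }
    return means


means = platform_means(RESULTS)
for model, per_platform in means.items():
    print(model, {p: round(m, 2) for p, m in per_platform.items()})
```

Because the averages mix data-engine and action-planner settings, they are best read as a rough indicator of the sim-to-real gap rather than as a ranking.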

Real World Demos

Pour Water: Success, Failure Case 1, Failure Case 2

Wipe Table: Success, Failure Case 1, Failure Case 2