xlang-ai/CUA-Gym

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

📄 Paper | 📝 Blog | 🤗 Dataset | 🔎 Data Viewer | 🤖 Models (coming soon) | 🧩 CUA-Gym-Hub

CUA-Gym is a scalable pipeline for synthesizing verifiable RLVR training data for computer-use agents (CUAs). Given a topic, it jointly produces task instructions, environment states, and reward functions as verified triples — using coding agents to handle the engineering work previously requiring human experts.

[Figure: CUA-Gym pipeline]

📣 Updates

  • 2026-05-21: We release the full pipeline and dataset of CUA-Gym 🔥🔥🔥 (models coming soon)
  • We will gradually release the full dataset — the remaining data is currently undergoing administrative review.

About

Training computer-use agents with reinforcement learning requires a consistent triple of (task instruction, executable environment, verifiable reward). Hand-authoring even one such triple takes hours; CUA-Gym automates this at scale.

Pipeline. Three coordinated agents run per task:

  • Generator (setup-gen): constructs the initial and golden environment states (initial_setup.py, golden_patch.py)
  • Discriminator (reward-gen): writes reward.py from the task description alone, without access to the Generator's code (information barrier)
  • Orchestrator: drives the two through iterative rounds until reward(golden)=1.0 and reward(initial)=0.0 both hold under execution
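
The round structure described above can be pictured with a minimal sketch. It is not the repository's orchestrator: `co_generate`, `make_states`, `make_reward`, and the dict-based state are hypothetical stand-ins, and only the stopping condition (reward(golden)=1.0 and reward(initial)=0.0) comes from the description.

```python
from typing import Callable, Optional, Tuple

# Hypothetical types: a state is a plain dict, and a reward function
# maps a state to a score between 0.0 and 1.0.
State = dict
RewardFn = Callable[[State], float]


def co_generate(
    make_states: Callable[[str], Tuple[State, State]],
    make_reward: Callable[[str, str], RewardFn],
    task: str,
    max_rounds: int = 5,
) -> Optional[Tuple[State, State, RewardFn]]:
    """Loop until reward(golden) == 1.0 and reward(initial) == 0.0, or give up."""
    feedback = ""
    for _ in range(max_rounds):
        # Generator side: build the initial and golden states.
        initial, golden = make_states(feedback)
        # Discriminator side: sees the task text and score feedback only,
        # never the Generator's code or states.
        reward = make_reward(task, feedback)
        r_initial, r_golden = reward(initial), reward(golden)
        if r_golden == 1.0 and r_initial == 0.0:
            return initial, golden, reward
        feedback = f"reward(initial)={r_initial}, reward(golden)={r_golden}"
    return None
```

Returning `None` after `max_rounds` models a task that never converges; the round limit itself is an assumption of this sketch.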

Filtering. Verified tuples pass through an LLM majority-vote filter (filter/majority_vote_filter.py) that rejects tasks where the reward is fragile, ambiguous, or inconsistent. Teacher rollouts provide a second filter stage.
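
The exact voting rule in filter/majority_vote_filter.py is not described here. As a rough illustration only, a strict-majority rule over hypothetical "accept"/"reject" judge labels could look like this:

```python
from collections import Counter
from typing import Iterable


def majority_accepts(votes: Iterable[str]) -> bool:
    """Return True when strictly more than half of the votes are "accept".

    The "accept"/"reject" labels are assumptions made for this sketch.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    return total > 0 and counts["accept"] * 2 > total


print(majority_accepts(["accept", "accept", "reject"]))  # True
print(majority_accepts(["accept", "reject", "reject"]))  # False
```

An odd number of votes, such as the 3 used in the filter command further down, avoids ties under this rule.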

Environments. CUA-Gym covers 110 environments: 16 desktop applications and 94 synthesized mock web applications grounded in real-world software-use distributions.

Dataset. The resulting CUA-Gym dataset contains 32,112 verified RLVR training tuples.

Comparison with existing CUA RLVR datasets:

| Dataset | Platform | Data size | Env. size | Reward | Open |
|---|---|---|---|---|---|
| GUI-Genesis | Mobile | 969 | 1 | Programmatic | No |
| WebArena-Infinity | Web | 1,260 | 10 | Programmatic | Yes |
| InfiniteWeb | Web | 600 | - | Programmatic | No★ |
| UltraCUA | Desktop | 17,000 | 9 | Programmatic | No★ |
| Gym-Anything | Desktop | 7,277 | 193 | VLM | Yes |
| CUA-Gym | Desktop + Web | 32,122 | 110 | Programmatic | Yes |

★ partial release.

Getting Started

Install

git clone https://github.com/xlang-ai/CUA-Gym
cd CUA-Gym
pip install -e ".[dev]"
cp .env.example .env  # fill in OPENAI_API_KEY and ALIYUN_* credentials

Generate tasks for a domain

Invoke the task-gen agent from the CUA-Gym directory in Claude Code:

Generate 50 LibreOffice Calc tasks covering formatting and formula operations.

Output: output/task_generation/<topic>.json

Run the adversarial co-generation loop

python scripts/batch_orchestrator.py output/task_generation/calc_formatting.json

Verified tuples land in output/final/<task_id>/.

Run the majority-vote filter

export OPENAI_API_KEY=sk-...
python filter/majority_vote_filter.py \
  --tasks-dir output/final \
  --votes 3 \
  --model gpt-4o \
  --write

Download the pre-built dataset

huggingface-cli download xlangai/CUA-Gym --repo-type dataset --local-dir data/

CUA-Gym-Hub

CUA-Gym-Hub is the environment layer of CUA-Gym: a suite of self-contained mock web applications designed for scalable RL training. Each environment looks and behaves like a realistic web product, while exposing a unified state API for deterministic reset, inspection, mutation, and reward verification.

[Figure: CUA-Gym supported environments]

[Figure: CUA-Gym-Hub environment pipeline]

CUA-Gym-Hub is built by a multi-agent environment synthesis pipeline. Given a target application seed, the system drafts the product specification, implements the mock web app, exercises the UI with Playwright, and iterates until the live interface and API protocol match the specification.

Two design choices make each mock usable as an RL training environment: (1) state injection — a task ships its own JSON initial state alongside its reward.py, so a single mock can host arbitrarily many distinct task worlds with no code change; and (2) session isolation — every URL carries a session id, so parallel RL workers training on the same mock never see one another's mutations. See hub/README.md for the full design rationale and HTTP state API.
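
To make the two design choices concrete, here is a toy in-memory model, not the Hub's implementation. The class and method names are hypothetical; only the ideas of injecting a task's JSON state per session and keeping sessions separate come from the text above.

```python
import copy
from typing import Any, Dict


class MockSessionStore:
    """Toy model of per-session state for a single mock application."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Dict[str, dict]] = {}

    def inject(self, sid: str, initial_state: dict) -> None:
        # State injection: the task supplies its own state, and each session
        # keeps private deep copies of it.
        self._sessions[sid] = {
            "initial_state": copy.deepcopy(initial_state),
            "current_state": copy.deepcopy(initial_state),
        }

    def mutate(self, sid: str, key: str, value: Any) -> None:
        # Session isolation: only the addressed session is changed.
        self._sessions[sid]["current_state"][key] = value

    def diff(self, sid: str) -> dict:
        # Keys whose current value differs from the injected initial state.
        session = self._sessions[sid]
        initial, current = session["initial_state"], session["current_state"]
        return {k: v for k, v in current.items() if k not in initial or initial[k] != v}


store = MockSessionStore()
store.inject("worker_a", {"title": "Draft"})
store.inject("worker_b", {"title": "Draft"})
store.mutate("worker_a", "title", "Final")
print(store.diff("worker_a"))  # {'title': 'Final'}
print(store.diff("worker_b"))  # {}
```

Two workers share one mock yet see independent worlds because every operation is keyed by the session id.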

What CUA-Gym-Hub provides:

  • Realistic mock applications: browser environments spanning productivity, communication, development, commerce, finance, analytics, and media workflows.
  • Unified state API: every mock supports programmatic state injection, reset, retrieval, and diffing through a consistent HTTP interface.
  • Verifiable rewards: task-specific reward functions can inspect environment state directly instead of relying on screenshots or manual labels.
  • Drop-in task generation: generated apps plug into the CUA-Gym task synthesis pipeline as reproducible training environments.

Run a mock app locally

cd hub/websites/notion_mock
npm install
npm run dev          # http://localhost:5173

Inspect environment state

curl "http://localhost:5173/go?sid=task_001"
# → {"initial_state": {...}, "current_state": {...}, "state_diff": {...}}

Every mock supports the same session-scoped state API (/go, /post, /state, /upload, /files/...). For production-style deployment using npm run preview + a reverse proxy, see hub/DEPLOY.md. For the full environment list, schema contract, and app-specific notes, see hub/README.md.
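
The same state lookup can be done from Python with the standard library. This is a sketch that assumes /go returns the JSON object shown in the curl example; `state_url` and `fetch_state` are hypothetical helper names, not part of the repository.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def state_url(base_url: str, sid: str) -> str:
    """Build the session-scoped /go URL for a mock running at base_url."""
    return f"{base_url.rstrip('/')}/go?{urlencode({'sid': sid})}"


def fetch_state(base_url: str, sid: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the state payload (requires a running mock app)."""
    with urlopen(state_url(base_url, sid), timeout=timeout) as response:
        return json.load(response)


print(state_url("http://localhost:5173", "task_001"))
# http://localhost:5173/go?sid=task_001
```

With a mock running locally, `fetch_state("http://localhost:5173", "task_001")["state_diff"]` would give a reward function the changes made during the session, assuming the payload keys match the curl output above.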

CUA-Gym Datasets

CUA-Gym releases executable RLVR task bundles for computer-use agents. Each row in the Hugging Face Dataset Viewer is a task-level index entry: it contains the natural-language instruction, environment metadata, setup references, and reward-function reference needed to reconstruct the original task bundle.

Important Notice. Some web task setup and reward files require CUA-Gym-Hub endpoints. The public dataset stores these endpoints as placeholders such as __CUA_GYM_GMAIL_URL__, not as hard-coded hosted URLs. For reliable use, deploy the corresponding CUA-Gym-Hub apps yourself, set the CUA_GYM_*_URL variables in url_variables.json, and materialize the task files before running setup or reward code. The release-hosted xlang.ai endpoints are for reference and smoke tests, not for large-scale downstream experiments.

👉 CUA-Gym Hugging Face Dataset

Install the standard Hugging Face dataset tooling:

pip install -U datasets huggingface_hub

Load the task index directly in Python:

from datasets import load_dataset

tasks = load_dataset("xlangai/CUA-Gym", "tasks", split="train")
example = tasks[0]

print(example["instruction"])
print(example["app_type"], example["platform"], example["setup_kind"])

Or download the full dataset repository locally:

huggingface-cli download xlangai/CUA-Gym \
  --repo-type dataset \
  --local-dir ./CUA-Gym-data

If you plan to execute web tasks, extract the raw bundles and replace endpoint placeholders with your own deployment URLs:

mkdir -p ./cua_gym_tasks
tar --zstd -xf ./CUA-Gym-data/artifacts/cua_gym_tasks_v1.tar.zst -C ./cua_gym_tasks

cat > .env.cua-gym <<'EOF'
CUA_GYM_GMAIL_URL=https://your-gmail-mock.example.com
CUA_GYM_SLACK_URL=https://your-slack-mock.example.com
CUA_GYM_NOTION_URL=https://your-notion-mock.example.com
EOF

python scripts/materialize_dataset_urls.py ./cua_gym_tasks \
  --manifest ./CUA-Gym-data/url_variables.json \
  --env-file .env.cua-gym
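
Conceptually, materialization is a placeholder substitution. The sketch below is not scripts/materialize_dataset_urls.py. It assumes every placeholder has the form `__CUA_GYM_<NAME>_URL__` and maps to a variable named `CUA_GYM_<NAME>_URL`, as in the Gmail example above; `materialize` is a hypothetical helper.

```python
import re
from typing import Mapping

# Matches placeholders such as __CUA_GYM_GMAIL_URL__ and captures the
# variable name without the surrounding double underscores.
PLACEHOLDER = re.compile(r"__(CUA_GYM_[A-Z0-9_]+?_URL)__")


def materialize(text: str, urls: Mapping[str, str]) -> str:
    """Replace each placeholder with its URL; fail loudly if one is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in urls:
            raise KeyError(f"no URL configured for {name}")
        return urls[name]

    return PLACEHOLDER.sub(substitute, text)


source = 'BASE = "__CUA_GYM_GMAIL_URL__/api"'
print(materialize(source, {"CUA_GYM_GMAIL_URL": "https://your-gmail-mock.example.com"}))
# BASE = "https://your-gmail-mock.example.com/api"
```

Raising on an unknown placeholder keeps a half-materialized setup or reward file from silently pointing at nothing.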

The dataset is organized around one viewer-friendly table plus executable artifacts:

data/
  tasks.parquet
artifacts/
  cua_gym_tasks_v1.tar.zst
url_variables.json
scripts/
  materialize_dataset_urls.py

Each task bundle contains:

<task_id>/
  task.json
  reward.py
  initial_setup.py | initial_setup.sh | initial_setup.xlsx | initial_setup.docx | initial_setup.pptx

To execute a task, extract the artifact archive, read <task_id>/task.json, run the listed setup steps in the target environment, let the agent interact with the environment, and finally run <task_id>/reward.py to compute the programmatic score.
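
Before running anything, it can help to check that an extracted bundle has the files listed above. The helper below is a hypothetical sketch based only on that listing. It does not parse task.json or call reward.py, since their schema and interface are not described here.

```python
from pathlib import Path

# Setup file extensions taken from the bundle layout above.
SETUP_SUFFIXES = (".py", ".sh", ".xlsx", ".docx", ".pptx")


def describe_bundle(bundle_dir: str) -> dict:
    """Locate task.json, reward.py, and the first initial_setup.* file found."""
    root = Path(bundle_dir)
    missing = [name for name in ("task.json", "reward.py") if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"{root} is missing: {', '.join(missing)}")
    candidates = [root / f"initial_setup{suffix}" for suffix in SETUP_SUFFIXES]
    setup = next((path for path in candidates if path.is_file()), None)
    return {"task": root / "task.json", "reward": root / "reward.py", "setup": setup}
```

A call such as `describe_bundle("./cua_gym_tasks/<task_id>")` returns the three paths, with `setup` set to `None` when no initial_setup file is present.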

Results

CUA-Gym improves computer-use agents through verifiable RL training over both desktop and web environments. We evaluate trained models on OSWorld-Verified and WebArena, covering realistic multi-step software and browser tasks. CUA-Gym models deliver strong gains over their base models, with the A17B model setting a new open-source state-of-the-art on both benchmarks.

| Model | OSWorld-Verified | WebArena |
|---|---|---|
| Claude Sonnet 4.6 | 72.9 | 65.6 |
| Claude Opus 4.7 | 78.0 | - |
| GPT-5.5 | 78.7 | - |
| EvoCUA-8B | 46.1 | - |
| EvoCUA-32B | 56.7 | - |
| Kimi-K2.6 | 73.1 | - |
| Qwen3.5-35B-A3B (base) | 54.5 | 40.8 |
| Qwen3.5-397B-A17B (base) | 62.2 | 54.0 |
| CUA-Gym-A3B | 62.1 | 44.5 |
| CUA-Gym-A17B | 72.6 | 56.0 |

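Both models set the state of the art among open-source CUAs at their respective scales. On OSWorld-Verified, CUA-Gym-A3B (62.1) matches the much larger Qwen3.5-397B-A17B base (62.2) with roughly 11× fewer total parameters (35B vs. 397B) and about 6× fewer active parameters (3B vs. 17B).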
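
The gains over each base model follow directly from the scores reported above. The short calculation below only restates that arithmetic; the dictionary layout and the `gain` helper are illustrative.

```python
# (OSWorld-Verified, WebArena) scores copied from the results above.
scores = {
    "Qwen3.5-35B-A3B (base)": (54.5, 40.8),
    "CUA-Gym-A3B": (62.1, 44.5),
    "Qwen3.5-397B-A17B (base)": (62.2, 54.0),
    "CUA-Gym-A17B": (72.6, 56.0),
}


def gain(trained: str, base: str) -> tuple:
    """Absolute improvement in points on each benchmark, rounded to 0.1."""
    return tuple(round(t - b, 1) for t, b in zip(scores[trained], scores[base]))


print(gain("CUA-Gym-A3B", "Qwen3.5-35B-A3B (base)"))     # (7.6, 3.7)
print(gain("CUA-Gym-A17B", "Qwen3.5-397B-A17B (base)"))  # (10.4, 2.0)
```

The improvement is larger on OSWorld-Verified (+7.6 and +10.4 points) than on WebArena (+3.7 and +2.0 points).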

Citation

@misc{wang2026cuagymscalingverifiabletraining,
      title={CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents},
      author={Bowen Wang and Dunjie Lu and Junli Wang and Tianyi Bai and Shixuan Liu and Zhipeng Zhang and Haiquan Wang and Hao Hu and Tianbao Xie and Shuai Bai and Dayiheng Liu and Que Shen and Junyang Lin and Tao Yu},
      year={2026},
      eprint={2605.25624},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.25624},
}

Research and Commercial Use

CUA-Gym may be used for research, educational, and commercial purposes under the following licenses:

Citation and Acknowledgement

If you use CUA-Gym, including its code, tools, CUA-Gym-Hub environments, dataset, models, or generated task artifacts, in any report, technical report, publication, thesis, presentation, blog post, documentation, or other publicly shared material, please include an explicit acknowledgement and cite the CUA-Gym paper.

Prohibited Uses

  • CUA-Gym may not be used for any purpose or activity that violates applicable laws or regulations in any jurisdiction.
  • Use for illegal, unethical, deceptive, privacy-invasive, or harmful activities is strictly prohibited.
  • Users may not use CUA-Gym to target real third-party services, accounts, credentials, or production systems without authorization.

Disclaimer

  • The authors, contributors, and copyright holders are not responsible for any illegal, unethical, or harmful use of CUA-Gym, nor for any direct or indirect damages resulting from such use.
  • The released tasks and mock environments are intended for controlled research and evaluation. Users are solely responsible for deploying, sandboxing, and operating them safely.
  • Use of the "CUA-Gym" name, logo, or trademarks does not imply endorsement or affiliation unless separate written permission is obtained.
