📄 Paper | 📝 Blog | 🤗 Dataset | 🔎 Data Viewer | 🤖 Models (coming soon) | 🧩 CUA-Gym-Hub
CUA-Gym is a scalable pipeline for synthesizing verifiable RLVR training data for computer-use agents (CUAs). Given a topic, it jointly produces task instructions, environment states, and reward functions as verified triples — using coding agents to handle the engineering work previously requiring human experts.
- 2026-05-21: We release the full pipeline and dataset of CUA-Gym 🔥🔥🔥 (models coming soon)
- We will gradually release the full dataset — the remaining data is currently undergoing administrative review.
Training computer-use agents with reinforcement learning requires a consistent triple of (task instruction, executable environment, verifiable reward). Hand-authoring even one such triple takes hours; CUA-Gym automates this at scale.
Pipeline. Three coordinated agents run per task:
- Generator (setup-gen): constructs the initial and golden environment states (initial_setup.py, golden_patch.py)
- Discriminator (reward-gen): writes reward.py from the task description alone, without access to the Generator's code (information barrier)
- Orchestrator: drives the two through iterative rounds until reward(golden) = 1.0 and reward(initial) = 0.0 both hold under execution
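The acceptance condition enforced by the Orchestrator can be sketched as a short loop. This is a minimal illustration, not the repository's implementation: `VerifiedTriple`, `is_verified`, `co_generate`, and the two `propose_*` callables are hypothetical names, environment states are modeled as plain dicts, and the feedback passed between rounds is reduced to a round index.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical type: a reward function maps an environment state to a score.
RewardFn = Callable[[dict], float]


@dataclass
class VerifiedTriple:
    instruction: str
    initial_state: dict
    golden_state: dict
    reward_fn: RewardFn


def is_verified(reward_fn: RewardFn, initial_state: dict, golden_state: dict) -> bool:
    """Acceptance check from the text: golden scores 1.0 and initial scores 0.0."""
    return reward_fn(golden_state) == 1.0 and reward_fn(initial_state) == 0.0


def co_generate(
    instruction: str,
    propose_states: Callable[[str, int], Tuple[dict, dict]],
    propose_reward: Callable[[str, int], RewardFn],
    max_rounds: int = 5,
) -> Optional[VerifiedTriple]:
    """Iterate until both reward conditions hold, or give up after max_rounds."""
    for round_idx in range(max_rounds):
        # Generator side: builds the initial and golden states.
        initial_state, golden_state = propose_states(instruction, round_idx)
        # Discriminator side: sees only the instruction (information barrier),
        # never the Generator's code or states.
        reward_fn = propose_reward(instruction, round_idx)
        if is_verified(reward_fn, initial_state, golden_state):
            return VerifiedTriple(instruction, initial_state, golden_state, reward_fn)
    return None
```

Keeping the two proposers as separate callables that share only the instruction mirrors the information barrier: a reward that passes cannot have been tuned against the Generator's setup code.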
Filtering. Verified tuples pass through an LLM majority-vote filter (filter/majority_vote_filter.py) that rejects tasks where the reward is fragile, ambiguous, or inconsistent. Teacher rollouts provide a second filter stage.
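The voting rule behind such a filter can be sketched as follows. This is an assumption-laden illustration and not the contents of filter/majority_vote_filter.py: `majority_vote`, `keep_task`, the `"accept"`/`"reject"` labels, and the `judge` callable (a stand-in for one LLM call) are all hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(votes: Sequence[str]) -> str:
    """Return the label with a strict majority, or 'reject' when there is none."""
    if not votes:
        return "reject"
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "reject"


def keep_task(task: dict, judge: Callable[[dict], str], n_votes: int = 3) -> bool:
    """Query the (hypothetical) judge n_votes times and keep the task on majority accept."""
    votes = [judge(task) for _ in range(n_votes)]
    return majority_vote(votes) == "accept"
```

Using an odd number of votes, such as the 3 passed via `--votes` in the Quick Start command, avoids ties between two labels.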
Environments. CUA-Gym covers 110 environments: 16 desktop applications and 94 synthesized mock web applications grounded in real-world software-use distributions.
Dataset. The resulting CUA-Gym dataset contains 32,112 verified RLVR training tuples.
Comparison with existing CUA RLVR datasets:
| Dataset | Platform | Data size | Env. size | Reward | Open |
|---|---|---|---|---|---|
| GUI-Genesis | Mobile | 969 | 1 | Programmatic | No |
| WebArena-Infinity | Web | 1,260 | 10 | Programmatic | Yes |
| InfiniteWeb | Web | 600 | — | Programmatic | No★ |
| UltraCUA | Desktop | 17,000 | 9 | Programmatic | No★ |
| Gym-Anything | Desktop | 7,277 | 193 | VLM | Yes |
| CUA-Gym | Desktop + Web | 32,122 | 110 | Programmatic | Yes |
★ partial release.
Install
git clone https://github.com/xlang-ai/CUA-Gym
cd CUA-Gym
pip install -e ".[dev]"
cp .env.example .env  # fill in OPENAI_API_KEY and ALIYUN_* credentials
Generate tasks for a domain
Invoke the task-gen agent from the CUA-Gym directory in Claude Code:
Generate 50 LibreOffice Calc tasks covering formatting and formula operations.
Output: output/task_generation/<topic>.json
Run the adversarial co-generation loop
python scripts/batch_orchestrator.py output/task_generation/calc_formatting.json
Verified tuples land in output/final/<task_id>/.
Run the majority-vote filter
export OPENAI_API_KEY=sk-...
python filter/majority_vote_filter.py \
--tasks-dir output/final \
--votes 3 \
--model gpt-4o \
--write
Download the pre-built dataset
huggingface-cli download xlangai/CUA-Gym --repo-type dataset --local-dir data/
CUA-Gym-Hub is the environment layer of CUA-Gym: a suite of self-contained mock web applications designed for scalable RL training. Each environment looks and behaves like a realistic web product, while exposing a unified state API for deterministic reset, inspection, mutation, and reward verification.
CUA-Gym-Hub is built by a multi-agent environment synthesis pipeline. Given a target application seed, the system drafts the product specification, implements the mock web app, exercises the UI with Playwright, and iterates until the live interface and API protocol match the specification.
Two design choices make each mock usable as an RL training environment: (1) state injection — a task ships its own JSON initial state alongside its reward.py, so a single mock can host arbitrarily many distinct task worlds with no code change; and (2) session isolation — every URL carries a session id, so parallel RL workers training on the same mock never see one another's mutations. See hub/README.md for the full design rationale and HTTP state API.
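The two design choices can be illustrated with a small in-memory model. This is only a sketch of the idea in Python, not how the mock apps are implemented: `SessionStore` and its methods are hypothetical, and state is modeled as a flat dict.

```python
import copy


class SessionStore:
    """Toy model of state injection plus per-session isolation."""

    def __init__(self, default_state: dict):
        self._default = default_state
        self._sessions = {}

    def inject(self, sid: str, initial_state: dict) -> None:
        # State injection: a task supplies its own initial world for this session.
        self._sessions[sid] = {
            "initial": copy.deepcopy(initial_state),
            "current": copy.deepcopy(initial_state),
        }

    def _get(self, sid: str) -> dict:
        # Unknown sessions start from the app's default state.
        if sid not in self._sessions:
            self.inject(sid, self._default)
        return self._sessions[sid]

    def mutate(self, sid: str, key: str, value) -> None:
        self._get(sid)["current"][key] = value

    def reset(self, sid: str) -> None:
        session = self._get(sid)
        session["current"] = copy.deepcopy(session["initial"])

    def current(self, sid: str) -> dict:
        # Return a copy so callers cannot mutate another worker's view.
        return copy.deepcopy(self._get(sid)["current"])
```

Because every read and write is keyed by the session id, two workers using different ids never observe each other's changes, and a reset restores exactly the state that session was injected with.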
What CUA-Gym-Hub provides:
- Realistic mock applications: browser environments spanning productivity, communication, development, commerce, finance, analytics, and media workflows.
- Unified state API: every mock supports programmatic state injection, reset, retrieval, and diffing through a consistent HTTP interface.
- Verifiable rewards: task-specific reward functions can inspect environment state directly instead of relying on screenshots or manual labels.
- Drop-in task generation: generated apps plug into the CUA-Gym task synthesis pipeline as reproducible training environments.
Run a mock app locally
cd hub/websites/notion_mock
npm install
npm run dev  # http://localhost:5173
Inspect environment state
curl "http://localhost:5173/go?sid=task_001"
# → {"initial_state": {...}, "current_state": {...}, "state_diff": {...}}
Every mock supports the same session-scoped state API (/go, /post, /state, /upload, /files/...). For production-style deployment using npm run preview + a reverse proxy, see hub/DEPLOY.md. For the full environment list, schema contract, and app-specific notes, see hub/README.md.
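The same inspection can be done from Python with only the standard library. This is a hedged sketch: `state_url`, `fetch_state`, and `changed_keys` are hypothetical helper names, and the assumption that `state_diff` is a JSON object keyed by changed field is not confirmed here; see hub/README.md for the actual schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def state_url(base_url: str, sid: str, endpoint: str = "/go") -> str:
    """Build a session-scoped URL such as http://localhost:5173/go?sid=task_001."""
    return f"{base_url.rstrip('/')}{endpoint}?{urlencode({'sid': sid})}"


def fetch_state(base_url: str, sid: str, timeout: float = 10.0) -> dict:
    """GET the state payload for one session (requires a running mock app)."""
    with urlopen(state_url(base_url, sid), timeout=timeout) as resp:
        return json.load(resp)


def changed_keys(payload: dict) -> list:
    """List top-level keys of state_diff, assuming it is a JSON object."""
    return sorted(payload.get("state_diff", {}))
```

With a mock running locally, `changed_keys(fetch_state("http://localhost:5173", "task_001"))` would list what the agent has modified so far in that session.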
CUA-Gym releases executable RLVR task bundles for computer-use agents. Each row in the Hugging Face Dataset Viewer is a task-level index entry: it contains the natural-language instruction, environment metadata, setup references, and reward-function reference needed to reconstruct the original task bundle.
Important Notice. Some web task setup and reward files require CUA-Gym-Hub endpoints. The public dataset stores these endpoints as placeholders such as __CUA_GYM_GMAIL_URL__, not as hard-coded hosted URLs. For reliable use, deploy the corresponding CUA-Gym-Hub apps yourself, set the CUA_GYM_*_URL variables in url_variables.json, and materialize the task files before running setup or reward code. The release-hosted xlang.ai endpoints are for reference and smoke tests, not for large-scale downstream experiments.
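Conceptually, materialization is a placeholder substitution over each task file. The sketch below illustrates that idea only; it is not the released scripts/materialize_dataset_urls.py, and the `materialize` function and the regular expression (inferred from the single documented placeholder `__CUA_GYM_GMAIL_URL__`) are assumptions.

```python
import re
from typing import Dict, List, Tuple

# Assumed placeholder shape: __CUA_GYM_<APP>_URL__ (inferred from one example).
PLACEHOLDER = re.compile(r"__(CUA_GYM_[A-Z0-9_]+?_URL)__")


def materialize(text: str, urls: Dict[str, str]) -> Tuple[str, List[str]]:
    """Replace known placeholders with deployment URLs; report unresolved names."""
    missing = set()

    def _substitute(match):
        name = match.group(1)
        if name in urls:
            return urls[name]
        missing.add(name)
        return match.group(0)  # leave unknown placeholders untouched

    return PLACEHOLDER.sub(_substitute, text), sorted(missing)
```

Reporting unresolved names, rather than failing silently, makes it easy to see which CUA-Gym-Hub apps still need a deployment URL before any setup or reward code runs.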
👉 CUA-Gym Hugging Face Dataset
Install the standard Hugging Face dataset tooling:
pip install -U datasets huggingface_hub
Load the task index directly in Python:
from datasets import load_dataset
tasks = load_dataset("xlangai/CUA-Gym", "tasks", split="train")
example = tasks[0]
print(example["instruction"])
print(example["app_type"], example["platform"], example["setup_kind"])
Or download the full dataset repository locally:
huggingface-cli download xlangai/CUA-Gym \
--repo-type dataset \
--local-dir ./CUA-Gym-data
If you plan to execute web tasks, extract the raw bundles and replace endpoint placeholders with your own deployment URLs:
mkdir -p ./cua_gym_tasks
tar --zstd -xf ./CUA-Gym-data/artifacts/cua_gym_tasks_v1.tar.zst -C ./cua_gym_tasks
cat > .env.cua-gym <<'EOF'
CUA_GYM_GMAIL_URL=https://your-gmail-mock.example.com
CUA_GYM_SLACK_URL=https://your-slack-mock.example.com
CUA_GYM_NOTION_URL=https://your-notion-mock.example.com
EOF
python scripts/materialize_dataset_urls.py ./cua_gym_tasks \
--manifest ./CUA-Gym-data/url_variables.json \
--env-file .env.cua-gym
The dataset is organized around one viewer-friendly table plus executable artifacts:
data/
  tasks.parquet
artifacts/
  cua_gym_tasks_v1.tar.zst
url_variables.json
scripts/
  materialize_dataset_urls.py
Each task bundle contains:
<task_id>/
  task.json
  reward.py
  initial_setup.py | initial_setup.sh | initial_setup.xlsx | initial_setup.docx | initial_setup.pptx
To execute a task, extract the artifact archive, read <task_id>/task.json, run the listed setup steps in the target environment, let the agent interact with the environment, and finally run <task_id>/reward.py to compute the programmatic score.
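Before running setup or reward code, it can help to check that an extracted bundle has the files listed above. The helper below is a hypothetical convenience, not part of the release; `describe_bundle` and the returned dict keys are illustrative names, and only the file names come from the bundle layout described here.

```python
from pathlib import Path

# Setup file variants listed in the bundle layout; a bundle ships one of them.
SETUP_SUFFIXES = (".py", ".sh", ".xlsx", ".docx", ".pptx")


def describe_bundle(bundle_dir) -> dict:
    """Locate task.json, reward.py, and the setup file inside one <task_id>/ directory."""
    root = Path(bundle_dir)
    candidates = [root / f"initial_setup{suffix}" for suffix in SETUP_SUFFIXES]
    setup = next((path for path in candidates if path.is_file()), None)

    missing = [name for name in ("task.json", "reward.py") if not (root / name).is_file()]
    if setup is None:
        missing.append("initial_setup.*")

    return {
        "task": root / "task.json",
        "reward": root / "reward.py",
        "setup": setup,
        "missing": missing,
    }
```

A bundle whose `missing` list is non-empty is best skipped or re-extracted rather than executed.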
CUA-Gym improves computer-use agents through verifiable RL training over both desktop and web environments. We evaluate trained models on OSWorld-Verified and WebArena, covering realistic multi-step software and browser tasks. CUA-Gym models deliver strong gains over their base models, with the A17B model setting a new open-source state-of-the-art on both benchmarks.
| Model | OSWorld-Verified | WebArena |
|---|---|---|
| Claude Sonnet 4.6 | 72.9 | 65.6 |
| Claude Opus 4.7 | 78.0 | — |
| GPT-5.5 | 78.7 | — |
| EvoCUA-8B | 46.1 | — |
| EvoCUA-32B | 56.7 | — |
| Kimi-K2.6 | 73.1 | — |
| Qwen3.5-35B-A3B (base) | 54.5 | 40.8 |
| Qwen3.5-397B-A17B (base) | 62.2 | 54.0 |
| CUA-Gym-A3B | 62.1 | 44.5 |
| CUA-Gym-A17B | 72.6 | 56.0 |
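Both models set the state of the art among open-source CUAs at their respective scales. On OSWorld-Verified, CUA-Gym-A3B matches the much larger A17B base (62.1 vs. 62.2) with roughly 6× fewer active parameters (3B vs. 17B) and about 11× fewer total parameters (35B vs. 397B).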
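The gains over each base model follow directly from the table. The snippet below simply recomputes them from the reported scores; the `SCORES` dict and `gain` helper are illustrative names.

```python
# (OSWorld-Verified, WebArena) scores copied from the results table.
SCORES = {
    "Qwen3.5-35B-A3B (base)": (54.5, 40.8),
    "CUA-Gym-A3B": (62.1, 44.5),
    "Qwen3.5-397B-A17B (base)": (62.2, 54.0),
    "CUA-Gym-A17B": (72.6, 56.0),
}


def gain(trained: str, base: str) -> tuple:
    """Absolute improvement in points on (OSWorld-Verified, WebArena)."""
    return tuple(round(t - b, 1) for t, b in zip(SCORES[trained], SCORES[base]))


print(gain("CUA-Gym-A3B", "Qwen3.5-35B-A3B (base)"))     # (7.6, 3.7)
print(gain("CUA-Gym-A17B", "Qwen3.5-397B-A17B (base)"))  # (10.4, 2.0)
```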
@misc{wang2026cuagymscalingverifiabletraining,
title={CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents},
author={Bowen Wang and Dunjie Lu and Junli Wang and Tianyi Bai and Shixuan Liu and Zhipeng Zhang and Haiquan Wang and Hao Hu and Tianbao Xie and Shuai Bai and Dayiheng Liu and Que Shen and Junyang Lin and Tao Yu},
year={2026},
eprint={2605.25624},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.25624},
}
CUA-Gym may be used for research, educational, and commercial purposes under the following licenses:
- Code, tools, and pipeline: Apache License 2.0
- Dataset: Creative Commons Attribution 4.0 International (CC BY 4.0)
If you use CUA-Gym, including its code, tools, CUA-Gym-Hub environments, dataset, models, or generated task artifacts, in any report, technical report, publication, thesis, presentation, blog post, documentation, or other publicly shared material, please include an explicit acknowledgement and cite the CUA-Gym paper.
- CUA-Gym may not be used for any purpose or activity that violates applicable laws or regulations in any jurisdiction.
- Use for illegal, unethical, deceptive, privacy-invasive, or harmful activities is strictly prohibited.
- Users may not use CUA-Gym to target real third-party services, accounts, credentials, or production systems without authorization.
- The authors, contributors, and copyright holders are not responsible for any illegal, unethical, or harmful use of CUA-Gym, nor for any direct or indirect damages resulting from such use.
- The released tasks and mock environments are intended for controlled research and evaluation. Users are solely responsible for deploying, sandboxing, and operating them safely.
- Use of the "CUA-Gym" name, logo, or trademarks does not imply endorsement or affiliation unless separate written permission is obtained.