OpenWebRL: Online Multi-Turn Reinforcement Learning for Visual Web Agents

Rui Yang¹*, Qianhui Wu²*†, Yuxi Chen¹, Hao Bai¹, Wenlin Yao², Hao Cheng²,
Baolin Peng², Huan Zhang¹, Tong Zhang¹, Jianfeng Gao²

¹ University of Illinois Urbana-Champaign, ² Microsoft Research

* Equal contribution. † Project lead.

Links: Paper | Website | Hugging Face | Orchard Env

OpenWebRL is a framework for training visual web agents with online multi-turn reinforcement learning on live websites. The repository builds on the Megatron / SGLang-based slime training stack and adds the browser rollout, reward, data, and evaluation components needed for web-agent RL. For large-scale parallel rollouts, OpenWebRL integrates with Orchard, an open-source sandbox environment that provides network-isolated browser instances at scale. A local-process mode for browser environments is also supported.

The main browser-agent implementation lives in openwebrl/. It supports Playwright-based browser interaction, multi-turn multimodal rollouts, tool-call parsing, textual environment feedback, VLM-as-a-judge rewards, and training/evaluation scripts for Qwen3-VL-style vision-language models.
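
To make the tool-call parsing step concrete, the sketch below extracts a JSON tool call wrapped in `<tool_call>` tags, the convention used by Qwen-style chat templates. The tag format, the `parse_tool_call` name, and the `click` action in the usage note are assumptions for illustration; the repository's own handling is in openwebrl/response_format.py and may differ.

```python
import json
import re
from typing import Any, Dict, Optional

# Assumed wire format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_call(response: str) -> Optional[Dict[str, Any]]:
    """Return the first well-formed tool call in `response`, or None."""
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    # Require a string name and, if present, a dict of arguments.
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return None
    if not isinstance(call.get("arguments", {}), dict):
        return None
    return call
```

A response containing `<tool_call>{"name": "click", "arguments": {"x": 10, "y": 20}}</tool_call>` yields the parsed dictionary, while a response with no tags or with malformed JSON yields `None`, which a rollout driver could treat as a format failure.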

📋 TODO

Support SFT with Qwen3.5
Support RL with Qwen3.5
Release demo

🔭 Method At A Glance

Stage	What it does	Main entry point
Data preparation	Uses curated browser tasks in parquet / JSONL format and can convert benchmark JSONL files to the training parquet schema.	`openwebrl/data/`
Browser rollout	Runs multi-turn browser episodes with screenshots, tool calls, environment feedback, and response-format checks.	`openwebrl/generate_browser.py`
SFT warm start (optional, recommended)	Trains a supervised fine-tuned checkpoint to initialize online RL, via LLaMAFactory on released OpenWebRL trajectories. RL can also start directly from the base VLM. See `sft/README.md`.	`sft/run_sft_with_llamafactory.sh`
Online RL training	Trains a visual web agent with turn-level or trajectory-level browser rollouts and judge-based rewards, starting from the base VLM or, recommended, an SFT warm-start checkpoint. See Training Scripts for available launchers.	`scripts/run_browser_Qwen3VL_4B_Instruct.sh`
Reward / judge	Combines format rewards with VLM-as-a-judge success evaluation through OpenAI-compatible, Azure, or self-hosted endpoints.	`openwebrl/reward_browser.py`
Evaluation	Evaluates converted Hugging Face checkpoints on browser benchmark task files.	`scripts/run_evaluation.sh`, `openwebrl/run_evaluate.py`
Optional judge SFT	Contains utilities for training and evaluating a smaller browser judge model.	`openwebrl/judge/`
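
The Reward / judge row above describes combining a format reward with a VLM-as-a-judge success signal. One simple way to do that is a gated weighted sum, sketched below. The weights, the gating rule, and the names `RewardConfig` and `combine_reward` are hypothetical; the actual logic is in openwebrl/reward_browser.py and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # Hypothetical weights; the values used by OpenWebRL are not stated in this README.
    format_weight: float = 0.1
    success_weight: float = 1.0


def combine_reward(format_ok: bool, judge_success: bool,
                   cfg: RewardConfig = RewardConfig()) -> float:
    """Weighted sum of a format check and a binary judge verdict.

    A malformed response earns nothing, on the assumption that the judge
    verdict is only meaningful when the response could be parsed.
    """
    if not format_ok:
        return 0.0
    return cfg.format_weight + cfg.success_weight * float(judge_success)
```

Under these assumed weights, a well-formed but unsuccessful episode still receives a small positive reward, which keeps some learning signal for response formatting early in training.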

📂 Repository Layout

.
├── README.md
├── requirements.txt
├── set_up.sh
├── setup.py / pyproject.toml
├── train.py
├── slime/                         # Modified RL training framework
├── slime_plugins/                 # Remaining model/plugin extensions
├── tools/                         # Checkpoint conversion helper used by scripts/run_convert_hf.sh
├── docker/                        # Base training Dockerfile and patches
├── scripts/                       # OpenWebRL training/evaluation/conversion entrypoints
│   └── model_configs/             # Model argument presets used by conversion/training
├── sft/                           # SFT warm-start workflow (LLaMAFactory) producing the RL init checkpoint
│   ├── README.md                  # SFT pipeline notes
│   ├── run_sft_with_llamafactory.sh
│   ├── convert_to_openai_messages.py        # Stage 1: trajectories -> canonical OpenAI format
│   ├── prepare_openai_for_llamafactory.py   # Stage 2: canonical -> LLaMAFactory data (official chat template)
│   └── post_process_llamafactory_ckpt.py    # Restore base-model config/tokenizer onto the trained checkpoint for serving/RL
└── openwebrl/
    ├── README.md                  # Detailed browser-environment notes
    ├── generate_browser.py        # Multi-turn browser rollout driver
    ├── reward_browser.py          # Format reward + VLM judge reward
    ├── response_format.py         # Browser-agent response-format handling
    ├── browser_training_config.yaml
    ├── run_evaluate.py
    ├── data/
    │   ├── webgym_filtered_popular_2102_cleaned.parquet
    │   ├── webvoyager_val.parquet
    │   ├── online-mind2web.jsonl
    │   └── convert_benchmark_jsonl_to_parquet.py
    ├── docker/                    # Browser env server Dockerfile / compose / FastAPI server
    ├── env/                       # Playwright browser env and local/sandbox clients
    └── judge/                     # Optional judge-evaluation utilities

📊 Included Data

The release includes small browser datasets under openwebrl/data/:

File	Use
`webgym_filtered_popular_2102_cleaned.parquet`	Default RL training prompts used by the browser training launcher.
`webvoyager_val.parquet`	Default validation prompts.
`online-mind2web.jsonl`	Online-Mind2Web task file for evaluation.
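
To inspect a JSONL task file such as online-mind2web.jsonl before running evaluation, a minimal reader using only the standard library is sketched below. It assumes nothing about the field names in each record; `iter_jsonl` is an illustrative helper, not part of the repository.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Union


def iter_jsonl(path: Union[str, Path]) -> Iterator[Dict[str, Any]]:
    """Yield one JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
```

For example, `tasks = list(iter_jsonl("openwebrl/data/online-mind2web.jsonl"))` loads every task, after which `sorted(tasks[0])` lists the keys of the first record.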

🛠️ Installation

OpenWebRL expects Python 3.10 or newer and a CUDA-capable environment for the full training stack.

# 1. Install Python dependencies, CUDA/SGLang pins, and Playwright Chromium.
bash set_up.sh

# 2. Install this repository in editable mode.
pip install -e .

The browser environment can run in two modes configured in openwebrl/env/config.yaml:

Mode	Description
`local_process`	Starts local `env_server.py` subprocesses on this machine. Useful for debugging and small evaluation runs.
`sandbox`	Uses Orchard Env to create isolated browser pods for large-scale parallel rollout.

We recommend sandbox mode for large-scale rollouts. OpenWebRL integrates with Orchard, an open-source Kubernetes-native sandbox framework that provides per-episode network isolation and scales to hundreds of concurrent browser instances. Isolation significantly reduces the rate at which real websites block agent traffic: in our Online-Mind2Web evaluation without a browser base service, the block rate dropped from 25.7% (local process) to 17.7% (Orchard sandbox). The advantage grows during online RL, where GRPO-style group rollouts repeatedly query the same site within a single training step and make per-site rate limiting a much more severe bottleneck.
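
To put the two block rates in perspective, the short calculation below derives the absolute and relative reduction from the figures quoted above. Only the two percentages come from this README; the variable names are illustrative.

```python
local_block_rate = 25.7    # percent of tasks blocked, local process
sandbox_block_rate = 17.7  # percent of tasks blocked, Orchard sandbox

# Absolute drop in percentage points, and the drop relative to the local baseline.
absolute_drop = local_block_rate - sandbox_block_rate
relative_drop = absolute_drop / local_block_rate * 100

print(f"absolute: {absolute_drop:.1f} points, relative: {relative_drop:.1f}%")
```

That is a drop of 8.0 percentage points, or roughly 31% fewer blocked tasks relative to local-process mode.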

For sandbox mode, build and publish the browser environment image to a registry your cluster can pull from:

docker build -f openwebrl/docker/Dockerfile.browser \
  -t <your-registry>/browser-env:latest .

docker push <your-registry>/browser-env:latest

Then set BROWSER_SANDBOX_IMAGE, SANDBOX_ORCHESTRATOR_URL, and SANDBOX_API_KEY in your environment.
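
Before launching a sandbox run, it can help to confirm that these three variables are actually exported. The sketch below is a standalone check, not part of the repository; `missing_sandbox_vars` is a hypothetical helper name.

```python
import os
from typing import List, Mapping

SANDBOX_VARS = ("BROWSER_SANDBOX_IMAGE", "SANDBOX_ORCHESTRATOR_URL", "SANDBOX_API_KEY")


def missing_sandbox_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the sandbox variables that are unset or blank in `env`."""
    return [name for name in SANDBOX_VARS if not env.get(name, "").strip()]
```

Calling `missing_sandbox_vars()` with no arguments inspects the current process environment; an empty list means all three variables are set.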

🔑 Required Environment

Start from the template:

cp .env.example .env
$EDITOR .env
source .env

Core paths:

Variable	Description
`SLIME_REPO_ROOT`	Absolute path to this repository.
`SLIME_MODEL_ROOT`	Directory containing pretrained and converted model checkpoints.
`SLIME_SAVE_ROOT`	Directory where training writes checkpoints and logs.
`SLIME_OUTPUT_ROOT`	Directory for evaluation outputs and debug traces.
`PYTHONPATH`	Should include this repo and your Megatron-LM checkout.

Browser environment:

Variable	Description
`SLIME_BROWSER_ENV_MODE`	`sandbox` or `local_process`.
`BROWSER_SANDBOX_IMAGE`	Browser container image for sandbox mode.
`SANDBOX_ORCHESTRATOR_URL`	Sandbox orchestrator URL for sandbox mode.
`SANDBOX_API_KEY`	Sandbox orchestrator API key.

Judge / reward:

Variable	Description
`JUDGE_API_MODE`	`token`, `api_key`, or `served`.
`JUDGE_MODEL`	Judge model name, deployment name, or local/self-hosted model id.
`JUDGE_API_BASE`	OpenAI-compatible endpoint for `served` or `api_key` modes.
`JUDGE_API_KEY`	Judge endpoint key if needed.
`OPENAI_API_KEY`	Optional OpenAI-compatible API key.
`AZURE_RESOURCE_NAME`	Optional Azure resource name for token-mode judge calls.
`AZURE_API_VERSION`	Optional Azure API version.
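
The judge variables depend on each other: per the table, `JUDGE_API_BASE` applies to the `served` and `api_key` modes. The sketch below checks that relationship. Treating `JUDGE_MODEL` as always required and `JUDGE_API_BASE` as required in those two modes is an assumption read off the table, and `check_judge_env` is a hypothetical helper; the launcher scripts may validate differently or not at all.

```python
from typing import List, Mapping

VALID_MODES = ("token", "api_key", "served")


def check_judge_env(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the settings look consistent."""
    problems = []
    mode = env.get("JUDGE_API_MODE", "")
    if mode not in VALID_MODES:
        problems.append(f"JUDGE_API_MODE must be one of {VALID_MODES}, got {mode!r}")
    if not env.get("JUDGE_MODEL"):
        problems.append("JUDGE_MODEL is not set")
    # Assumption drawn from the table: an endpoint is needed for these two modes.
    if mode in ("api_key", "served") and not env.get("JUDGE_API_BASE"):
        problems.append(f"JUDGE_API_BASE is not set for mode {mode!r}")
    return problems
```

Passing `os.environ` after `source .env` gives a quick report of what is missing before a long training job starts.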

Experiment tracking:

Variable	Description
`WANDB_API_KEY`	Optional. If unset, W&B logging stays disabled in the launcher scripts.
`WANDB_ENTITY`	Optional W&B entity/team.

🚀 Quick Start

1. Prepare the environment

bash set_up.sh
pip install -e .
cp .env.example .env
$EDITOR .env
source .env

2. Check browser config

Edit openwebrl/env/config.yaml:

mode: sandbox        # or local_process
path_to_task_file: data/online-mind2web.jsonl
use_screenshot: true
use_a11ytree: false

For quick local debugging, set:

mode: local_process

3. Train with online browser RL

By default, the launcher's MODEL_NAME points to a model under SLIME_MODEL_ROOT, and online RL can start directly from a base VLM. For best results, we recommend running an SFT warm start first and initializing RL from the resulting checkpoint. The SFT workflow lives in sft/: it trains an OpenWebRL SFT checkpoint from the released trajectories with LLaMAFactory and post-processes it for reuse in RL (see sft/README.md). To include the SFT stage, produce the checkpoint, place it under SLIME_MODEL_ROOT, and set MODEL_NAME to its path before launching RL.

Training Scripts

Script	Description
`run_browser_Qwen3VL_4B_Instruct.sh`	Main MM-GRPO training launcher for Qwen3-VL-4B. Default entry point for reproducing OpenWebRL-4B.
`run_browser_Qwen3VL_8B_Instruct.sh`	MM-GRPO training launcher for Qwen3-VL-8B.
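
Both launchers use MM-GRPO, whose details are not spelled out in this README. For orientation, standard GRPO scores each rollout relative to the other rollouts sampled for the same task. The sketch below shows that generic group normalization only; it is not the MM-GRPO variant, and the `eps` value and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason a judge signal that separates successes from failures matters for this family of methods.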

The main launcher is:

bash scripts/run_browser_Qwen3VL_4B_Instruct.sh

Before running, set at least:

export SLIME_MODEL_ROOT=<path-to-model-checkpoints>
export SLIME_SAVE_ROOT=<path-for-training-outputs>
export JUDGE_API_MODE=<token|api_key|served>
export JUDGE_MODEL=<judge-model-or-deployment>

The launcher reads the included training data by default:

openwebrl/data/webgym_filtered_popular_2102_cleaned.parquet

4. Evaluate a checkpoint

Convert a Megatron checkpoint to Hugging Face format if needed:

bash scripts/run_convert_hf.sh <path-to-iter-dir> <origin-hf-model-dir>
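
If a training run has saved several checkpoints, the following sketch picks the most recent iteration directory to pass as `<path-to-iter-dir>`. It assumes Megatron-style `iter_<step>` directory names under the save directory, which this README does not confirm; `latest_iter_dir` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Optional, Union

ITER_DIR_RE = re.compile(r"^iter_(\d+)$")  # assumed Megatron-style naming


def latest_iter_dir(save_dir: Union[str, Path]) -> Optional[Path]:
    """Return the iter_* subdirectory with the highest step number, if any."""
    candidates = []
    for child in Path(save_dir).iterdir():
        match = ITER_DIR_RE.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```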

Then run evaluation:

MODEL_PATH=<path-to-hf-checkpoint> \
TASK_FILE=openwebrl/data/online-mind2web.jsonl \
bash scripts/run_evaluation.sh

For local-process smoke tests:

MODEL_PATH=<path-to-hf-checkpoint> \
TASK_FILE=openwebrl/data/online-mind2web.jsonl \
bash scripts/run_evaluation_local.sh

🙏 Acknowledgements

This repository builds on slime, SGLang, Megatron-LM, Megatron-Bridge, Playwright, and the open-source VLM/web-agent ecosystem. We also thank Qwen for releasing the base VLMs used in our experiments, and WebGym for providing the initial browser-task data source.

📚 Citation

@misc{yang2026openwebrldemystifyingonlinemultiturn,
      title={OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents},
      author={Rui Yang and Qianhui Wu and Yuxi Chen and Hao Bai and Wenlin Yao and Hao Cheng and Baolin Peng and Huan Zhang and Tong Zhang and Jianfeng Gao},
      year={2026},
      eprint={2606.02031},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.02031},
}
