What does it mean for an agent to plan, rather than merely react? Most agentic systems are reactive (System I): they pick the next action from the current observation, with at most undifferentiated extra compute such as longer chain-of-thought. This limits generalization, as each new task or environment demands re-engineering rather than transfer of a shared reasoning capacity.
SiRA argues that planning should instead be grounded in simulative reasoning (System II): proposing candidate actions, predicting their consequences through a world model, and choosing behavior from those predicted future states. Because it simulates state transitions rather than pattern-matching domain-specific responses, simulative reasoning is a general-purpose planning mechanism that transfers across tasks and environments without re-engineering.
SiRA (Simulative Reasoning Architecture) instantiates this idea as a modular, model-agnostic agent:
- an encoder that maps observations into natural-language belief states
- a policy that proposes abstract candidate actions
- a world model that predicts future belief states
- a critic that estimates goal progress
- an actor that translates the selected abstract action into environment commands
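The way these five modules fit together in one decision step can be sketched as follows. This is a minimal illustration, not the repository's implementation: the class name, method names, and callable signatures are assumptions made for this example, and the stub functions stand in for LLM-backed modules. The reactive step is included only to show the contrast with a matched reactive mode.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SimulativePlanner:
    """Hypothetical wiring of the five modules; all signatures are assumptions."""
    encode: Callable[[str], str]              # observation -> belief state
    propose: Callable[[str, str], List[str]]  # (belief, goal) -> candidate abstract actions
    predict: Callable[[str, str], str]        # (belief, action) -> predicted next belief
    score: Callable[[str, str], float]        # (predicted belief, goal) -> goal progress
    act: Callable[[str, str], str]            # (belief, action) -> environment command

    def simulative_step(self, observation: str, goal: str) -> str:
        """Simulate each candidate with the world model and keep the best-scoring one."""
        belief = self.encode(observation)
        candidates = self.propose(belief, goal)
        if not candidates:
            raise ValueError("policy proposed no candidate actions")
        best = max(candidates, key=lambda a: self.score(self.predict(belief, a), goal))
        return self.act(belief, best)

    def reactive_step(self, observation: str, goal: str) -> str:
        """Take the policy's first proposal without simulating its outcome."""
        belief = self.encode(observation)
        candidates = self.propose(belief, goal)
        if not candidates:
            raise ValueError("policy proposed no candidate actions")
        return self.act(belief, candidates[0])


# Toy stand-ins for the LLM-backed modules (illustrative only).
planner = SimulativePlanner(
    encode=lambda obs: f"belief: {obs}",
    propose=lambda belief, goal: ["click search", "open flights tab"],
    predict=lambda belief, action: f"{belief} -> after {action}",
    score=lambda predicted, goal: 1.0 if "flights" in predicted else 0.0,
    act=lambda belief, action: f"cmd({action})",
)
```

With these stubs, `planner.simulative_step("homepage", "find flights")` selects the second candidate because its predicted state scores higher, while `planner.reactive_step(...)` simply executes the first proposal.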
This repository releases the paper's browser-environment instantiation, the SiRA web agent, with evaluation scripts for FlightQA, FanOutQA, and WebArena. Across these qualitatively distinct tasks, simulative reasoning yields a consistent gain over a matched reactive baseline, indicating the benefit comes from generalizable counterfactual evaluation rather than task-specific tuning.
This repository contains:
- the SiRA web agent for open-ended browser tasks through BrowserGym
- matched reactive and simulative planning modes
- the OpenHands BrowsingAgent baseline used for comparison
- FlightQA and FanOutQA evaluation assets
- WebArena inference and result-analysis helpers
- a Gradio log visualizer for browser trajectories
It does not contain model training code. The released implementation uses API models through LiteLLM.
This project is managed with uv.
uv sync --extra eval
uv run playwright install chromium
Create an environment file:
cp .env.example .env
# Edit .env and set OPENAI_API_KEY or SIRA_API_KEY.
The runner resolves the LLM provider key in this order:
- the --api_key command-line flag, if passed;
- the SIRA_API_KEY environment variable;
- the OPENAI_API_KEY environment variable.
SIRA_API_KEY is simply a SiRA-branded alias for the same provider key; set it
if you want SiRA to use a key distinct from the OPENAI_API_KEY in your shell.
Note: values in .env do not override variables already exported in your shell. If you have an OPENAI_API_KEY set in your environment, it takes precedence over the one in .env. Either unset it, or pass --api_key explicitly.
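The lookup order and the .env precedence rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the runner's actual code: both function names are hypothetical, how the runner really loads .env is not shown here, and treating an empty string as "not set" is a choice made for this example.

```python
from typing import Dict, Mapping, Optional


def merge_dotenv(shell_env: Mapping[str, str], dotenv_values: Mapping[str, str]) -> Dict[str, str]:
    """Combine .env values with the shell environment; exported shell variables win."""
    merged = dict(dotenv_values)
    merged.update(shell_env)  # shell values overwrite any .env value for the same name
    return merged


def resolve_api_key(cli_key: Optional[str], env: Mapping[str, str]) -> Optional[str]:
    """Return the provider key: --api_key first, then SIRA_API_KEY, then OPENAI_API_KEY."""
    if cli_key:  # assumption: an empty --api_key counts as not passed
        return cli_key
    return env.get("SIRA_API_KEY") or env.get("OPENAI_API_KEY") or None


# An exported OPENAI_API_KEY shadows the one in .env ...
env = merge_dotenv({"OPENAI_API_KEY": "shell-key"}, {"OPENAI_API_KEY": "dotenv-key"})
shadowed = resolve_api_key(None, env)      # "shell-key"

# ... and passing --api_key bypasses both.
explicit = resolve_api_key("cli-key", env)  # "cli-key"
```

Under the stated order, a SIRA_API_KEY from either source is consulted before OPENAI_API_KEY, which is why it can serve as a key distinct from the one exported in your shell.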
If you plan to run WebArena, follow the upstream WebArena environment setup
first, install the WebArena extra with uv sync --extra eval --extra webarena,
and export the required site URLs before launching inference.
Run a single open-ended browser task with the default model (gpt-4o):
uv run python scripts/run_web_agent.py demo \
--query "go to google flights" \
--mode simulative
Run the matched reactive baseline inside the same SiRA pipeline:
uv run python scripts/run_web_agent.py demo_reactive \
--query "go to google flights" \
--mode reactive
Available model names are:
- gpt-4o
- o1
- o3-mini
- deepseek-chat
- deepseek-reasoner
You can also pass a JSON model config from configs/, for example:
uv run python scripts/run_web_agent.py demo_o1 \
--query "go to google flights" \
--model configs/model_o1_config.json
Across constrained navigation (FlightQA), multi-hop information aggregation (FanOutQA), and general instruction following (WebArena), simulative reasoning (System II) consistently outperforms the matched reactive policy (System I) and the OpenHands BrowsingAgent baseline. The steps below reproduce these runs.
Run dataset inference:
uv run python scripts/run_web_agent.py fanout_sim \
--dataset fanout \
--mode simulative \
--end_idx 100
uv run python scripts/run_web_agent.py flight_sim \
--dataset flightqa \
--mode simulative
Evaluate FanOutQA:
uv run python evaluation/fanout/run.py fanout_sim \
--browsing_data_dir browsing_data \
--groundtruth_path data/fanout-final-dev.json \
--start_idx 0 \
--end_idx 100
Evaluate FlightQA:
uv run python evaluation/flight/run.py flight_sim \
--browsing_data_dir browsing_data \
--questions_path data/flightqa_counterfactual.csv
WebArena inference examples are in:
evaluation/webarena/run_inference.sh
To inspect browser trajectories, launch the Gradio log visualizer:
uv run python scripts/visualize_logs.py
The visualizer loads JSON traces from browsing_data/ and shows screenshots,
observations, belief states, plans, and executed actions.
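Before opening the visualizer, it can help to check which traces a run actually produced. The sketch below is not part of the repository: the function name is hypothetical, and because the trace schema is not documented here, it makes no assumption about field names and only reports the top-level keys of each JSON file found under a log directory.

```python
import json
from pathlib import Path
from typing import Dict, List


def summarize_traces(log_dir: str) -> Dict[str, List[str]]:
    """Map each JSON file under log_dir to its sorted top-level keys.

    Files whose top level is not a JSON object are reported by type name,
    and files that fail to parse are marked as unreadable.
    """
    root = Path(log_dir)
    summaries: Dict[str, List[str]] = {}
    for path in sorted(root.rglob("*.json")):
        name = path.relative_to(root).as_posix()
        try:
            with path.open(encoding="utf-8") as handle:
                data = json.load(handle)
        except (OSError, json.JSONDecodeError):
            summaries[name] = ["<unreadable>"]
            continue
        if isinstance(data, dict):
            summaries[name] = sorted(data.keys())
        else:
            summaries[name] = [f"<{type(data).__name__}>"]
    return summaries
```

For example, `summarize_traces("browsing_data")` returns an empty dictionary when no traces exist yet, which is a quick way to confirm that a run wrote its logs where the visualizer expects them.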
src/sira/agent/     Core SiRA web-agent modules
src/sira/search/    Minimal search and world-model interfaces
src/sira/web/       Browser runtime, utilities, visualizer, and baseline
scripts/            Runnable scripts
evaluation/         FanOutQA, FlightQA, and WebArena evaluation helpers
data/               Released evaluation data
configs/            Module-wise model configs
assets/             Images for docs and visualization
@article{deng2025sira,
title={General Agentic Planning Through Simulative Reasoning with World Models},
author={Deng, Mingkai and Hou, Jinyu and Hu, Zhiting and Xing, Eric},
journal={arXiv preprint arXiv:2507.23773},
year={2025}
}
This project is released under the Apache License 2.0.