sailing-lab/sira

SiRA: General Agentic Planning through Simulative Reasoning with World Models

SiRA architecture

Introduction

What does it mean for an agent to plan, rather than merely react? Most agentic systems are reactive (System I): they pick the next action directly from the current observation, at most spending extra undifferentiated compute such as a longer chain of thought. This limits generalization, because each new task or environment demands re-engineering rather than transfer of a shared reasoning capacity.

SiRA argues that planning should instead be grounded in simulative reasoning (System II): proposing candidate actions, predicting their consequences through a world model, and choosing behavior from those predicted future states. Because it simulates state transitions rather than pattern-matching domain-specific responses, simulative reasoning is a general-purpose planning mechanism that transfers across tasks and environments without re-engineering.

SiRA (Simulative Reasoning Architecture) instantiates this idea as a modular, model-agnostic agent:

  • an encoder that maps observations into natural-language belief states
  • a policy that proposes abstract candidate actions
  • a world model that predicts future belief states
  • a critic that estimates goal progress
  • an actor that translates the selected abstract action into environment commands
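The five modules above can be wired into a single planning step, sketched below. This is a minimal illustration, not the repository's implementation: the class name `SimulativePlanner`, the callable signatures, and the toy components are all hypothetical, and it looks only one step ahead. The `reactive_step` method is likewise an assumed stand-in for the reactive mode, taking the policy's first proposal without simulation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SimulativePlanner:
    """Hypothetical one-step planner; the names are illustrative, not the repo's API."""
    encoder: Callable[[str], str]            # observation -> natural-language belief state
    policy: Callable[[str], List[str]]       # belief -> candidate abstract actions
    world_model: Callable[[str, str], str]   # (belief, action) -> predicted next belief
    critic: Callable[[str, str], float]      # (predicted belief, goal) -> progress estimate
    actor: Callable[[str, str], str]         # (belief, action) -> environment command

    def step(self, observation: str, goal: str) -> str:
        """Simulative mode: score each candidate by its predicted outcome."""
        belief = self.encoder(observation)
        candidates = self.policy(belief)
        if not candidates:
            raise ValueError("policy proposed no candidate actions")
        best = max(
            candidates,
            key=lambda action: self.critic(self.world_model(belief, action), goal),
        )
        return self.actor(belief, best)

    def reactive_step(self, observation: str) -> str:
        """Assumed reactive mode: act on the first proposal, with no simulation."""
        belief = self.encoder(observation)
        candidates = self.policy(belief)
        if not candidates:
            raise ValueError("policy proposed no candidate actions")
        return self.actor(belief, candidates[0])


# Toy stand-ins for the LLM-backed modules (purely illustrative).
def toy_encoder(observation: str) -> str:
    return f"page shows: {observation}"


def toy_policy(belief: str) -> List[str]:
    return ["scroll down", "open search form"]


def toy_world_model(belief: str, action: str) -> str:
    return f"{belief}; after '{action}'"


def toy_critic(predicted_belief: str, goal: str) -> float:
    # Count how many goal words appear in the predicted belief state.
    return float(sum(word in predicted_belief for word in goal.split()))


def toy_actor(belief: str, action: str) -> str:
    return f"execute: {action}"


planner = SimulativePlanner(toy_encoder, toy_policy, toy_world_model, toy_critic, toy_actor)
print(planner.step("homepage", "search flights"))
print(planner.reactive_step("homepage"))
```

With these toy components the two modes disagree: the simulative step prefers the action whose predicted state better matches the goal, while the reactive step simply takes the first proposal.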

This repository releases the paper's browser-environment instantiation, the SiRA web agent, with evaluation scripts for FlightQA, FanOutQA, and WebArena. Across these qualitatively distinct tasks, simulative reasoning yields a consistent gain over a matched reactive baseline, indicating the benefit comes from generalizable counterfactual evaluation rather than task-specific tuning.

Release Scope

This repository contains:

  • the SiRA web agent for open-ended browser tasks through BrowserGym
  • matched reactive and simulative planning modes
  • the OpenHands BrowsingAgent baseline used for comparison
  • FlightQA and FanOutQA evaluation assets
  • WebArena inference and result-analysis helpers
  • a Gradio log visualizer for browser trajectories

It does not contain model training code. The released implementation uses API models through LiteLLM.

Installation

This project is managed with uv.

uv sync --extra eval
uv run playwright install chromium

Create an environment file:

cp .env.example .env
# Edit .env and set OPENAI_API_KEY or SIRA_API_KEY.

API keys

The runner resolves the LLM provider key in this order:

  1. the --api_key command-line flag, if passed;
  2. the SIRA_API_KEY environment variable;
  3. the OPENAI_API_KEY environment variable.
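The resolution order above amounts to a simple first-match lookup. The sketch below is only an illustration of that order; the function name `resolve_api_key` and its parameters are hypothetical and are not taken from the runner's code.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first available key: --api_key, SIRA_API_KEY, then OPENAI_API_KEY."""
    env = os.environ if env is None else env
    if cli_api_key:
        return cli_api_key
    for name in ("SIRA_API_KEY", "OPENAI_API_KEY"):
        value = env.get(name)
        if value:
            return value
    return None  # no key configured anywhere


# Example with an explicit environment mapping instead of the real os.environ.
fake_env = {"SIRA_API_KEY": "sira-key", "OPENAI_API_KEY": "openai-key"}
print(resolve_api_key(None, fake_env))        # SIRA_API_KEY wins over OPENAI_API_KEY
print(resolve_api_key("cli-key", fake_env))   # the command-line flag wins over both
```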

SIRA_API_KEY is simply a SiRA-branded alias for the same provider key; set it if you want SiRA to use a key distinct from the OPENAI_API_KEY in your shell.

Note: values in .env do not override variables already exported in your shell, so an OPENAI_API_KEY exported in your shell takes precedence over the OPENAI_API_KEY in .env. To use a different key, unset the shell variable, set SIRA_API_KEY (which is checked before OPENAI_API_KEY), or pass --api_key explicitly.
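The non-overriding behaviour described in the note can be pictured as a "set only if missing" merge. The sketch below assumes nothing about how the runner actually loads .env; the helper `apply_dotenv` is hypothetical, and it works on plain dictionaries so it does not touch your real environment.

```python
from typing import Dict, Mapping


def apply_dotenv(dotenv_values: Mapping[str, str], env: Dict[str, str]) -> None:
    """Copy .env values into env, keeping any variable that is already set."""
    for name, value in dotenv_values.items():
        env.setdefault(name, value)


# A shell that already exports OPENAI_API_KEY, plus a .env that sets two keys.
shell_env = {"OPENAI_API_KEY": "key-from-shell"}
dotenv_file = {"OPENAI_API_KEY": "key-from-dotenv", "SIRA_API_KEY": "sira-from-dotenv"}

apply_dotenv(dotenv_file, shell_env)
print(shell_env["OPENAI_API_KEY"])  # the exported shell value is kept
print(shell_env["SIRA_API_KEY"])    # the .env value fills the unset variable
```

Because SIRA_API_KEY is checked before OPENAI_API_KEY, setting it in .env is one way to sidestep a conflicting key exported in the shell.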

If you plan to run WebArena, first follow the upstream WebArena environment setup, then install the WebArena extra with uv sync --extra eval --extra webarena, and finally export the required site URLs before launching inference.

Quick Start

Run a single open-ended browser task with the default model (gpt-4o):

uv run python scripts/run_web_agent.py demo \
  --query "go to google flights" \
  --mode simulative

Run the matched reactive baseline inside the same SiRA pipeline:

uv run python scripts/run_web_agent.py demo_reactive \
  --query "go to google flights" \
  --mode reactive

Available model names are:

  • gpt-4o
  • o1
  • o3-mini
  • deepseek-chat
  • deepseek-reasoner

You can also pass a JSON model config from configs/, for example:

uv run python scripts/run_web_agent.py demo_o1 \
  --query "go to google flights" \
  --model configs/model_o1_config.json

Evaluation

Task completion rate by reasoning method across the three task categories

Across constrained navigation (FlightQA), multi-hop information aggregation (FanOutQA), and general instruction following (WebArena), simulative reasoning (System II) consistently outperforms the matched reactive policy (System I) and the OpenHands BrowsingAgent baseline. The steps below reproduce these runs.

Run dataset inference:

uv run python scripts/run_web_agent.py fanout_sim \
  --dataset fanout \
  --mode simulative \
  --end_idx 100

uv run python scripts/run_web_agent.py flight_sim \
  --dataset flightqa \
  --mode simulative

Evaluate FanOutQA:

uv run python evaluation/fanout/run.py fanout_sim \
  --browsing_data_dir browsing_data \
  --groundtruth_path data/fanout-final-dev.json \
  --start_idx 0 \
  --end_idx 100

Evaluate FlightQA:

uv run python evaluation/flight/run.py flight_sim \
  --browsing_data_dir browsing_data \
  --questions_path data/flightqa_counterfactual.csv

WebArena inference examples are in:

evaluation/webarena/run_inference.sh

Log Visualizer

uv run python scripts/visualize_logs.py

The visualizer loads JSON traces from browsing_data/ and shows screenshots, observations, belief states, plans, and executed actions.

Repository Structure

src/sira/agent/       Core SiRA web-agent modules
src/sira/search/      Minimal search and world-model interfaces
src/sira/web/         Browser runtime, utilities, visualizer, and baseline
scripts/              Runnable scripts
evaluation/           FanOutQA, FlightQA, and WebArena evaluation helpers
data/                 Released evaluation data
configs/              Module-wise model configs
assets/               Images for docs and visualization

Citation

@article{deng2025sira,
  title={General Agentic Planning Through Simulative Reasoning with World Models},
  author={Deng, Mingkai and Hou, Jinyu and Hu, Zhiting and Xing, Eric},
  journal={arXiv preprint arXiv:2507.23773},
  year={2025}
}

License

This project is released under the Apache License 2.0.
