BEACON

Milestone-Guided Policy Learning for Long-Horizon Language Agents


Zixuan Wang1,2, Yuchen Yan1, Hongxing Li1, Teng Pan1,2, Dingming Li1, Ruiqing Zhang2,
Weiming Lu1, Jun Xiao1, Yueting Zhuang1, Yongliang Shen1,†
1Zhejiang University  ·  2Baidu Inc.
†Corresponding author

This repository contains the official implementation of BEACON, a milestone-guided policy learning framework that addresses two pathologies of trajectory-level RL on long-horizon language-agent tasks: credit misattribution (correct early actions penalized by terminal failure) and sample inefficiency (partial successes wasted under sparse rewards). The implementation is built on top of verl-agent; only the components contributed by this work are documented here.

Highlights

  • Consistent gains on three long-horizon benchmarks with a single set of hyperparameters ($\gamma=0.95$, $\lambda=1.0$).
  • Horizon-dependent gains. On ALFWorld Long tasks, BEACON reaches 92.9% vs. 53.5% for GRPO. Relative gains over GRPO scale from +26.2% (Short) to +73.6% (Long).
  • Recovers learning signal from partial successes. Effective sample utilization improves from 23.7% to 82.0% on ALFWorld.
  • Outperforms behavior cloning. 91.4% vs. 43% for SFT on oracle trajectories — gains stem from policy optimization, not milestone imitation.


Main results. BEACON outperforms GRPO and GiGPO across ALFWorld, ScienceWorld, and WebShop at both 1.5B and 7B scales.

Method

BEACON operates in three stages:

  1. Trajectory partitioning. A milestone indicator $\Phi$ identifies verifiable subgoal-completion transitions, splitting each trajectory into segments at milestone boundaries. $\Phi$ is environment-defined and requires no learned model: ALFWorld uses object/state predicates, WebShop uses page-transition phases, ScienceWorld exposes subgoal_completed directly.
  2. Temporal reward shaping. Within each completed segment, actions receive shaped reward $r_t = R_{\text{ms}} \cdot \gamma^{t_k - t}$, giving graduated positive credit to actions leading up to a milestone and converting partial successes into learning signal.
  3. Dual-scale advantage estimation. Trajectory-level advantage (GRPO-style) captures global task performance; segment-level advantage compares only among trajectories that reached the same milestone, isolating local action quality from variance in later segments. The two are combined as $\hat{A}_{i,t} = A^{\text{traj}}_i + \lambda \cdot A^{\text{seg}}_{i,t}$.
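The shaping and advantage computations above can be sketched in a few lines. This is a minimal illustration under our own simplifying assumptions (the milestone completes at the segment's final step, and advantages are group-normalized as in GRPO); the function names are ours, not the repository's API.

```python
import numpy as np

GAMMA = 0.95  # temporal discount used across all experiments
LAM = 1.0     # weight on the segment-level advantage

def shape_segment(num_steps, r_ms=1.0, gamma=GAMMA):
    """Shaped reward r_t = r_ms * gamma^(t_k - t) for a completed segment.

    Assumes the milestone transition is the segment's last step t_k, so
    actions closer to the milestone receive larger positive credit.
    """
    t_k = num_steps - 1
    return [r_ms * gamma ** (t_k - t) for t in range(num_steps)]

def dual_scale_advantage(traj_returns, seg_returns, lam=LAM):
    """Combine a group-normalized trajectory advantage with a segment
    advantage computed only over trajectories that reached the same
    milestone: A_hat = A_traj + lam * A_seg."""
    traj = np.asarray(traj_returns, dtype=float)
    seg = np.asarray(seg_returns, dtype=float)
    a_traj = (traj - traj.mean()) / (traj.std() + 1e-8)
    a_seg = (seg - seg.mean()) / (seg.std() + 1e-8)
    return a_traj + lam * a_seg
```

With $\gamma=0.95$, a three-step segment receives shaped rewards of roughly 0.90, 0.95, and 1.0, so even a trajectory that later fails carries graduated positive signal up to each milestone it reached.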

At update time, BEACON automatically routes each batch based on which milestone field is present (trial_id for ALFWorld, milestone_achieved for WebShop, subgoal_completed for ScienceWorld), so a single training pipeline supports all three environments without environment-specific code paths in the trainer.
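The routing step can be sketched as a simple field-based dispatch. The field names come from the paragraph above; the mapping structure and function name are ours, shown only to illustrate the idea, not the trainer's actual code.

```python
# Milestone field -> environment, as described for the unified trainer.
MILESTONE_FIELDS = {
    "trial_id": "alfworld",
    "milestone_achieved": "webshop",
    "subgoal_completed": "sciworld",
}

def route_batch(batch: dict) -> str:
    """Pick the milestone handling for a batch by inspecting which
    environment-specific field it carries."""
    for field, env in MILESTONE_FIELDS.items():
        if field in batch:
            return env
    raise KeyError("batch carries no known milestone field")
```

Because the dispatch keys on data rather than configuration, the same trainer loop can consume mixed batches from any of the three environments.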

Repository layout

migpo/                   # BEACON core: advantage / step-reward computation and milestone detector
agent_system/            # ALFWorld, WebShop, and ScienceWorld environment integrations
examples/migpo_trainer/  # Paper-locked training scripts (one per environment)

Everything else is inherited from the upstream verl-agent framework.

Installation

1. Base framework

conda create -n verl-agent python==3.12 -y
conda activate verl-agent

pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.5

Each environment below is best installed in its own dedicated conda environment to avoid dependency conflicts.

2. ALFWorld

pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0
pip3 install alfworld
pip3 install vllm==0.8.5

# Download PDDL & Game files and the pre-trained MaskRCNN detector
alfworld-download -f

3. WebShop

WebShop requires Python ≤ 3.10:

conda create -n verl-agent-webshop python==3.10 -y
conda activate verl-agent-webshop

cd ./agent_system/environments/env_package/webshop/webshop
./setup.sh -d all

cd repo_root/
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.2

4. ScienceWorld

ScienceWorld requires Java 1.8+ and Python ≤ 3.10:

conda create -n verl-agent-sciworld python==3.10 -y
conda activate verl-agent-sciworld

cd repo_root/
pip3 install torch==2.6.0+cu124 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.2

# Java via conda (not system-wide)
conda install -c conda-forge openjdk=11 -y

# ScienceWorld ships its own bundled JAR and the py4j bridge
pip install scienceworld

Variation indices used by our experiments are included at agent_system/environments/env_package/sciworld/variations_idx/.

Sanity check:

python -c "from scienceworld import ScienceWorldEnv; print('ScienceWorld import successful')"

Training

Paper-locked training scripts (Qwen2.5-1.5B-Instruct, single 8-GPU node) live in examples/migpo_trainer/:

bash examples/migpo_trainer/run_alfworld.sh    # ALFWorld
bash examples/migpo_trainer/run_webshop.sh     # WebShop
bash examples/migpo_trainer/run_sciworld.sh    # ScienceWorld

Acknowledgement

This codebase builds on verl-agent, which itself extends veRL. We thank the authors of those projects, and the maintainers of the supported environments — ALFWorld, WebShop, and ScienceWorld.

Citation

@misc{wang2026milestoneguidedpolicylearninglonghorizon,
  title         = {Milestone-Guided Policy Learning for Long-Horizon Language Agents},
  author        = {Zixuan Wang and Yuchen Yan and Hongxing Li and Teng Pan and Dingming Li and Ruiqing Zhang and Weiming Lu and Jun Xiao and Yueting Zhuang and Yongliang Shen},
  year          = {2026},
  eprint        = {2605.06078},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.06078},
}

About

[ICML 2026] Milestone-Guided Policy Learning for Long-Horizon Language Agents
