(ICLR'26 Oral + ICML'26) Learning to Seek and Use Information: Agentic Active Reasoning under Partial Observability
This repository is a unified codebase for our research line on learning to seek and use information under partial observatory in LLM agents. It includes the official implementations of T3 (Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents, ICLR 2026 Oral) and AREW (On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM Agents, ICML 2026).
The two works address a shared problem: in long-horizon interactive reasoning, LLM agents must actively acquire information and maintain an accurate belief state. However, standard outcome-based RL often provides too little structure to learn these coupled abilities. T3 studies how belief deviation can make later trajectory segments uninformative or even harmful for learning, while AREW studies how action selection and belief tracking can mutually mask each other’s learning signal, leading to information self-locking.
This repository contains the core algorithms, data preprocessing, training and evaluation pipelines, and experimental setups for both methods. Beyond reproducing the original papers, we are extending the codebase into a broader platform for studying RL-trained LLM agents in realistic active reasoning, tool-use, and deep-research environments.
- We have applied T3 and AREW to tau2-bench and release the code and results in this repo. Refer to this section: Applicability to General Agentic Scenarios. Results on the effectiveness of T3 and AREW over Deep-Research and SWE settings will be released.
- TODOs
- Environment Setup
- Data Preparation
- Training
- Evaluation and Reproduction
- Applicability to General Agentic Scenarios
- Extending the Repository
- Repository Structure
- Citation
The packaged code lives under verl/, so installation is done from that subdirectory.
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e ./verlFor the broader dependency set used by the bundled verl fork:
pip install -r verl/requirements.txtNotes: the following is the version of key packages in the environment we are currently using:
- Python 3.11
- PyTorch 2.7.1
- vLLM 0.10.1
- Ray 2.10.0.19
- Transformers 4.55.4
- flash-attn 2.8.0.post2
- accelerate 1.10.1
The released preprocessing scripts write parquet files in the format expected by main_ppo.py. The dataset (parquet formats) can be found here.
Default raw file location:
<workspace>/CircuitDecoding/cd_raw_2_circuits_<cand>_cand.jsonl
Conversion:
python3 verl/preprocess/data_process/cd.py \
--local_dir /path/to/workspaceThis produces:
CircuitDecoding/train__cand_20.parquetCircuitDecoding/val__cand_20.parquet
The script also supports --input_file, --train_size, --val_size, --train_output, and --val_output.
Default raw file location:
<workspace>/MovieRec/mr_seen_10_un_10_attr_8.jsonl
Conversion:
python3 verl/preprocess/data_process/mr.py \
--local_dir /path/to/workspaceThis produces:
MovieRec/train_seen_10_un_10_attr_8_variant.parquetMovieRec/val_seen_10_un_10_attr_8_variant.parquet
The script also supports overriding the input path, output path, split sizes, data_source, and controller variant.
Tau2Bench data is generated from the task definitions under verl/search_r1/tau2_adapter/.
Solo-mode example:
python3 verl/preprocess/data_process/tau2.py \
--local_dir /path/to/workspace \
--domain telecom \
--train_split train \
--val_split test \
--enable_think \
--think_mode shortThis writes parquet files under:
/path/to/workspace/Tau2Bench/telecom/
If you need filenames aligned with a specific training wrapper, set --train_output and --val_output, or override TRAIN_FILE and VAL_FILE when launching training.
Standard-mode example with an LLM-simulated user:
python3 verl/preprocess/data_process/tau2.py \
--local_dir /path/to/workspace \
--domain telecom \
--source_split full \
--exclude_splits test \
--custom_split_name full_minus_test \
--train_ratio 0.9 \
--seed 42 \
--mode standard \
--enable_think \
--think_mode shortThis produces standard-mode files such as (the following two files are public here.):
Tau2Bench/telecom/train_full_minus_test_9009_standard_think_short.parquetTau2Bench/telecom/val_full_minus_test_9009_standard_think_short.parquet
The AREW release tasks are implemented in verl/search_r1/arew_tasks/ and use the same PPO entry point as T3. Each parquet row should contain:
prompt: the initial user prompt in chat formatcontroller: the task controller consumed by the AREW environmentanswer: the final rule-based reward targetdata_source.
The training wrappers default to the processed parquet names used by the paper runs (the following files are public here.):
verl/data/PE-G/train_dtg_S3.parquetandverl/data/PE-G/val_dtg_S3.parquetverl/data/PE-F/train_dtg_full_attr8_v2.parquetandverl/data/PE-F/val_dtg_full_attr8_v2.parquetverl/data/MediQ/train_beta_known_dtg.parquetandverl/data/MediQ/val_beta_known_dtg.parquetverl/data/FloDial/train_beta_label_opt5_max2.parquetandverl/data/FloDial/val_beta_label_opt5_max2.parquet
You can override these paths with TRAIN_FILE and VAL_FILE.
The canonical entry point is:
python3 -m verl.trainer.main_ppo ...The example scripts under verl/cmd/ are thin wrappers around this command.
For CircuitDecoding and MovieRec, we provide T3 implementation. For Tau2Bench, we provide both T3 and AREW implementation.
bash verl/cmd/cd/ppo.shKey environment overrides:
PROJECT_ROOTDATA_DIRBASE_MODELOUTPUT_ROOTNUM_GPUSTRAIN_FILEVAL_FILE
bash verl/cmd/mrv/ppo.shThe structure is the same as CircuitDecoding, with MovieRec-specific default parquet names.
For solo mode,
bash verl/cmd/tau2/ppo1.1.sh # This is for the solo-mode.verl/cmd/tau2/ppo.shfor a PPO-style baselineverl/cmd/tau2/ppo1.1.shand related variants for T3-enabled settingsverl/cmd/tau2/ppo1.2.shand related variants for AREW-enabled settings
For Tau2Bench standard mode with an LLM-simulated user:
export TAU2_STANDARD_USER_API_KEY=...
export TAU2_STANDARD_USER_AZURE_ENDPOINT=...
bash verl/cmd/tau2s/ppo.sh
bash verl/cmd/tau2s/ppo_AREW.shThe tau2s scripts default to Qwen2.5-14B-Instruct and the standard telecom think-short parquet files. ppo.sh is the vanilla PPO baseline. ppo_AREW.sh reproduces the v427 Tau2-standard AREW setup: targeted labels, step_abs_sum, AREW_BONUS_SCALE=50.0, tau2_arew_min_turn=6, and early_cut=false.
bash verl/cmd/pe/ppo_pe_g.sh
bash verl/cmd/pe/ppo_pe_f.sh
bash verl/cmd/mediq/ppo.sh
bash verl/cmd/flodial/ppo.shThe four scripts expose the main AREW knobs through environment variables:
AS_BONUSandBT_BONUSenable action-selection and belief-tracking critiques.BONUS_MODEdefaults tominority_fixed.BONUS_SCALEcontrols the zero-sum advantage bonus magnitude. The default paper AS&BT scales are0.2for PE-G,1.0for PE-F and MediQ, and0.5for FloDial.AS_CFandBT_CFare only used by MediQ; PE-G, PE-F, and FloDial ignore counterfactual generation.TRAIN_FILE,VAL_FILE,BASE_MODEL,NPC_MODEL, andOUTPUT_ROOToverride data, model, and output paths.
Evaluation is run through the same PPO entry point in validation-only mode, eg,
bash verl/cmd/cd/eval.shThe script merges FSDP checkpoints to Hugging Face format before validation.
T3 is intended to be applicable beyond a single benchmark or environment family.
Many thanks to https://github.com/sierra-research/tau2-bench!
We evaluate on Tau2Bench-Telecom, a multi-turn tool-use benchmark where the agent must resolve realistic customer-service tickets by interacting with an environment through API-like tools. The repository includes both the solo mode setting, where the policy interacts directly with the environment/tool interface, and the standard setting, where the policy interacts with an LLM-simulated user while assistant-side tool calls are executed by the backend environment.
For the solo setting, we derive simple step-level signals directly from the online interaction trace: a step is labeled positive if it increases the number of matched expected actions in the benchmark evaluator, negative if it corresponds to an obvious failure such as a tool error, invalid or malformed action, repeated action, or a write that has no effect, and neutral otherwise. AREW uses these labels to perform within-trajectory advantage redistribution. T3 uses the same signals for trajectory truncation; in our current Tau2Bench setup we use a conservative soft truncation policy with trunc_strength = 8 and set the hard truncation threshold to 999, which effectively disables hard truncation. See details in verl/search_r1/tau2_adapter.
For Tau2-standard, we automatically derive weak step-level critiques from the interaction traces between the agent, tools, and the LLM-simulated user. Positive labels are assigned to steps that uncover new information, advance task completion, or elicit informative user-side diagnostics, while negative labels are assigned to invalid, repetitive, or unproductive behaviors. The resulting labels are transformed into token-level advantage modifiers and incorporated into PPO training as a lightweight auxiliary signal alongside the original outcome reward.
Here are the main experimental results. Weclome to check more experimental analysis and implementation details in the source code and in our paper!
Solo mode: Comparing vanilla PPO with PPO equipped with T3
150 steps training on Qwen-2.5-7B
Solo mode: Comparing vanilla PPO with PPO equipped with AREW
150 steps training on Qwen-2.5-7B
Standard mdoe: Comparing vanilla PPO with PPO equipped with AREW
70 steps training on Qwen-2.5-14B
The default data format consumed by create_rl_dataset() in verl/verl/trainer/main_ppo.py expects records with fields such as:
promptanswerdata_sourceabilityreward_modelextra_info
If the task also needs custom environment metadata or reward-time controller information, include a controller field as done by the released T3 datasets.
For non-standard loading logic, you can either:
- emit the same parquet schema used by the existing preprocessing scripts
- provide a custom dataset through
data.custom_clsin the Hydra config
For example, Tau2-style tasks are organized under verl/search_r1/tau2_adapter/. The main extension points are:
- add task data under
verl/search_r1/tau2_adapter/data/domains/<domain>/ - implement domain environments and tools under
verl/search_r1/tau2_adapter/domains/<domain>/ - register the environment in
verl/search_r1/tau2_adapter/loader/registry.py - keep the rollout contract compatible with
Tau2SoloSpaceinverl/search_r1/tau2_adapter/space.py
.
├── 8182_Reducing_Belief_Deviation.pdf
├── README.md
└── verl/
├── cmd/ # training and evaluation wrappers
├── preprocess/ # data conversion scripts
├── search_r1/ # interactive environments and rollout helpers
└── verl/
└── trainer/
├── main_ppo.py
└── ppo/ray_trainer.py
If you use this repository, please cite the T3 and AREW paper. If your use also depends on the underlying framework components, please additionally cite verl.
@inproceedings{zoureducing,
title={Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents},
author={Zou, Deyu and Chen, Yongqiang and Wang, Jianxiang and YANG, Garry and Li, Mufei and Da, Qing and Cheng, James and Li, Pan and Gong, Yu},
booktitle={The Fourteenth International Conference on Learning Representations}
}@article{zou2026information,
title={On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents},
author={Zou, Deyu and Chen, Yongqiang and Feng, Fan and Li, Mufei and Li, Pan and Gong, Yu and Cheng, James},
journal={arXiv preprint arXiv:2603.12109},
year={2026}
}