TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

Official codebase for TCOD, a temporal curriculum framework for on-policy distillation that stabilizes knowledge transfer from teacher to student agents in multi-turn interactive environments.

🔥 News

[2026-06] ✍️ New blog post on on-policy distillation pitfalls, sharing the lessons learned during our training. Comments and discussion are welcome on the blog!
[2026-04] Paper released on arXiv: arXiv:2604.24005. Code and models are now public!

Introduction

On-policy distillation has emerged as a promising approach to transfer capabilities from large teacher models to smaller student agents. However, in multi-turn agent settings (e.g., ALFWorld, WebShop, ScienceWorld), standard distillation suffers from Trajectory-Level KL Instability: as the student explores longer interaction trajectories, compounding errors push the student's distribution far from the teacher's, making the supervision signal unreliable and causing performance collapse.

TCOD (Temporal Curriculum for On-Policy Distillation) addresses this with a simple but effective idea: instead of exposing the full trajectory to the student from the start, TCOD applies a temporal curriculum that progressively expands the trajectory depth during training, from short, stable prefixes to complete multi-turn rollouts. This keeps the student's distribution close enough to the teacher's for the supervision signal to stay reliable throughout training.

TCOD offers two complementary trajectory ordering strategies:

TCOD-b2f (Backward-to-Forward): starts distillation from the later steps of a trajectory, where the task outcome is clearer, and progressively extends supervision toward the beginning.
TCOD-f2b (Forward-to-Backward): starts from the early steps where the student is most on-distribution, and gradually extends to longer horizons.
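One way to picture the two orderings is as a window over turn indices that widens as training progresses. The sketch below is illustrative only: the function names, the linear integer schedule, and the choice to always supervise at least one turn are assumptions made for this example, not the logic used in the repository's workflows.

```python
def curriculum_depth(train_step: int, total_steps: int, max_env_steps: int) -> int:
    """Hypothetical linear schedule: how many turns are supervised at a training step."""
    if total_steps <= 0 or max_env_steps <= 0:
        raise ValueError("total_steps and max_env_steps must be positive")
    depth = (train_step * max_env_steps) // total_steps
    # Clamp to [1, max_env_steps]: at least one turn, at most the full horizon.
    return max(1, min(max_env_steps, depth))


def supervised_turns(depth: int, trajectory_len: int, order: str) -> list[int]:
    """Turn indices that receive the distillation signal under each ordering."""
    depth = min(depth, trajectory_len)
    if order == "f2b":  # forward-to-backward: window grows from the first turns
        return list(range(depth))
    if order == "b2f":  # backward-to-forward: window grows from the last turns
        return list(range(trajectory_len - depth, trajectory_len))
    raise ValueError(f"unknown order: {order}")
```

For example, with a depth of 3 on an 8-turn trajectory, `supervised_turns(3, 8, "f2b")` selects turns 0 to 2, while `supervised_turns(3, 8, "b2f")` selects turns 5 to 7. Both reach the full trajectory once the depth equals the trajectory length.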

Key results across three benchmarks:

An improvement of up to 18 points over standard on-policy distillation (OPD)
Significantly more stable KL divergence curves throughout training
Student agents that surpass their teachers on several tasks
Better generalization to tasks where the teacher itself fails

What Is Implemented

The following items are implemented in this repo and wired to runnable configs:

Multi-turn OPD workflows for all 3 environments
TCOD-b2f workflows for all 3 environments
TCOD-f2b workflows for all 3 environments
Distillation signal based on student vs teacher token-level logprobs
Example configs under TCOD_examples/*

Not included here: unimplemented TCOD ideas or extra variants not present in code/config.
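To make the token-level logprob signal listed above concrete, here is a minimal sketch of a per-token gap between student and teacher log-probabilities. The function names and the sign convention (student minus teacher) are assumptions for illustration; the repository's workflows may compute or scale the gap differently.

```python
def token_logprob_gap(student_logprobs: list[float], teacher_logprobs: list[float]) -> list[float]:
    """Per-token gap log p_student(token) - log p_teacher(token) on student-sampled tokens."""
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("logprob sequences must align token by token")
    return [s - t for s, t in zip(student_logprobs, teacher_logprobs)]


def mean_gap(student_logprobs: list[float], teacher_logprobs: list[float]) -> float:
    """Average gap over a sequence; returns 0.0 for an empty sequence."""
    gaps = token_logprob_gap(student_logprobs, teacher_logprobs)
    return sum(gaps) / len(gaps) if gaps else 0.0
```

When the tokens are sampled from the student, the average of this gap is a Monte Carlo estimate of the per-token reverse KL divergence between student and teacher, which is why a growing gap along a trajectory signals that the student has drifted away from the teacher.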

Repository Layout

opd_multi_turn/
├── TCOD_examples/
│   ├── alfworld/
│   │   ├── opd.yaml
│   │   ├── tcod_b2f.yaml
│   │   └── tcod_f2b.yaml
│   ├── webshop/
│   │   ├── opd.yaml
│   │   ├── tcod_b2f.yaml
│   │   └── tcod_f2b.yaml
│   └── scienceworld/
│       ├── opd.yaml
│       ├── tcod_b2f.yaml
│       └── tcod_f2b.yaml
└── trinity/common/workflows/envs/TCOD/
    ├── alfworld/
    ├── webshop/
    └── scienceworld/

Installation

1) Create environment

conda create -n opd-mt python=3.10
conda activate opd-mt

2) Install project

pip install -e ".[dev]"
pip install flash-attn==2.8.1 --no-build-isolation

If you are not using a GPU or flash-attn, adjust the installation to match your runtime environment.
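A quick way to confirm what the environment actually contains after installation is sketched below. The helper names are hypothetical; the check only looks for the `flash_attn` package and compares the interpreter version with the one used in the conda command above.

```python
import importlib.util
import sys


def flash_attn_available() -> bool:
    """True if the flash_attn package can be located (it is not imported here)."""
    return importlib.util.find_spec("flash_attn") is not None


def python_matches(major: int = 3, minor: int = 10) -> bool:
    """True if the running interpreter has the given major.minor version."""
    return sys.version_info[:2] == (major, minor)


if __name__ == "__main__":
    print("python 3.10:", python_matches())
    print("flash-attn installed:", flash_attn_available())
```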

Environment Setup

All example YAMLs use placeholder paths. You must update them first. At minimum, check these fields in the selected config:

model.model_path (student model)
explorer.auxiliary_models[0].model_path (teacher model)
buffer.explorer_input.taskset.path (train data)
buffer.explorer_input.eval_tasksets[*].path (eval data, if enabled)
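The fields above can be checked programmatically before launching a run. The following is a sketch, not a tool shipped with the repo: `get_by_path`, `unresolved_paths`, and the placeholder markers (`/your/`, `/path/to/`) are assumptions chosen for this example, and the eval taskset paths are left out because they are optional.

```python
import re
from typing import Any


def get_by_path(cfg: dict, dotted: str) -> Any:
    """Resolve a dotted path such as 'a.b[0].c' in nested dicts and lists."""
    node: Any = cfg
    for part in dotted.split("."):
        match = re.fullmatch(r"([^\[\]]+)(?:\[(\d+)\])?", part)
        if match is None:
            raise KeyError(f"unsupported path segment: {part}")
        node = node[match.group(1)]
        if match.group(2) is not None:
            node = node[int(match.group(2))]
    return node


REQUIRED_PATHS = [
    "model.model_path",
    "explorer.auxiliary_models[0].model_path",
    "buffer.explorer_input.taskset.path",
]


def unresolved_paths(cfg: dict, markers: tuple = ("/your/", "/path/to/")) -> list[str]:
    """Return required fields that are missing or still look like placeholders."""
    bad = []
    for dotted in REQUIRED_PATHS:
        try:
            value = get_by_path(cfg, dotted)
        except (KeyError, IndexError, TypeError):
            bad.append(dotted)
            continue
        # The markers are an assumed heuristic for "not yet edited" paths.
        if not isinstance(value, str) or not value or any(m in value for m in markers):
            bad.append(dotted)
    return bad
```

With PyYAML installed, the dict can come from `yaml.safe_load` applied to the selected config file; an empty return value means none of the required fields matched the placeholder heuristic.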

Environment-specific setup instructions are below.

ALFWorld

Step 1: Install alfworld

pip install alfworld

Step 2: Download data

# Option 1: Auto download to ~/.cache/alfworld/
alfworld-download

# Option 2: Specify download path
alfworld-download --data-dir ./alf-data

Step 3: Configure data path

Edit TCOD_examples/alfworld/get_alfworld_data.py:

# Modify to your actual data path
alfworld_data_root = "/your/local/path/alfworld/json_2.1.1"

Note: Keep json_2.1.1 at the end of the path.
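The note above can be expressed as a small guard. This is a hypothetical helper, not part of `get_alfworld_data.py`; it only checks that the final path component is `json_2.1.1` and does not verify that the directory exists.

```python
from pathlib import Path


def is_valid_alfworld_root(path: str) -> bool:
    """True if the final component of the path is json_2.1.1."""
    return Path(path).name == "json_2.1.1"
```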

Step 4: Process data

cd TCOD_examples/alfworld
python get_alfworld_data.py

Processed data will be saved to TCOD_examples/alfworld/alfworld_data/.

WebShop

Note: WebShop requires ~1TB memory. Skip if resources are limited.

Step 1: Clone WebShop repository

git clone https://github.com/princeton-nlp/webshop.git webshop
cd webshop

Step 2: Install Java 17+

# Using conda
conda install -c conda-forge openjdk=17

Step 3: Run setup script

# Small dataset (recommended for testing)
./setup.sh -d small

# Full dataset
./setup.sh -d all

Note: some Python dependencies may conflict during setup; if that happens, install them individually.

Step 4: Process data (run from the root of this repository, not from the webshop checkout)

cd TCOD_examples/webshop
python get_webshop_data.py

Step 5: Configure WebShop path

Option A: Set environment variable

export WEBSHOP_PATH=/path/to/webshop

Option B: Modify workflow files directly

Edit the path in every WebShop workflow file (trinity/common/workflows/envs/TCOD/webshop/*.py):

# Find this line and update the path
sys.path.append("/your/path/to/webshop")
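The two options can be combined so that the environment variable takes precedence and the hard-coded path is only a fallback. This is a sketch of that pattern under the assumption that a workflow file resolves the path at import time; the helper name `add_webshop_to_path` is hypothetical and the repository's workflow files may handle `WEBSHOP_PATH` differently.

```python
import os
import sys


def add_webshop_to_path(default: str = "/your/path/to/webshop") -> str:
    """Append the WebShop checkout to sys.path, preferring WEBSHOP_PATH if it is set."""
    webshop_path = os.environ.get("WEBSHOP_PATH", default)
    # Avoid duplicate entries when the helper is called more than once.
    if webshop_path not in sys.path:
        sys.path.append(webshop_path)
    return webshop_path
```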

ScienceWorld

Step 1: Clone and install ScienceWorld

git clone https://github.com/allenai/ScienceWorld.git
cd ScienceWorld
pip install .

Step 2: Configure jar path

Edit TCOD_examples/scienceworld/get_sciworld_data.py:

# Set the jar path to your ScienceWorld directory
jar_path = "/your/path/ScienceWorld/scienceworld/scienceworld.jar"

Step 3: Process data (run from the root of this repository, not from the ScienceWorld checkout)

cd TCOD_examples/scienceworld
python get_sciworld_data.py

Quick Start

1) Start Ray

ray start --head

2) Run one experiment

# ALFWorld - OPD
trinity run --config TCOD_examples/alfworld/opd.yaml

# ALFWorld - TCOD-b2f
trinity run --config TCOD_examples/alfworld/tcod_b2f.yaml

# ALFWorld - TCOD-f2b
trinity run --config TCOD_examples/alfworld/tcod_f2b.yaml

You can switch to webshop or scienceworld by replacing the config path.

Supported Experiment Matrix

| Environment | OPD | TCOD-b2f | TCOD-f2b |
|---|---|---|---|
| ALFWorld | `TCOD_examples/alfworld/opd.yaml` | `TCOD_examples/alfworld/tcod_b2f.yaml` | `TCOD_examples/alfworld/tcod_f2b.yaml` |
| WebShop | `TCOD_examples/webshop/opd.yaml` | `TCOD_examples/webshop/tcod_b2f.yaml` | `TCOD_examples/webshop/tcod_f2b.yaml` |
| ScienceWorld | `TCOD_examples/scienceworld/opd.yaml` | `TCOD_examples/scienceworld/tcod_b2f.yaml` | `TCOD_examples/scienceworld/tcod_f2b.yaml` |
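Because the matrix follows a regular naming pattern, the config path and launch command for any cell can be generated rather than typed. The helpers below are hypothetical conveniences and not part of the repo; the only things taken from this README are the path layout and the `trinity run --config` command.

```python
ENVIRONMENTS = ("alfworld", "webshop", "scienceworld")
METHODS = ("opd", "tcod_b2f", "tcod_f2b")


def config_path(env: str, method: str) -> str:
    """Path of the example config for one cell of the experiment matrix."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    return f"TCOD_examples/{env}/{method}.yaml"


def run_command(env: str, method: str) -> list[str]:
    """Argument list for launching one experiment."""
    return ["trinity", "run", "--config", config_path(env, method)]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root once Ray is running.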

Workflow Names in Config

Each YAML selects its workflow through the buffer.explorer_input.default_workflow_type field:

OPD:
- OPD_alfworld_workflow
- OPD_webshop_workflow
- OPD_scienceworld_workflow
TCOD-b2f:
- TCOD_b2f_alfworld_workflow
- TCOD_b2f_webshop_workflow
- TCOD_b2f_scienceworld_workflow
TCOD-f2b:
- TCOD_f2b_alfworld_workflow
- TCOD_f2b_webshop_workflow
- TCOD_f2b_scienceworld_workflow
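The names listed above follow the pattern `<METHOD>_<environment>_workflow`, so the workflow a config is expected to select can be derived from its file path. This sketch assumes each example YAML uses the workflow matching its directory and file name, which this README implies but does not state outright; `expected_workflow` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Maps the config file stem to the prefix used in the workflow names.
PREFIXES = {"opd": "OPD", "tcod_b2f": "TCOD_b2f", "tcod_f2b": "TCOD_f2b"}


def expected_workflow(config_path: str) -> str:
    """Derive the workflow name from a path like TCOD_examples/<env>/<method>.yaml."""
    path = PurePosixPath(config_path)
    env, method = path.parent.name, path.stem
    if method not in PREFIXES:
        raise ValueError(f"unknown method in config name: {method}")
    return f"{PREFIXES[method]}_{env}_workflow"
```

Comparing this value with `default_workflow_type` in a loaded config is a cheap way to catch a copied YAML whose workflow field was not updated.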

Key Config Notes

algorithm.advantage_fn should stay multi_turn_opd for these workflows.
rollout_args.logprobs must be set rather than left unset (e.g., 0) so that token-level log-probabilities are available for the distillation gap computation.
TCOD configs currently use workflow_args.checkpoint_strategy: linear.
Typical knobs you may tune:
- buffer.total_steps
- trainer.total_steps
- workflow_args.max_env_steps
- workflow_args.checkpoint_steps (TCOD)
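The notes above translate into a few checks that can run before training starts. The sketch takes the three sub-sections as separate dicts because their exact nesting inside the YAML is not spelled out here; `check_key_settings` and its messages are hypothetical. Note that `logprobs` is compared against `None` rather than tested for truthiness, since `0` is a valid setting.

```python
def check_key_settings(
    algorithm: dict,
    rollout_args: dict,
    workflow_args: dict,
    is_tcod: bool,
) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if algorithm.get("advantage_fn") != "multi_turn_opd":
        problems.append("algorithm.advantage_fn should be multi_turn_opd")
    # 0 is falsy but valid, so only a missing or None value is flagged.
    if rollout_args.get("logprobs") is None:
        problems.append("rollout_args.logprobs must be set (e.g., 0)")
    if is_tcod and workflow_args.get("checkpoint_strategy") != "linear":
        problems.append("workflow_args.checkpoint_strategy differs from the shipped value 'linear'")
    return problems
```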

Outputs

By default, experiments write checkpoints under:

checkpoint_root_dir (usually ./checkpoints)

Logging and monitoring are controlled by:

monitor.monitor_type (e.g., wandb)

Citation

@article{wang2026tcod,
  title   = {TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents},
  author  = {Jiaqi Wang and Wenhao Zhang and Weijie Shi and Yaliang Li and James Cheng},
  journal = {arXiv preprint arXiv:2604.24005},
  year    = {2026}
}
