This is the official code repository for the paper Self-Steering Language Models by Gabriel Grand, Joshua B. Tenenbaum, Vikash K. Mansinghka, Alexander K. Lew, and Jacob Andreas, presented at the Second Conference on Language Modeling (COLM 2025).
Self-steering is built on top of LLaMPPL, a probabilistic programming library for language models. We provide a condensed tutorial on LLaMPPL as part of the system prompt for our framework. Readers are encouraged to check out the LLaMPPL documentation for more details.
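To give a flavor of what a LLaMPPL program looks like, here is a minimal sketch adapted from the LLaMPPL (`hfppl`) documentation: a model that generates one token per step and conditions each token on a simple constraint. Names like `CachedCausalLM`, `LMContext`, and `smc_standard` follow the `hfppl` package and may differ across versions; this is an illustration, not code from this repository.

```python
import asyncio
from hfppl import CachedCausalLM, LMContext, Model, smc_standard

class NoLetterModel(Model):
    """Generate a continuation that avoids a forbidden letter."""

    def __init__(self, lm, prompt, forbidden_letter):
        super().__init__()
        self.context = LMContext(lm, prompt)  # stateful LM context seeded with the prompt
        self.eos_token = lm.tokenizer.eos_token_id
        self.forbidden = set(i for i, v in enumerate(lm.vocab) if forbidden_letter in v)

    async def step(self):
        # Condition on the next token not containing the forbidden letter.
        await self.observe(self.context.mask_dist(self.forbidden), False)
        # Sample the next token; this extends self.context and reweights the particle.
        token = await self.sample(self.context.next_token())
        if token.token_id == self.eos_token or str(token) in {".", "!", "?"}:
            self.finish()

lm = CachedCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
particles = asyncio.run(smc_standard(NoLetterModel(lm, "The weather today is", "e"), 8))
```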
To get started, clone the repository:

```bash
git clone git@github.com:gabegrand/self-steering.git
```

This codebase uses Poetry to manage dependencies. If you don't have Poetry installed, follow the official installation instructions.

```bash
cd self-steering
poetry install
```

> [!NOTE]
> If you also want to install optional development dependencies (e.g., for running unit tests, code formatting, and plotting), you can do so with:
>
> ```bash
> poetry install --with dev
> ```

Once the installation is complete, you can activate the virtual environment:
```bash
# Default with Poetry v2.0
eval $(poetry env activate)

# Alternative with Poetry v1.x or with the poetry-shell plugin
poetry shell
```

For convenience, we also provide a `[build-system]` section in `pyproject.toml`, so you can install the package with pip. We recommend using a virtual environment (e.g., via `venv` or conda) to avoid dependency conflicts.
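For example, using `venv` (the environment name `.venv` is just a convention):

```bash
# Create and activate a virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate
```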
```bash
cd self-steering
pip install -e .
```

To run experiments on the COLLIE benchmark, you must first install the COLLIE submodule:

```bash
git submodule update --init -- evaluations/collie_eval/Collie
```

> [!NOTE]
> When you install the COLLIE submodule with the above command, the COLLIE-v1 dataset is automatically downloaded. Alternatively, you can install `collie-bench` from PyPI and manually download `data/all_data.dill`.
Additionally, inside a Python shell, download `punkt_tab` (required by the COLLIE evaluation methods):

```python
import nltk
nltk.download('punkt_tab')
```

To check that everything is working correctly, you can run the unit tests (requires the dev dependencies):
```bash
pytest tests
```

- Pipelines map to experiment conditions in the paper. Each pipeline is defined in `disciple/pipelines.py` and configured via JSON files in `disciple/configs`.
- The Planner (paper) corresponds to the `model_generator` in `disciple/model_generators.py`.
- The Follower is executed inside pipeline `infer()` calls (e.g., `DisciplePipeline`, `VLLMPipeline`, `HuggingFacePipeline`).
- The main entry point is `evaluations/run_benchmark.py`. Use `--pipelines` to select one or more configs (filenames without `.json`).
- DisCIPL (ours)
  - `disciple` → DisCIPL-SMC
  - `disciple_importance` → DisCIPL-IS (no resampling)
  - `disciple_rejection` → DisCIPL-RS (single-step generation)
- DisCIPL (oracle variants)
  - `disciple_oracle_smc`, `disciple_oracle_importance`, `disciple_oracle_rejection` (Planner can peek at expert programs for the target task)
- Planner-only (OpenAI)
  - `gpt_4o_mini`, `gpt_4o`, `gpt_4o_cot`, `o1`
- Follower-only (local/inference backends)
  - `vllm`, `vllm_beam_search` (uses vLLM)
  - `huggingface`, `huggingface_beam_search` (uses HF Transformers)
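For example, a minimal smoke-test invocation with one of these configs might look like the following (illustrative only; depending on your environment you may need additional flags such as `--device-id`, and the exact commands used for the paper's experiments are given below):

```bash
python evaluations/run_benchmark.py --pipelines disciple --dataset collie --task-types sent_01
```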
- Required keys in `.env` (or your environment); see the example below:
  - `OPENAI_API_KEY=...` (for OpenAI models)
  - `HF_TOKEN=...` (for gated HuggingFace models like Llama-3.x)
- Backend selection: `--backend vllm` (default) or `--backend hf`
- GPU selection: `--device-id 0` sets `CUDA_VISIBLE_DEVICES=0`
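A minimal `.env` might look like this (the values shown are placeholders):

```bash
OPENAI_API_KEY=sk-...
HF_TOKEN=hf_...
```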
For convenience, you can export the pipeline sets as environment variables; each set is given on its own line with a short description:

```bash
# DisCIPL + baselines
export PIPELINES="disciple disciple_importance disciple_rejection vllm vllm_beam_search gpt_4o_mini gpt_4o o1"

# DisCIPL oracle variants (Planner gets expert programs)
export ORACLE_PIPELINES="disciple_oracle_smc disciple_oracle_importance disciple_oracle_rejection"
```

Use `--pipelines` with the config names above (no `.json`). Override models as needed with `--openai-model` and `--huggingface-model`.
COLLIE sentence-level tasks (`sent_01`–`sent_04`):

```bash
python evaluations/run_benchmark.py \
    --pipelines ${PIPELINES} \
    --device-id 0 \
    --particles 1 2 4 8 16 32 \
    --max-tokens 32 \
    --timeout 120 \
    --dataset collie \
    --task-types sent_01 sent_02 sent_03 sent_04 \
    --openai-model "gpt-4o-2024-08-06" \
    --huggingface-model "meta-llama/Llama-3.2-1B-Instruct"
```

COLLIE paragraph-level tasks (`para_01`–`para_05`):

```bash
python evaluations/run_benchmark.py \
    --pipelines ${PIPELINES} \
    --device-id 0 \
    --particles 1 2 4 8 \
    --max-tokens 128 \
    --timeout 240 \
    --dataset collie \
    --task-types para_01 para_02 para_03 para_04 para_05 \
    --openai-model "gpt-4o-2024-08-06" \
    --huggingface-model "meta-llama/Llama-3.2-1B-Instruct"
```

Puzzles:

```bash
python evaluations/run_benchmark.py \
    --pipelines ${PIPELINES} \
    --device-id 0 \
    --particles 1 2 4 8 16 \
    --max-tokens 512 \
    --timeout 360 \
    --dataset puzzle \
    --repeat 10 \
    --openai-model "gpt-4o-2024-08-06" \
    --huggingface-model "meta-llama/Llama-3.2-1B-Instruct"
```

- DisCIPL-SMC (and beam search pipelines) are Adaptive: all N values in `--particles` are executed (e.g., 1+2+4+...); see the sketch below.
- Importance/Rejection and plain sampling pipelines are Independent: only `max(--particles)` samples are drawn.
- vLLM beam search uses N as the beam width.
- Resume a partially completed run: `--resume-from-experiment /path/to/old/results`
- Results directory: `results/<timestamp>/` (see the snippet below for inspecting results)
  - `results.json`: flattened list of `InferenceResult` entries
  - `pipelines/<pipeline_name>/task_XXXX/completion.json`: raw Planner or model completions (when available)
- Model generation (Planner) options:
  - `--n-attempts`: number of retries to generate the LLaMPPL program (Planner). In DisCIPL pipelines, feedback from failures is automatically included when `n_attempts > 1`.
  - `--cache-behavior`: controls program reuse in the Planner. See `disciple/model_generators.py`.
    - `optional` (default): use the cached program for a task/version if present; otherwise generate.
    - `require`: load the cached program; error if missing.
    - `latest_read_only`: always load the latest cached version; do not generate.
    - `force`: always regenerate a new program (ignores cache).
  - `--load-models-from-path /path/to/prev/results`: reuse programs from a prior experiment. The pipeline will look under that path at `pipelines/<pipeline_name>/model_generator/`.
  - `--debug-mode`: generate a minimal debug model artifact instead of calling the Planner (for development).
  - `--openai-model`: override the Planner model (e.g., `gpt-4o-2024-08-06`).
- Inference (Follower) options:
  - `--huggingface-model`: Follower model ID (e.g., `meta-llama/Llama-3.2-1B-Instruct`).
  - `--backend {vllm,hf}`: inference backend for the Follower; defaults to `vllm`.
  - `--device-id`: GPU index. Sets `CUDA_VISIBLE_DEVICES`.
  - `--particles`: values of N to evaluate. Adaptive pipelines run all values; independent pipelines run `max(N)`.
  - `--ess-threshold`: resampling threshold for SMC (default `0.5`).
  - `--max-tokens`, `--temperature`, `--timeout`: decoding and runtime limits.
- Dataset control:
  - `--task-types`: restrict to specific tasks (e.g., `sent_01 sent_02`).
  - `--n-examples-per-task-type`: subsample per type (with replacement) for quick tests.
  - `--repeat`: duplicate each task (used in Puzzles).
  - `--seed`: RNG seed.
See all flags in `evaluations/arg_parser.py`.
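As a quick way to inspect a completed run, the snippet below loads `results.json` into a table. This is a sketch under the assumption that the file is a JSON list of record-like entries; the exact fields of `InferenceResult` may differ, so treat the saved files as the definitive schema. The timestamped path is hypothetical.

```python
import json
import pandas as pd  # optional, used here only for tabular inspection

# Substitute your own results/<timestamp>/ directory.
with open("results/2025-01-01T00-00-00/results.json") as f:
    results = json.load(f)

df = pd.DataFrame(results)
print(df.columns.tolist())
print(df.head())
```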
- DisCIPL-generated programs subclass `BaseModel` in `disciple/base_models.py`.
- Hand-written example models are provided for reference:
  - COLLIE: `evaluations/collie_eval/models.py`
  - Puzzles: `evaluations/puzzle/models.py`
All plots and tables were produced in `evaluations/analysis_new.ipynb`. Point the notebook at one or more `results/<timestamp>/` directories to reproduce the figures.
```bibtex
@inproceedings{
grand2025selfsteering,
title={Self-Steering Language Models},
author={Gabriel Grand and Joshua B. Tenenbaum and Vikash K. Mansinghka and Alexander K. Lew and Jacob Andreas},
booktitle={Second Conference on Language Modeling},
year={2025},
url={https://openreview.net/forum?id=XvCBtm5PgF}
}
```