🚀 Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Entropy-Guided Stepwise Policy Optimization with Stepwise Advantages (EGSPO-SA) introduces a reinforcement learning framework for diffusion language models (dLLMs) 🤖. Unlike autoregressive LLMs, diffusion models generate sequences through an iterative denoising process, making standard sequence-level RL fine-tuning challenging.
We formulate the denoising trajectory as a finite-horizon Markov decision process 🔁 and derive a policy-gradient objective that decomposes across denoising steps. Our method focuses learning on the most informative steps and introduces a lightweight stepwise advantage estimator ⚡ for efficient training.
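For readers who want the decomposition spelled out, a generic policy-gradient form for a finite-horizon MDP is sketched below. The notation is an assumption made for illustration and is not taken from the paper: `x_T` is the fully masked sequence, `x_0` the final output, `q` the prompt, `\pi_\theta(x_{t-1} \mid x_t, q)` the denoising transition, and `A_t` a per-step advantage.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{x_{T:0} \sim \pi_\theta}\!\left[
      \sum_{t=1}^{T} A_t \,
      \nabla_\theta \log \pi_\theta\!\left(x_{t-1} \mid x_t, q\right)
    \right]
```

In this form, each denoising step contributes its own term to the gradient, so an estimator could in principle restrict the sum to a subset of steps. The exact objective and advantage definition used by EGSPO-SA are given in the paper.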
- 🧠 Diffusion-MDP formulation for RL fine-tuning of diffusion language models
- 📊 Entropy-guided step selection to identify the most informative denoising steps
- ⚡ EGSPO-SA, a lightweight stepwise advantage estimator that avoids separate value models
- 🏆 Strong empirical results on coding, logical reasoning, and mathematical reasoning benchmarks
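As a rough illustration of the entropy-guided step selection idea, the sketch below ranks denoising steps by mean token-level entropy and keeps the top `k`. This is not the repository's implementation: the function names, the use of NumPy, averaging over all positions (rather than only masked ones), and the top-`k` rule are all assumptions made for the example.

```python
import numpy as np

def step_entropies(logits):
    """Mean token-level entropy for each denoising step.

    logits: array of shape (num_steps, seq_len, vocab_size).
    Returns an array of shape (num_steps,).
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    token_entropy = -(probs * log_probs).sum(axis=-1)  # (num_steps, seq_len)
    return token_entropy.mean(axis=-1)

def select_steps(logits, k):
    """Indices of the k highest-entropy steps, in ascending step order.

    Hypothetical selection rule: the actual criterion may differ.
    """
    entropies = step_entropies(logits)
    top_k = np.argsort(entropies)[-k:]
    return np.sort(top_k)

# Toy trajectory: steps 0 and 2 are uniform (high entropy),
# steps 1 and 3 are sharply peaked (low entropy).
logits = np.zeros((4, 3, 5))
logits[1, :, 0] = 10.0
logits[3, :, 0] = 10.0
print(select_steps(logits, k=2))  # steps 0 and 2 are selected
```

Under these assumptions, only the selected steps would contribute gradient terms, which is what keeps the per-update cost lower than training on every denoising step.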
Clone the repository and install dependencies:
git clone https://github.com/vishnutez/egspo-dllm-rl.git
cd egspo-dllm-rl
conda env create -f environment.yml
conda activate egspo-env
Configure the required environment variables (e.g., WANDB_API_KEY, HF_HOME) in egspo/train.sh.
We provide a multi-node sbatch script for running experiments on a cluster. It can be adapted to a standard .sh script if needed.
Run training with:
sbatch egspo/train.sh
Unless otherwise specified in the paper, the default parameters in egspo/train.sh correspond to the configurations used in our experiments.
For Dream-7B-Instruct, the train and trainer scripts are located in egspo/DREAM/.
Before running evaluation, update the required fields in:
eval/eval_checkpoints.sh
In particular, set the following variables:
- CHECKPOINT_DIR: directory containing the trained model checkpoints to evaluate
- OUTPUT_DIR: directory where generated completions will be saved
- TASKS: evaluation task(s) (e.g., gsm8k, sudoku)
- GEN_LENGTHS: generation lengths to evaluate
- <YOUR_CONDA_ENV_NAME>: your conda environment name
- <YOUR_HF_HOME_DIR> (optional): Hugging Face cache directory (remove if using the default)
Then run:
bash eval/eval_checkpoints.sh
This step generates model completions for the test prompts using the selected checkpoints.
The script also parses predicted answers from model outputs and extracts ground-truth answers from the dataset, preparing them for evaluation.
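To give a sense of what this parsing step involves, here is a minimal sketch for a GSM8K-style task. It is not the repository's parser: the function names are hypothetical, and the assumption that predictions appear in a LaTeX boxed expression (with a fallback to the last number in the text) is made for illustration only. The `####` delimiter is the convention used by GSM8K reference solutions.

```python
import re

def extract_prediction(completion):
    """Return the last boxed answer if present, else the last number, else None."""
    # Assumed output convention: the final answer is wrapped in \boxed{...}.
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if boxed:
        return boxed[-1].strip()
    # Fallback: take the last number that appears in the completion.
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return numbers[-1].replace(",", "") if numbers else None

def extract_gsm8k_truth(answer_field):
    """GSM8K reference solutions end with '#### <final answer>'."""
    return answer_field.split("####")[-1].strip().replace(",", "")

print(extract_prediction(r"So the total is \boxed{42}."))
print(extract_gsm8k_truth("She has 3 + 4 = 7 apples.\n#### 7"))
```

Other tasks (e.g., sudoku) would need their own extraction rules; check the evaluation script for the formats it actually expects.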
Modify the following fields in:
eval/get_and_save_metrics.py
- task
- checkpoint_dir
- generated_lengths
Using the completions generated in the previous step, this script computes evaluation metrics by comparing predicted answers with ground-truth answers.
The results are then saved as .json files for each evaluated checkpoint.
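The metric computation can be pictured with the sketch below, which scores predictions by exact string match and writes one JSON file per checkpoint. The function names, the JSON keys, and the exact-match criterion are assumptions for illustration; the metrics the script actually computes may differ by task.

```python
import json
import os
import tempfile

def compute_accuracy(predictions, references):
    """Exact-match accuracy between predicted and ground-truth answer strings."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # A prediction of None (nothing could be parsed) counts as incorrect.
    correct = sum(
        p is not None and p.strip() == r.strip()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

def save_metrics(path, task, checkpoint, accuracy):
    """Write one checkpoint's metrics to a JSON file (hypothetical schema)."""
    with open(path, "w") as f:
        json.dump({"task": task, "checkpoint": checkpoint, "accuracy": accuracy}, f, indent=2)

# Example with made-up values, written to a temporary directory.
acc = compute_accuracy(["7", None, "12"], ["7", "3", "13"])
out_path = os.path.join(tempfile.mkdtemp(), "metrics.json")
save_metrics(out_path, task="gsm8k", checkpoint="checkpoint-example", accuracy=acc)
```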
Task-specific checkpoints and usage instructions are available on Hugging Face:
🔗 fatemehdoudi97/egspo-llada-8b
🔗 fatemehdoudi97/egspo-dream-7b
Our implementation builds upon the codebase from the d1 paper: https://github.com/dllm-reasoning/d1/tree/main/diffu-grpo
We thank the authors for making their implementation publicly available, which helped facilitate this work.
If you have any questions or concerns, feel free to contact us or open an issue in this repository.
If you find our codebase or research useful, please consider citing our work:
@article{kunde2026Reinforcement,
title={Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages},
author={Kunde, Vishnu Teja and Doudi, Fatemeh and Farahbakhsh, Mahdi and Kalathil, Dileep and Narayanan, Krishna and Chamberland, Jean-Francois},
journal={arXiv preprint arXiv:2603.12554},
year={2026}
}