GitHub - vishnutez/egspo-dllm-rl: Repository for official implementation of Entropy-guided Policy-gradient with Stepwise Advantages (EPSA) for RL fine-tuning of diffusion Large Language Models (dLLMs)

🚀 Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Entropy-Guided Stepwise Policy Optimization with Stepwise Advantages (EGSPO-SA) introduces a reinforcement learning framework for diffusion language models (dLLMs) 🤖. Unlike autoregressive LLMs, diffusion models generate sequences through an iterative denoising process, making standard sequence-level RL fine-tuning challenging.

We formulate the denoising trajectory as a finite-horizon Markov decision process 🔁 and derive a policy-gradient objective that decomposes across denoising steps. Our method focuses learning on the most informative steps and introduces a lightweight stepwise advantage estimator ⚡ for efficient training.

✨ Key Contributions

🧠 Diffusion-MDP formulation for RL fine-tuning of diffusion language models
📊 Entropy-guided step selection to identify the most informative denoising steps
⚡ EGSPO-SA, a lightweight stepwise advantage estimator that avoids separate value models
🏆 Strong empirical results on coding, logical reasoning, and mathematical reasoning benchmarks

🌈 Overview

⚙️ Step 1: Environment Setup

Clone the repository and install dependencies:

git clone https://github.com/vishnutez/egspo-dllm-rl.git
cd egspo-dllm-rl
conda env create -f environment.yml
conda activate egspo-env

🏋️ Training

Configure the required environment variables (e.g., WANDB_API_KEY, HF_HOME, etc.) in egspo/train.sh

We provide an multi-node sbatch script for running experiments on a cluster. The script can also be easily adapted to a standard .sh file if needed.

Run training with:

sbatch egspo/train.sh

Unless otherwise specified in the paper, the default parameters in epsa/train.sh correspond to the configurations used in our experiments.

For Dream-7B-Instruct, the train and trainer scripts are located in egspo/DREAM/.

📊 Evaluation

⚡ Step 1: Generate completions

Before running evaluation, update the required fields in:

eval/eval_checkpoints.sh

In particular, set the following variables:

CHECKPOINT_DIR — directory containing the trained model checkpoints to evaluate
OUTPUT_DIR — directory where generated completions will be saved
TASKS — evaluation task(s) (e.g., gsm8k, sudoku, etc.)
GEN_LENGTHS — generation lengths to evaluate
<YOUR_CONDA_ENV_NAME> — your conda environment name
<YOUR_HF_HOME_DIR> (optional) — Hugging Face cache directory (remove if using the default)

Then run:

bash eval/eval_checkpoints.sh

This step generates model completions for the test prompts using the selected checkpoints.
The script also parses predicted answers from model outputs and extracts ground-truth answers from the dataset, preparing them for evaluation.

📈 Step 2: Compute metrics

Modify the following fields in:

eval/get_and_save_metrics.py

task
checkpoint_dir
generated_lengths

Using the completions generated in the previous step, this script computes evaluation metrics by comparing predicted answers with ground-truth answers.
The results are then saved as .json files for each evaluated checkpoint.

🤗 Checkpoints

Task-specific checkpoints and usage instructions are available on HuggingFace:

🔗 fatemehdoudi97/egspo-llada-8b

🔗 fatemehdoudi97/egspo-dream-7b

🙏 Acknowledgement

Our implementation builds upon the codebase from the d1 paper: https://github.com/dllm-reasoning/d1/tree/main/diffu-grpo

We thank the authors for making their implementation publicly available, which helped facilitate this work.

📬 Contact

If you have any questions or concerns, feel free to contact us:

You can also open an issue in this repository.

📜 Citation

If you find our codebase or research useful, please consider citing our work:

@article{kunde2026Reinforcement,
  title={Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages},
  author={Kunde, Vishnu Teja and Doudi, Fatemeh and Farahbakhsh, Mahdi and Kalathil, Dileep and Narayanan, Krishna and Chamberland, Jean-Francois},
  journal={arXiv preprint arXiv:2603.12554},
  year={2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 80 Commits
assets		assets
dataset		dataset
egspo		egspo
eval		eval
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🚀 Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

✨ Key Contributions

🌈 Overview

⚙️ Step 1: Environment Setup

🏋️ Training

📊 Evaluation

⚡ Step 1: Generate completions

📈 Step 2: Compute metrics

🤗 Checkpoints

🙏 Acknowledgement

📬 Contact

📜 Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🚀 Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

✨ Key Contributions

🌈 Overview

⚙️ Step 1: Environment Setup

🏋️ Training

📊 Evaluation

⚡ Step 1: Generate completions

📈 Step 2: Compute metrics

🤗 Checkpoints

🙏 Acknowledgement

📬 Contact

📜 Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages