Diffusion-State Policy Optimization for Masked Diffusion Language Models

Daisuke Oba¹, Hiroki Furuta, Naoaki Okazaki¹
¹Institute of Science Tokyo
Pre-print 2026

Abstract

Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently improves terminal-feedback baselines, including diffu-GRPO and SPG, on math and planning benchmarks under matched rollout compute and optimizer steps, supporting its use as a general plug-in for masked diffusion policy optimization.
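
The branching step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation or their estimator. It only shows the general shape of the idea: take one intermediate masked state, fill its masked positions several times from cached per-position logits, score each completion, and build a REINFORCE-style surrogate loss over the newly filled tokens only. The `MASK` sentinel, the function names, the `reward_fn` argument, and the mean-reward baseline are all assumptions made for illustration.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a still-masked position


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def branch_from_state(state, cached_logits, num_branches, reward_fn, rng):
    """Fill all masked positions of `state` in one shot, `num_branches` times.

    state:         list of token ids, with MASK at unfilled positions
    cached_logits: per-position logits saved during the original rollout
    reward_fn:     maps a full completion to a scalar score (assumed given)
    Returns the masked positions, the branches, and a surrogate loss.
    """
    masked = [i for i, tok in enumerate(state) if tok == MASK]
    probs = {i: softmax(cached_logits[i]) for i in masked}

    branches = []
    for _ in range(num_branches):
        completion = list(state)
        logp = 0.0  # log-probability of the newly filled tokens only
        for i in masked:
            tok = rng.choices(range(len(probs[i])), weights=probs[i])[0]
            completion[i] = tok
            logp += math.log(probs[i][tok])
        branches.append((completion, logp, reward_fn(completion)))

    # Assumption: a simple mean-reward baseline across branches of this state.
    baseline = sum(r for _, _, r in branches) / num_branches
    # Tokens that were already unmasked in `state` contribute nothing.
    loss = -sum((r - baseline) * logp for _, logp, r in branches) / num_branches
    return masked, branches, loss


if __name__ == "__main__":
    rng = random.Random(0)
    state = [3, MASK, 1, MASK]
    logits = [[0.0, 0.0, 0.0, 0.0] for _ in state]  # stand-in cached logits
    reward = lambda c: float(c[1] == c[3])          # toy reward
    masked, branches, loss = branch_from_state(state, logits, 4, reward, rng)
    print(masked, len(branches), loss)
```

In a real training loop `logp` would be a differentiable quantity computed by the model rather than a plain float. The sketch fixes only the bookkeeping: which positions are resampled and which tokens enter the loss.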

DiSPO: Diffusion-State Policy Optimization

Algorithm

Experiments

BibTeX

@article{oba2026dispo,
  title = {Diffusion-State Policy Optimization for Masked Diffusion Language Models},
  author = {Oba, Daisuke and Furuta, Hiroki and Okazaki, Naoaki},
  journal = {arXiv preprint arXiv:2602.06462},
  year = {2026},
  url = {https://arxiv.org/abs/2602.06462}
}