This is the training and method repository for Multi-Block Diffusion Language Models (MBD-LMs). It defines the paradigm and contains the training-side assets needed to build MBD-LMs:
- Multi-block Teacher Forcing (MultiTF) training code and configs;
- dataset preparation and training setup guidelines;
- multi-node training launch scripts;
- checkpoint conversion utilities;
- the project page and method documentation.
Block Diffusion Language Models (BD-LMs) support KV caching and flexible-length generation, but native BD-LMs usually decode with Single-Block Diffusion (SingleBD): each forward pass refines one noisy block while later blocks wait for the current block to be completed and cached. This creates KV-cache storing bubbles and leaves inter-block parallelism underused.
MBD-LMs target Multi-Block Diffusion (MultiBD), where a bounded running-set of consecutive blocks is decoded concurrently. We introduce Multi-block Teacher Forcing (MultiTF) for train-inference alignment and a Block Buffer inference mechanism for efficient static-shape execution.
The roles are split intentionally across repositories. Use this repository for model-side work, and use Diffulex for inference and systems work:
| Repository / branch | Role |
|---|---|
| SJTU-DENG-Lab/mbd-lms | Training and method repository: MultiTF, training configs, dataset setup, checkpoint conversion, and paper/project documentation. |
| Diffulex mbd-lms | Experiment reproduction branch for running the reported MBD-LMs inference/evaluation setup. |
| Diffulex main | Active inference engine branch for runtime development, open-source contributions, and new dLLM decoding algorithms. |
Start from this repository when you are working on MultiTF training, data preparation, or checkpoint conversion:

    git clone https://github.com/SJTU-DENG-Lab/mbd-lms.git
    cd mbd-lms

Then follow the dataset preparation and training setup guides in this repository.

For the reported MBD-LMs inference/evaluation setup, use the Diffulex
mbd-lms branch:

    git clone https://github.com/SJTU-DENG-Lab/Diffulex.git
    cd Diffulex
    git checkout mbd-lms

Reproducibility note. The scores and throughput numbers reported in the
paper were produced with the Diffulex mbd-lms branch. Use this branch to
reproduce the paper tables, including the throughput table. The actively
optimized Diffulex main branch may produce different latency, TPS, or
benchmark numbers because the runtime has continued to change after the paper
experiments.
For new runtime features, open-source contributions, and new dLLM decoding
algorithms, use Diffulex main:

    git checkout main

- Multi-Block Diffusion formulation. We formulate MBD-LMs as BD-LMs that recover a bounded running-set of consecutive blocks conditioned on a clean cached prefix.
- MultiTF post-training. MultiTF trains BD-LMs on bounded noisy block groups with heterogeneous slot-wise mask ratios, matching practical MultiBD inference states.
- Block Buffer inference. A fixed-size Block Buffer preserves prefix-cache reuse, enables decode-store overlap, and keeps tensor shapes static for CUDA Graph-friendly execution.
- Improved parallelism and throughput. On math and code benchmarks, MBD-LLaDA2-Mini increases average TPF from 3.47 to 6.19 while improving average accuracy from 79.95% to 81.03%. With DMax, MBD-LLaDA2-Mini-DMax reaches 9.34 average TPF.

MultiTF post-trains BD-LMs with bounded noise-groups that approximate the running-set states seen during MultiBD inference. It combines systematic and random group layouts, applies a chain-uniform noise scheduler to create heterogeneous slot-wise mask ratios, and uses a Group-Aware Dual-Stream Mask to control visibility between noisy and clean blocks.
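To make "heterogeneous slot-wise mask ratios" concrete, the following is a minimal Python sketch, not the repository's scheduler. The names `sample_slot_ratios`, `mask_group`, and `MASK_ID` are hypothetical, and drawing independent uniform ratios and sorting them (so earlier slots are cleaner) is only an assumed stand-in for the chain-uniform noise scheduler, whose exact definition is not given here.

```python
import random

MASK_ID = -1  # hypothetical mask token id, for illustration only


def sample_slot_ratios(num_slots, rng):
    """Draw one mask ratio per block slot, sorted so earlier slots are less noisy.

    Assumption: sorted independent uniforms stand in for the chain-uniform
    scheduler; the real scheduler may be defined differently.
    """
    return sorted(rng.random() for _ in range(num_slots))


def mask_group(blocks, ratios, rng):
    """Mask each block independently at its own slot-wise ratio."""
    noisy = []
    for block, ratio in zip(blocks, ratios):
        num_masked = round(ratio * len(block))
        masked_positions = set(rng.sample(range(len(block)), num_masked))
        noisy.append([MASK_ID if i in masked_positions else tok
                      for i, tok in enumerate(block)])
    return noisy


rng = random.Random(0)
blocks = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
ratios = sample_slot_ratios(len(blocks), rng)
noisy_blocks = mask_group(blocks, ratios, rng)
```

Each block in the group ends up with its own noise level, which is the property the paragraph above describes. The visibility rules of the Group-Aware Dual-Stream Mask are not modeled in this sketch.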
Naive MultiBD has a dynamic running-set whose length changes during decoding, which is inefficient for static-shape execution. The Block Buffer mechanism instead maintains a fixed number of physical block slots. Future blocks enter by activating dummy slots, and completed blocks are committed into the KV cache.
Each slot follows the transition:

    dummy -> active -> to-cache -> in-cache

This design exposes inter-block parallelism while preserving the serving advantages of BD-LMs.
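The slot lifecycle can be sketched as a small state machine in Python. This is an illustrative sketch only: the `SlotState` and `BlockBuffer` classes and the `advance` method are hypothetical names, not the Diffulex API, and the sketch tracks only the state labels, not tensors or the KV cache.

```python
from enum import Enum


class SlotState(Enum):
    DUMMY = "dummy"
    ACTIVE = "active"
    TO_CACHE = "to-cache"
    IN_CACHE = "in-cache"


# Each state has exactly one successor; in-cache is terminal in this sketch.
NEXT_STATE = {
    SlotState.DUMMY: SlotState.ACTIVE,
    SlotState.ACTIVE: SlotState.TO_CACHE,
    SlotState.TO_CACHE: SlotState.IN_CACHE,
}


class BlockBuffer:
    """Fixed number of physical slots; the slot list never changes length."""

    def __init__(self, num_slots):
        self.slots = [SlotState.DUMMY] * num_slots

    def advance(self, index):
        """Move one slot to its next state and return the new state."""
        state = self.slots[index]
        if state not in NEXT_STATE:
            raise ValueError(f"slot {index} is already in-cache")
        self.slots[index] = NEXT_STATE[state]
        return self.slots[index]


buffer = BlockBuffer(num_slots=4)
buffer.advance(0)  # dummy -> active
buffer.advance(1)  # dummy -> active
buffer.advance(0)  # active -> to-cache
buffer.advance(0)  # to-cache -> in-cache
states = [s.value for s in buffer.slots]
```

Because the number of slots is fixed, the buffer shape stays constant while individual slots move through the lifecycle. How a slot is recycled after reaching in-cache is not specified above, so the sketch treats that state as terminal.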
We evaluate on GSM8K, MATH500, MBPP+, and HumanEval+. Accuracy is exact match for math and pass@1 for code. TPF denotes Tokens Per Forward pass, and AUP summarizes the accuracy-parallelism trade-off.
| Base Model | Native Avg. Acc. | Native Avg. TPF | MBD Avg. Acc. | MBD Avg. TPF | AUP: Native -> MBD |
|---|---|---|---|---|---|
| LLaDA2-Mini-DMax | 79.59 | 6.35 | 78.57 | 9.34 | 459.54 -> 661.28 |
| LLaDA2-Mini | 79.95 | 3.47 | 81.03 | 6.19 | 247.41 -> 449.18 |
| SDAR-8B-Chat-b32 | 69.00 | 2.54 | 69.74 | 4.46 | 141.64 -> 210.42 |
| SDAR-8B-Chat-b4 | 75.59 | 1.25 | 75.27 | 2.42 | 85.46 -> 148.65 |
MBD-LMs consistently improve decoding parallelism over native SingleBD. MultiTF also recovers or improves quality compared with training-free MultiBD in most settings, indicating that train-inference alignment is important for reliable MultiBD.
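As a quick check of the trade-off, the relative TPF gain and the accuracy change for each base model can be computed directly from the table. The short Python snippet below only restates the reported numbers; the variable names are illustrative.

```python
# (native acc, native TPF, MBD acc, MBD TPF), copied from the table above.
results = {
    "LLaDA2-Mini-DMax": (79.59, 6.35, 78.57, 9.34),
    "LLaDA2-Mini": (79.95, 3.47, 81.03, 6.19),
    "SDAR-8B-Chat-b32": (69.00, 2.54, 69.74, 4.46),
    "SDAR-8B-Chat-b4": (75.59, 1.25, 75.27, 2.42),
}

summary = {}
for model, (native_acc, native_tpf, mbd_acc, mbd_tpf) in results.items():
    tpf_gain = mbd_tpf / native_tpf    # multiplicative gain in parallelism
    acc_delta = mbd_acc - native_acc   # change in accuracy, in points
    summary[model] = (round(tpf_gain, 2), round(acc_delta, 2))
    print(f"{model}: TPF x{tpf_gain:.2f}, accuracy {acc_delta:+.2f} points")
```

On these numbers, TPF rises by about 1.47x to 1.94x across the four base models, while average accuracy changes by between -1.02 and +1.08 points.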
Throughput is measured for single-sample decoding on two H100 GPUs with tensor
parallelism degree 2. These are paper-reproduction numbers from the Diffulex
mbd-lms branch. For exact reproduction, use that branch rather than the
actively optimized Diffulex main branch; newer engine versions may differ
from the reported table.
| Model | Avg. TPF | Step Latency | Avg. TPS |
|---|---|---|---|
| LLaDA2-Mini | 3.47 | 7.07 ms | 517.16 |
| MBD-LLaDA2-Mini | 6.19 | 8.78 ms | 745.92 |
| LLaDA2-Mini-DMax | 6.35 | 9.02 ms | 779.49 |
| MBD-LLaDA2-Mini-DMax | 9.34 | 11.20 ms | 926.67 |
The larger Block Buffer increases per-step latency, but the gain in useful tokens committed per forward pass leads to higher realized throughput.
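The arithmetic behind this trade-off can be worked through for the LLaDA2-Mini pair using the table values. The Python sketch below assumes, as a rough approximation, that throughput scales like TPF divided by step latency; the reported TPS is an average and need not match this estimate exactly.

```python
# (avg TPF, step latency in ms, avg TPS), copied from the throughput table.
native = (3.47, 7.07, 517.16)  # LLaDA2-Mini
mbd = (6.19, 8.78, 745.92)     # MBD-LLaDA2-Mini

tpf_gain = mbd[0] / native[0]      # more tokens committed per forward pass
latency_cost = mbd[1] / native[1]  # each forward pass takes longer
tps_gain = mbd[2] / native[2]      # reported tokens-per-second gain

# Rough estimate (assumption): throughput ~ TPF / step latency.
estimated_tps_gain = tpf_gain / latency_cost

print(f"TPF gain: x{tpf_gain:.2f}")
print(f"Step latency cost: x{latency_cost:.2f}")
print(f"Reported TPS gain: x{tps_gain:.2f}")
print(f"Estimated TPS gain: x{estimated_tps_gain:.2f}")
```

For this pair, a roughly 1.78x TPF gain outweighs a roughly 1.24x increase in step latency, and both the estimate and the reported TPS give a gain of about 1.44x.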
This repository is released under the MIT License.