Skip to content

z-lab/dflash

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

45 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

DFlash: Block Diffusion for Flash Speculative Decoding

Paper | Blog | Models

DFlash is a lightweight block diffusion model designed for speculative decoding. It enables efficient and high-quality parallel drafting.

DFlash Architecture
DFlash_demo.mp4

πŸ“¦ Model Support Plan

βœ… Supported

🚧 Coming Soon

  • Qwen3.5-122B-A10B
  • GLM-4.7-Flash

πŸ’‘ Feel free to open a GitHub issue if you’d like to request support for additional models!
We will also open-source the training recipe soon, so you can train your own DFlash draft model to accelerate any LLM.


πŸš€ Quick Start

Installation

conda create -n dflash python=3.12
conda activate dflash

git clone https://github.com/z-lab/dflash.git
cd dflash

pip install uv
uv pip install -r requirements.txt

# Optionally install flash-attn.
# If unavailable, evaluation falls back to torch.sdpa in the Transformers backend.
# The measured speedup will be slower, but the acceptance length remains comparable.

# uv pip install flash-attn --no-build-isolation

SGLang

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_DFLASH_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3-Coder-30B-A3B-DFlash \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code

vLLM

Please install the latest vLLM from source.

vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768

Transformers

Only Qwen3 and LLaMA-3.1 models support Transformers backend.

from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

model = AutoModel.from_pretrained(
    "z-lab/Qwen3-8B-DFlash-b16", 
    trust_remote_code=True, 
    dtype="auto", 
    device_map="cuda:0"
).eval()

target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", 
    dtype="auto", 
    device_map="cuda:0"
).eval()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
prompt = "How many positive whole-number divisors does 196 have?"
messages = [
    {"role": "user", "content": prompt}
]
# Note: this draft model is used for thinking mode disabled
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generate_ids = model.spec_generate(
    input_ids=model_inputs["input_ids"], 
    max_new_tokens=2048, 
    temperature=0.0, 
    target=target, 
    stop_token_ids=[tokenizer.eos_token_id]
)

print(tokenizer.decode(generate_ids[0], skip_special_tokens=False))

πŸ“Š Evaluation

We provide scripts to reproduce the speedup and acceptance length metrics in the paper. The reported results were tested on NVIDIA H200 or B200 GPUs. Please note that only Qwen3 series and LLaMA-3.1 models support Transformers backend benchmark. For other models please use SGLang to run the benchmarks.

To run benchmark on Transformers backend:

bash run_benchmark.sh

To run benchmark on SGLang:

bash run_sglang_benchmark.sh

To run benchmark on vLLM, first launch the server, then run:

vllm bench serve \
  --backend openai-chat \
  --base-url http://127.0.0.1:8000 \
  --endpoint /v1/chat/completions \
  --dataset-name custom \
  --dataset-path ./humaneval.jsonl \
  --custom-output-len 4096 \
  --num-prompts 1024 \
  --max-concurrency 32 \
  --model Qwen/Qwen3.5-27B \
  --temperature 0.0

Acknowledgement

Huge thanks to @dcw02, @gongy, and the team at @modal-labs for their fast, high-quality support in bringing DFlash to SGLang. And huge thanks as well to @benchislett at NVIDIA for his work in bringing DFlash to vLLM and helping make it available to the broader serving community.

Citation

If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: DFlash Feedback.

@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}

About

DFlash: Block Diffusion for Flash Speculative Decoding

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors