- [2026-05-30] 🔥🔥 Domino training code released! The training implementation is now available in SpecForge via sgl-project/SpecForge#571.
- [2026-05-29] 🔥 Domino paper released! Read the paper on arXiv.

| Target model | Draft model |
|---|---|
| Qwen/Qwen3-4B | Huang2020/Qwen3-4B-Domino-b16 |
| Qwen/Qwen3-8B | Huang2020/Qwen3-8B-Domino-b16 |

Use Python 3.10 or newer on a CUDA GPU machine. Install a PyTorch build that matches your CUDA driver, then install the remaining Hugging Face benchmark dependencies:
python -m pip install --upgrade pip
python -m pip install -r requirements-hf.txt

For the SGLang benchmark, install the extra build tools first. On Ubuntu:
sudo apt-get update
sudo apt-get install -y build-essential ninja-build protobuf-compiler

The SGLang branch also builds a Rust component. Install Rust if cargo is not already available:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"

Then install the Domino-compatible SGLang branch in the same Python environment:
git clone --branch sglang-feat/dflash-domino https://github.com/jianuo-huang/Domino.git sglang-domino
cd sglang-domino
python -m pip install -e ./python
python -m pip install --force-reinstall --no-deps sglang-kernel \
    --index-url https://docs.sglang.ai/whl/cu130/
cd -

This SGLang branch currently resolves to PyTorch 2.11 wheels built for CUDA 13. Use the matching SGLang kernel wheel above, and verify that your NVIDIA driver is new enough for the CUDA 13 runtime libraries.
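One way to confirm which CUDA build your PyTorch install targets is a short Python check. This is a minimal sketch, not part of the repository: the helper names `describe_torch_cuda` and `matches_cuda_major` are hypothetical, and the check only reports the CUDA version PyTorch was built against (`torch.version.cuda`). It does not inspect the NVIDIA driver version itself.

```python
def describe_torch_cuda():
    """Return a small report about the installed PyTorch build, if any."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_build": None,
        "cuda_available": False,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return report
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds.
    report["cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    return report


def matches_cuda_major(report, expected_major):
    """True if the PyTorch build targets the expected CUDA major version."""
    cuda_build = report["cuda_build"]
    if cuda_build is None:
        return False
    return cuda_build.split(".")[0] == str(expected_major)


report = describe_torch_cuda()
print(report)
print("CUDA 13 build:", matches_cuda_major(report, 13))
```

For the default branch you would expect the CUDA 13 check to pass; for the CUDA 12.8 path, call `matches_cuda_major(report, 12)` instead.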
For CUDA 12.8 / PyTorch 2.9, patch the SGLang dependency pins before installing:
git clone --branch sglang-feat/dflash-domino https://github.com/jianuo-huang/Domino.git sglang-domino
cd sglang-domino
python -m pip install --upgrade pip
sed -i \
    -e 's/"torch==2.11.0"/"torch==2.9.1+cu128"/' \
    -e 's/"torchaudio==2.11.0"/"torchaudio==2.9.1+cu128"/' \
    -e 's/"torchvision"/"torchvision==0.24.1+cu128"/' \
    -e 's/"kernels"/"kernels==0.14.1"/' \
    -e '/"sglang-kernel==0.4.2"/d' \
    python/pyproject.toml
python -m pip install \
    --extra-index-url https://download.pytorch.org/whl/cu128 \
    -e ./python
python -m pip install --force-reinstall --no-deps "${SGLANG_KERNEL_CU12_WHEEL}"
cd -

Set SGLANG_KERNEL_CU12_WHEEL to a CUDA-12-compatible sglang-kernel wheel before running the last pip command. Do not install the cu130 wheel in a PyTorch 2.9/cu128 environment.
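To see what the sed expressions above do to the dependency pins, the same rewrites can be expressed in Python. This is an illustrative sketch only: `patch_pins` is a hypothetical helper, the sample text is a made-up fragment rather than the real `python/pyproject.toml`, and it uses literal string replacement where sed treats `.` as a regex wildcard.

```python
# (old pin, new pin) pairs copied from the sed substitutions.
PIN_REWRITES = [
    ('"torch==2.11.0"', '"torch==2.9.1+cu128"'),
    ('"torchaudio==2.11.0"', '"torchaudio==2.9.1+cu128"'),
    ('"torchvision"', '"torchvision==0.24.1+cu128"'),
    ('"kernels"', '"kernels==0.14.1"'),
]
# Lines containing this pin are deleted, as with the sed /.../d expression.
DROP_MARKER = '"sglang-kernel==0.4.2"'


def patch_pins(text):
    """Apply the CUDA 12.8 pin rewrites to pyproject-style text."""
    kept = []
    for line in text.splitlines(keepends=True):
        if DROP_MARKER in line:
            continue
        for old, new in PIN_REWRITES:
            # sed's s/// without the g flag replaces the first match per line.
            line = line.replace(old, new, 1)
        kept.append(line)
    return "".join(kept)


# Hypothetical dependency fragment, for illustration only.
sample = (
    '  "torch==2.11.0",\n'
    '  "torchvision",\n'
    '  "sglang-kernel==0.4.2",\n'
    '  "kernels",\n'
)
print(patch_pins(sample))
```

Because the rewritten pins no longer match the original patterns, applying the function a second time leaves the text unchanged.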
Domino draft checkpoints provide spec_generate for direct speculative decoding with a target model. We currently recommend running this path on one GPU.
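For intuition about what greedy (temperature 0) speculative decoding does in general, the draft-then-verify step can be sketched with toy functions. This is generic background, not Domino's algorithm or the internals of spec_generate: `verify_draft_block` and `toy_target` are hypothetical names, and a real system verifies the whole block in one target forward pass rather than calling the target once per position.

```python
def verify_draft_block(prefix, draft_block, target_next):
    """Accept drafted tokens while they match the target's greedy choice.

    Returns the tokens to append: the matching prefix of the draft, plus one
    token from the target (a correction on mismatch, or a bonus token when
    the whole block is accepted).
    """
    context = list(prefix)
    accepted = []
    for token in draft_block:
        expected = target_next(context)
        if token != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    # Every drafted token matched, so the target adds one more token.
    accepted.append(target_next(context))
    return accepted


def toy_target(context):
    """Toy 'target model': the next token is the last token plus one, mod 10."""
    return (context[-1] + 1) % 10


print(verify_draft_block([0], [1, 2, 9, 4], toy_target))  # [1, 2, 3]
print(verify_draft_block([0], [1, 2, 3, 4], toy_target))  # [1, 2, 3, 4, 5]
```

The output always equals what the target alone would have produced greedily; the draft only changes how many tokens are committed per verification step. The Hugging Face example that follows calls spec_generate on the real checkpoints instead of this toy loop.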
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Load the Domino draft model; its custom code provides spec_generate.
draft_model = AutoModel.from_pretrained(
    "Huang2020/Qwen3-8B-Domino-b16",
    trust_remote_code=True,
    dtype="auto",
    device_map="cuda:0",
).eval()

# Load the target model on the same GPU.
target_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    dtype="auto",
    device_map="cuda:0",
).eval()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

prompt = "How many positive whole-number divisors does 196 have?"
messages = [{"role": "user", "content": prompt}]

# The Domino draft model is trained for Qwen3 with thinking mode disabled.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
model_inputs = tokenizer([text], return_tensors="pt").to(draft_model.device)

output_ids = draft_model.spec_generate(
    input_ids=model_inputs["input_ids"],
    target=target_model,
    max_new_tokens=2048,
    temperature=0.0,
    stop_token_ids=[tokenizer.eos_token_id],
)

# Drop the prompt tokens so only the newly generated text is decoded.
generated_ids = output_ids[:, model_inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

Run the Hugging Face benchmark:

DRAFT_MODEL=Huang2020/Qwen3-8B-Domino-b16 \
TARGET_MODEL=Qwen/Qwen3-8B \
PYTHON=python \
./run_hf_benchmark.sh

Defaults:

- TASKS=gsm8k:128
- MAX_NEW_TOKENS=2048
- TEMPERATURE=0.0
- BLOCK_SIZE=16
- NUM_GPUS=8

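The TASKS value reads as a comma-separated list of `name:count` pairs, where the count is the number of samples for that task. The sketch below shows one way such a string could be parsed; `parse_tasks` is a hypothetical helper written for illustration, and the benchmark script's own parsing may differ.

```python
def parse_tasks(spec):
    """Parse 'name:count,name:count' into a list of (name, count) tuples."""
    tasks = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray or trailing commas
        # Split on the last colon so the count is always the final field.
        name, sep, count = item.rpartition(":")
        if not sep or not name:
            raise ValueError(f"expected name:count, got {item!r}")
        tasks.append((name, int(count)))
    return tasks


print(parse_tasks("gsm8k:128"))              # [('gsm8k', 128)]
print(parse_tasks("gsm8k:128,math500:128"))  # [('gsm8k', 128), ('math500', 128)]
```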
Override tasks or runtime settings with environment variables:
TASKS="gsm8k:128,math500:128" NUM_GPUS=4 ./run_hf_benchmark.sh

Run the SGLang benchmark:

DRAFT_MODEL=Huang2020/Qwen3-8B-Domino-b16 \
TARGET_MODEL=Qwen/Qwen3-8B \
PYTHON=python \
./run_sglang_benchmark.sh

Defaults:

- TASKS=gsm8k:128
- MAX_NEW_TOKENS=2048
- TEMPERATURE=0.0
- CONCURRENCIES=1,2,4,8,16,32

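The pattern of defaults that environment variables can override can be sketched as follows. This is an assumption about the general mechanism, not a copy of `run_sglang_benchmark.sh`: `resolve_settings` is a hypothetical helper, and treating an empty variable as unset mirrors the common shell idiom `${VAR:-default}`, which the script may or may not use.

```python
import os

# Default values as listed for the SGLang benchmark.
SGLANG_DEFAULTS = {
    "TASKS": "gsm8k:128",
    "MAX_NEW_TOKENS": "2048",
    "TEMPERATURE": "0.0",
    "CONCURRENCIES": "1,2,4,8,16,32",
}


def resolve_settings(env):
    """Merge environment overrides with the defaults and convert types."""
    # An unset or empty variable falls back to the default.
    raw = {key: env.get(key) or default for key, default in SGLANG_DEFAULTS.items()}
    return {
        "TASKS": raw["TASKS"],
        "MAX_NEW_TOKENS": int(raw["MAX_NEW_TOKENS"]),
        "TEMPERATURE": float(raw["TEMPERATURE"]),
        "CONCURRENCIES": [int(c) for c in raw["CONCURRENCIES"].split(",") if c.strip()],
    }


# Settings for the current process environment.
print(resolve_settings(os.environ))
# An explicit override replaces only the variable that was set.
print(resolve_settings({"CONCURRENCIES": "1"})["CONCURRENCIES"])  # [1]
```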
Use these sample counts to reproduce the paper settings:
TASKS="gsm8k:128,math500:128,aime24:30,aime25:30,humaneval:164,mbpp:128,livecodebench:128,swe-bench:128,mt-bench:80,alpaca:128"

Override tasks or runtime settings with environment variables:
TASKS="mt-bench:80,alpaca:128" CONCURRENCIES=1 ./run_sglang_benchmark.sh

We thank the authors and maintainers of DFlash, SpecForge, FlashInfer, and SGLang. Their open-source work on block-parallel speculative decoding, speculative-decoding training infrastructure, high-performance attention kernels, and LLM serving helped shape this project and its benchmarking setup.

If you use Domino in your research, please cite:
@article{huang2026domino,
  title={Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding},
  author={Huang, Jianuo and Zhang, Yaojie and Zhang, Qituan and Lin, Hao and Xu, Hanlin and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2605.29707},
  year={2026}
}