Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Domino is a speculative decoding method that keeps draft generation block-parallel while adding a lightweight causal correction head to improve draft-token acceptance.

News

[2026-05-30] 🔥🔥 Domino training code released! The training implementation is now available in SpecForge via sgl-project/SpecForge#571.
[2026-05-29] 🔥 Domino paper released! Read the paper on arXiv.

Demo

Supported Models

| Target model | Draft model |
| --- | --- |
| `Qwen/Qwen3-4B` | `Huang2020/Qwen3-4B-Domino-b16` |
| `Qwen/Qwen3-8B` | `Huang2020/Qwen3-8B-Domino-b16` |

Installation

Use Python 3.10 or newer on a CUDA GPU machine. Install a PyTorch build that matches your CUDA driver, then install the remaining Hugging Face benchmark dependencies:

python -m pip install --upgrade pip
python -m pip install -r requirements-hf.txt

For the SGLang benchmark, install the extra build tools first. On Ubuntu:

sudo apt-get update
sudo apt-get install -y build-essential ninja-build protobuf-compiler

The SGLang branch also builds a Rust component. Install Rust if cargo is not already available:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"

Then install the Domino-compatible SGLang branch in the same Python environment:

git clone --branch sglang-feat/dflash-domino https://github.com/jianuo-huang/Domino.git sglang-domino
cd sglang-domino
python -m pip install -e ./python
python -m pip install --force-reinstall --no-deps sglang-kernel \
  --index-url https://docs.sglang.ai/whl/cu130/
cd -

This SGLang branch currently pins PyTorch 2.11, which resolves to CUDA 13 wheels. Use the matching cu130 sglang-kernel wheel shown above, and verify that your NVIDIA driver is new enough for the CUDA 13 runtime libraries.
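Before choosing a kernel wheel, it can help to confirm what the active environment actually contains. The snippet below is a minimal sketch and is not part of the Domino repository: `describe_environment` is a hypothetical helper name, and it relies only on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
import sys


def describe_environment() -> dict:
    """Report the Python and PyTorch facts that decide which wheels to install."""
    info = {
        "python_ok": sys.version_info >= (3, 10),  # this README asks for 3.10+
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
    }
    try:
        import torch
    except (ImportError, OSError):
        return info  # PyTorch is missing or could not be loaded

    info["torch"] = torch.__version__
    # CUDA version the wheel was built against; None on CPU-only builds.
    info["cuda_build"] = torch.version.cuda
    # False when no usable GPU or driver is visible to PyTorch.
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

A `cuda_build` that starts with 13 goes with the cu130 kernel wheel; a 12.x build calls for the CUDA 12 path described next. If `cuda_available` is false even though a CUDA build is installed, an outdated driver is one likely cause worth checking.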

For CUDA 12.8 / PyTorch 2.9, patch the SGLang dependency pins before installing:

git clone --branch sglang-feat/dflash-domino https://github.com/jianuo-huang/Domino.git sglang-domino
cd sglang-domino

python -m pip install --upgrade pip

sed -i \
  -e 's/"torch==2.11.0"/"torch==2.9.1+cu128"/' \
  -e 's/"torchaudio==2.11.0"/"torchaudio==2.9.1+cu128"/' \
  -e 's/"torchvision"/"torchvision==0.24.1+cu128"/' \
  -e 's/"kernels"/"kernels==0.14.1"/' \
  -e '/"sglang-kernel==0.4.2"/d' \
  python/pyproject.toml

python -m pip install \
  --extra-index-url https://download.pytorch.org/whl/cu128 \
  -e ./python
python -m pip install --force-reinstall --no-deps "${SGLANG_KERNEL_CU12_WHEEL}"
cd -

Set SGLANG_KERNEL_CU12_WHEEL to a CUDA-12-compatible sglang-kernel wheel before running the final pip install command, which reads that variable. Do not install the cu130 wheel in a PyTorch 2.9/cu128 environment.
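One caveat with the sed patch: sed exits successfully even when a pattern matches nothing, so a changed pin upstream would go unnoticed. The sketch below applies the same rewrites in Python and reports any pattern that never matched. It is an illustration rather than part of the repository; `patch_pins` is a hypothetical helper, and it matches the pins as literal strings, which is slightly stricter than the regular expressions sed uses.

```python
from pathlib import Path

# Same rewrites as the sed command above, matched as literal strings.
PIN_REWRITES = [
    ('"torch==2.11.0"', '"torch==2.9.1+cu128"'),
    ('"torchaudio==2.11.0"', '"torchaudio==2.9.1+cu128"'),
    ('"torchvision"', '"torchvision==0.24.1+cu128"'),
    ('"kernels"', '"kernels==0.14.1"'),
]
# Lines containing this pin are deleted, as with sed's /.../d command.
DROPPED_PIN = '"sglang-kernel==0.4.2"'


def patch_pins(text: str) -> tuple[str, list[str]]:
    """Return (patched_text, patterns_that_never_matched)."""
    unmatched = [old for old, _ in PIN_REWRITES if old not in text]
    if DROPPED_PIN not in text:
        unmatched.append(DROPPED_PIN)

    patched_lines = []
    for line in text.splitlines(keepends=True):
        if DROPPED_PIN in line:
            continue  # drop the CUDA 13 kernel pin entirely
        for old, new in PIN_REWRITES:
            line = line.replace(old, new, 1)  # first match per line, like s/old/new/
        patched_lines.append(line)
    return "".join(patched_lines), unmatched


if __name__ == "__main__":
    pyproject = Path("python/pyproject.toml")  # run from the sglang-domino checkout
    if not pyproject.exists():
        print("python/pyproject.toml not found; run this from the sglang-domino checkout")
    else:
        patched, unmatched = patch_pins(pyproject.read_text())
        if unmatched:
            print(f"pins not found, file left untouched: {unmatched}")
        else:
            pyproject.write_text(patched)
```

Refusing to write when any pin is missing is a deliberate choice here: a partially patched pyproject.toml could mix CUDA 12 and CUDA 13 dependencies.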

Quick Usage

Domino draft checkpoints expose a spec_generate method for speculative decoding directly against a target model. We currently recommend running this path on a single GPU; the example below loads both models on cuda:0.

from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

draft_model = AutoModel.from_pretrained(
    "Huang2020/Qwen3-8B-Domino-b16",
    trust_remote_code=True,
    dtype="auto",
    device_map="cuda:0",
).eval()

target_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    dtype="auto",
    device_map="cuda:0",
).eval()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
prompt = "How many positive whole-number divisors does 196 have?"
messages = [{"role": "user", "content": prompt}]

# The Domino draft model is trained for Qwen3 with thinking mode disabled.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
model_inputs = tokenizer([text], return_tensors="pt").to(draft_model.device)

output_ids = draft_model.spec_generate(
    input_ids=model_inputs["input_ids"],
    target=target_model,
    max_new_tokens=2048,
    temperature=0.0,
    stop_token_ids=[tokenizer.eos_token_id],
)

# Drop the prompt tokens so that only the newly generated text is decoded.
generated_ids = output_ids[:, model_inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
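For intuition about what happens at temperature=0.0, the sketch below shows the generic greedy verification rule used in speculative decoding: keep the longest prefix of the drafted block that agrees with the target model's own argmax choices, then commit one token from the target. This is a toy illustration on plain token lists, not Domino's implementation; `accept_greedy` is a hypothetical name, and whether spec_generate follows exactly this rule internally is an assumption.

```python
def accept_greedy(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Return the tokens committed after verifying one drafted block.

    draft_tokens:  k tokens proposed by the draft model.
    target_tokens: k + 1 tokens the target model picks greedily, where entry i
                   is its choice given the prefix plus draft_tokens[:i].
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    committed = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break  # first disagreement: everything after it is discarded
        committed.append(drafted)

    # The target's own token at the mismatch position, or the extra token
    # after a fully accepted block.
    committed.append(target_tokens[len(committed)])
    return committed


if __name__ == "__main__":
    # Draft proposes three tokens; the target agrees with the first two.
    print(accept_greedy([5, 7, 9], [5, 7, 2, 4]))
```

Because every committed token is one the target would have chosen itself, this rule leaves the greedy output unchanged and only reduces the number of target forward passes. A higher draft-token acceptance rate means more tokens are committed per pass.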

Hugging Face Benchmark

DRAFT_MODEL=Huang2020/Qwen3-8B-Domino-b16 \
TARGET_MODEL=Qwen/Qwen3-8B \
PYTHON=python \
./run_hf_benchmark.sh

Defaults:

TASKS=gsm8k:128
MAX_NEW_TOKENS=2048
TEMPERATURE=0.0
BLOCK_SIZE=16
NUM_GPUS=8

Override tasks or runtime settings with environment variables:

TASKS="gsm8k:128,math500:128" NUM_GPUS=4 ./run_hf_benchmark.sh
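The TASKS value is a comma-separated list of name:count pairs. As a hedged illustration of that format, the sketch below parses such a string in Python. `parse_tasks` is a hypothetical helper; how run_hf_benchmark.sh parses the variable internally is not shown here, and reading the number as a sample count is inferred from the examples in this README.

```python
def parse_tasks(spec: str) -> list[tuple[str, int]]:
    """Parse a TASKS string such as "gsm8k:128,math500:128"."""
    tasks = []
    for item in spec.split(","):
        # Split on the last colon so the count is always the final field.
        name, sep, count = item.strip().rpartition(":")
        if not sep or not name or not count.isdigit():
            raise ValueError(f"expected name:count, got {item!r}")
        tasks.append((name, int(count)))
    return tasks


if __name__ == "__main__":
    for name, count in parse_tasks("gsm8k:128,math500:128"):
        print(f"{name}: {count} samples")
```

A check like this can catch a malformed TASKS value before a long multi-GPU run starts.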

SGLang Benchmark

DRAFT_MODEL=Huang2020/Qwen3-8B-Domino-b16 \
TARGET_MODEL=Qwen/Qwen3-8B \
PYTHON=python \
./run_sglang_benchmark.sh

Defaults:

TASKS=gsm8k:128
MAX_NEW_TOKENS=2048
TEMPERATURE=0.0
CONCURRENCIES=1,2,4,8,16,32

Use these sample counts to reproduce the paper settings:

TASKS="gsm8k:128,math500:128,aime24:30,aime25:30,humaneval:164,mbpp:128,livecodebench:128,swe-bench:128,mt-bench:80,alpaca:128"

Override tasks or runtime settings with environment variables:

TASKS="mt-bench:80,alpaca:128" CONCURRENCIES=1 ./run_sglang_benchmark.sh
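When driving the benchmark from Python, for example in a sweep, the same overrides can be assembled programmatically. The sketch below is only an illustration: `sglang_benchmark_env` and `PAPER_TASKS` are hypothetical names, and the only variables it sets are the TASKS and CONCURRENCIES settings documented above.

```python
# Task list and sample counts copied from the paper settings above.
PAPER_TASKS = {
    "gsm8k": 128,
    "math500": 128,
    "aime24": 30,
    "aime25": 30,
    "humaneval": 164,
    "mbpp": 128,
    "livecodebench": 128,
    "swe-bench": 128,
    "mt-bench": 80,
    "alpaca": 128,
}


def sglang_benchmark_env(tasks: dict[str, int], concurrencies: list[int]) -> dict[str, str]:
    """Build environment overrides for run_sglang_benchmark.sh."""
    if not tasks or not concurrencies:
        raise ValueError("need at least one task and one concurrency level")
    return {
        "TASKS": ",".join(f"{name}:{count}" for name, count in tasks.items()),
        "CONCURRENCIES": ",".join(str(level) for level in concurrencies),
    }


if __name__ == "__main__":
    env = sglang_benchmark_env(PAPER_TASKS, [1, 2, 4, 8, 16, 32])
    print(env["TASKS"])
    print(env["CONCURRENCIES"])
    print(f"{sum(PAPER_TASKS.values())} samples in the task list")
```

To launch with these settings, merge the result into the current environment, for instance `subprocess.run(["./run_sglang_benchmark.sh"], env={**os.environ, **env}, check=True)`.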

Acknowledgements

We thank the authors and maintainers of DFlash, SpecForge, FlashInfer, and SGLang. Their open-source work on block-parallel speculative decoding, speculative-decoding training infrastructure, high-performance attention kernels, and LLM serving helped shape this project and its benchmarking setup.

Citation

If you use Domino in your research, please cite:

@article{huang2026domino,
  title={Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding},
  author={Huang, Jianuo and Zhang, Yaojie and Zhang, Qituan and Lin, Hao and Xu, Hanlin and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2605.29707},
  year={2026}
}
