We present FastMTP, a simple yet effective method that enhances Multi-Token Prediction (MTP) for speculative decoding during inference. Our approach fine-tunes a single MTP head with shared weights across multiple causal draft steps, enabling it to capture longer-range dependencies and achieve higher acceptance rates in speculative decoding. By integrating language-aware vocabulary compression into the MTP head, we further reduce computational overhead during draft generation. Experimental results across diverse benchmarks demonstrate that FastMTP achieves an average 2.03× speedup over vanilla next-token prediction while maintaining lossless output quality. With low training cost and seamless integration into existing inference frameworks, FastMTP offers a practical and rapidly deployable solution for accelerating LLM inference.
For more details, please refer to our technical report.
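At a high level, drafting with FastMTP means applying one MTP head, with one set of weights, for several causal steps in a row, and letting the target model verify the resulting draft. The snippet below is only a conceptual sketch of that reuse pattern (the `MTPHead` module, shapes, and greedy drafting loop are illustrative assumptions, not our training or inference code):

```python
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """Toy single MTP head: one set of weights reused for every draft step."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)     # shared transition
        self.lm_head = nn.Linear(hidden_size, vocab_size)   # shared output head
        self.embed = nn.Embedding(vocab_size, hidden_size)

    @torch.no_grad()
    def draft(self, hidden: torch.Tensor, num_steps: int) -> list[int]:
        """Recursively apply the same head for `num_steps` causal draft steps."""
        tokens = []
        for _ in range(num_steps):
            hidden = torch.tanh(self.proj(hidden))        # same weights at every step
            token = self.lm_head(hidden).argmax(dim=-1)   # greedy draft token
            tokens.append(int(token))
            hidden = hidden + self.embed(token)           # feed the draft token back in
        return tokens

head = MTPHead(hidden_size=64, vocab_size=1000)
draft_tokens = head.draft(torch.randn(64), num_steps=3)  # e.g. --speculative-num-steps 3
print(draft_tokens)
```

In the real system the draft tokens are verified by the target model, and only the accepted prefix is kept, which is what preserves lossless output quality.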
Model training was performed on H20 GPUs with the following environment:
- python 3.10
- torch 2.7.1+cu128
- flash_attn 2.8.2
conda create -n env-3.10 python=3.10 -y
conda activate env-3.10
pip install -r requirements.txt
cd ms-swift-3.6.4
pip install -e .
cd ..
cd transformers-4.54.0
pip install -e .
Evaluation experiments were mainly conducted on a single A10 GPU with the following environment:
- python 3.12.11
- torch 2.8.0
- cuda 12.8
pip install sglang[all]
Download the corresponding model weights to the model folder.
# Make sure git-lfs is installed
cd model
git clone https://huggingface.co/XiaomiMiMo/MiMo-7B-RL
# Replace with our customized configuration
cp config.json MiMo-7B-RL/config.json
Run training with:
sh sft.sh
Start the server with the following command:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python3 -m sglang.launch_server \
--model-path <model_path> \
--trust-remote-code \
--mem-fraction-static 0.7 \
--max-running-requests 1 \
--tensor-parallel-size 1 \
--cuda-graph-max-bs 1 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-token-map <freq_map_path>
- <model_path>: Path to the model folder
- <freq_map_path>: Path to the high-frequency token vocabulary
Note: Use the original config.json in <model_path> for evaluation.
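Before launching evaluation, it can help to confirm the server is reachable. The check below is generic and not part of this repository; the `/health` route and the OpenAI-compatible `/v1/models` listing are assumptions about the SGLang server, and the port should match whatever you passed to `--port` (30000 by default):

```python
import requests

port = 30000  # default port of sglang.launch_server; change if you set --port

# Liveness check (assumes the server exposes a /health route)
print("health:", requests.get(f"http://127.0.0.1:{port}/health", timeout=5).status_code)

# List the served model via the OpenAI-compatible API
print("models:", requests.get(f"http://127.0.0.1:{port}/v1/models", timeout=5).json())
```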
Model weights are available on Huggingface (see here).
Processed language-aware high-frequency token vocabularies (based on the Qwen2Tokenizer) are available for download on Huggingface:
- English: Qwen2-7B-Instruct-FR-Spec
- Chinese: MiMo-7B-RL-FR-Spec-zh
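If you need a high-frequency vocabulary for a different language or domain instead of the files above, one possible recipe is to tokenize a representative corpus with the Qwen2 tokenizer, rank token ids by frequency, and keep the top entries. The sketch below only illustrates that idea: `corpus.txt`, the 32768 cutoff, and the saved tensor format are assumptions, so check the released FR-Spec files for the exact layout expected by `--speculative-token-map`:

```python
from collections import Counter

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")  # Qwen2Tokenizer
counter = Counter()

# corpus.txt is a placeholder: one document per line in the target language
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counter.update(tokenizer.encode(line, add_special_tokens=False))

top_k = 32768  # assumed cutoff; larger keeps more of the vocabulary, smaller is cheaper
token_ids = sorted(tid for tid, _ in counter.most_common(top_k))

# Saved as a tensor of kept token ids; verify against the released FR-Spec files
torch.save(torch.tensor(token_ids, dtype=torch.int64), "freq_map.pt")
```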
Open a new terminal and execute:
cd evaluation/<benchmark>
python3 bench_sglang_eagle.py
# Example: MT-Bench evaluation
cd evaluation/mt_bench
python3 bench_sglang_eagle.py \
--question-file question.jsonl \
--num-questions 80 \
--temperature 0 \
--max-gen-length 1024 \
--answer-file <answer_file> \
--result-file <result_file>
You can also send a request and stream the output:
import openai

port = 30000  # port the SGLang server is listening on (default for sglang.launch_server)
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
# Use stream=True for streaming responses
response = client.chat.completions.create(
model="default",
messages=[
{"role": "user", "content": "What is the capital of France?"},
],
temperature=0,
max_tokens=2048,
stream=True,
)
# Handle the streaming output
for chunk in response:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
This project builds on the following work:
- ms-swift and transformers: Training frameworks we modified and built upon.
- SGLang: Codebase used for inference.
- EAGLE and thunlp/FR-Spec: Key works that inspired our approach.
- XiaomiMiMo/MiMo-7B-RL: Backbone model for our experiments.
If you find the resources in this repository useful, please cite our paper:
@article{cai2025fastmtp,
title={FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction},
author={Cai, Yuxuan and Liang, Xiaozhuan and Wang, Xinghua and Ma, Jin and Liang, Haijin and Luo, Jinwen and Zuo, Xinyu and Duan, Lisheng and Yin, Yuyang and Chen, Xi},
journal={arXiv preprint arXiv:2509.18362},
year={2025}
}