
FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction


Introduction

We present FastMTP, a simple yet effective method that enhances Multi-Token Prediction (MTP) for speculative decoding during inference. Our approach fine-tunes a single MTP head with shared weights across multiple causal draft steps, enabling it to capture longer-range dependencies and achieve higher acceptance rates in speculative decoding. By integrating language-aware vocabulary compression into the MTP head, we further reduce computational overhead during draft generation. Experimental results across diverse benchmarks demonstrate that FastMTP achieves an average speedup of 2.03× over vanilla next-token prediction while maintaining lossless output quality. With low training cost and seamless integration into existing inference frameworks, FastMTP offers a practical and rapidly deployable solution for accelerating LLM inference.
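For readers unfamiliar with speculative decoding, the toy sketch below illustrates the draft-and-verify loop that the MTP head accelerates. It is a simplified illustration only: draft_head and target_model are placeholder functions we made up for this sketch, not the FastMTP implementation.

# Toy sketch of the draft-and-verify loop behind speculative decoding.
# `draft_head` and `target_model` are illustrative placeholders.
import random

def draft_head(context, num_steps=3):
    # A single MTP head with shared weights would propose the next
    # `num_steps` tokens autoregressively; here we simply guess.
    return [random.randint(0, 9) for _ in range(num_steps)]

def target_model(context, drafts):
    # The target model verifies all draft tokens in one forward pass and
    # keeps the longest prefix it agrees with, plus one token of its own.
    accepted = []
    for token in drafts:
        if random.random() < 0.7:   # stand-in for "draft matches target"
            accepted.append(token)
        else:
            break
    bonus = random.randint(0, 9)
    return accepted, bonus

context = [1, 2, 3]
while len(context) < 20:
    drafts = draft_head(context)
    accepted, bonus = target_model(context, drafts)
    # In real speculative decoding the final output equals what the target
    # model alone would produce, which is why decoding stays lossless.
    context += accepted + [bonus]
print(context)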

🌟 For more details, please refer to our technical report.

Figure: Speedup comparison of different methods across subtasks, evaluated on a single A10 GPU.

⚙️ Installation

Training Environment

Model training was performed on H20 GPUs with the following environment:

  • python 3.10
  • torch 2.7.1+cu128
  • flash_attn 2.8.2

conda create -n env-3.10 python=3.10 -y
conda activate env-3.10
pip install -r requirements.txt

cd ms-swift-3.6.4
pip install -e .

cd ..
cd transformers-4.54.0
pip install -e .
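
As an optional sanity check (not part of the official setup), you can confirm that the reported torch and flash_attn builds are importable:

# Optional check that the training environment matches the versions above
# (torch 2.7.1+cu128, flash_attn 2.8.2); minor deviations may still work.
import torch
import flash_attn

print("torch:", torch.__version__)
print("flash_attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())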

Evaluation Environment

Evaluation experiments were mainly conducted on a single A10 GPU with the following environment:

  • python 3.12.11
  • torch 2.8.0
  • cuda 12.8

pip install sglang[all]

🚀 Getting Started

Prepare model

Download the corresponding model weights to the model folder.

# Make sure git-lfs is installed
cd model
git clone https://huggingface.co/XiaomiMiMo/MiMo-7B-RL

# Replace with our customized configuration
cp config.json MiMo-7B-RL/config.json
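
If git-lfs is unavailable, a rough alternative (our suggestion, not part of the official instructions) is to pull the same weights with the huggingface_hub Python API; remember to overwrite config.json afterwards exactly as above:

# Alternative download path using huggingface_hub instead of git clone.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="XiaomiMiMo/MiMo-7B-RL",
    local_dir="model/MiMo-7B-RL",
)
# Then replace MiMo-7B-RL/config.json with the customized config.json.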

Launch Training

sh sft.sh

Evaluation

Step 1: Launch a server

Start the server with the following command:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

python3 -m sglang.launch_server \
        --model-path <model_path> \
        --trust-remote-code \
        --mem-fraction-static 0.7 \
        --max-running-requests 1 \
        --tensor-parallel-size 1 \
        --cuda-graph-max-bs 1 \
        --speculative-algorithm EAGLE \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4 \
        --speculative-token-map <freq_map_path>
  • <model_path>: Path to the model folder
  • <freq_map_path>: Path to the high-frequency token vocabulary

Note: Use the original config.json in <model_path> for evaluation.

Model weights are available on Hugging Face.

Processed language-aware high-frequency token vocabularies (based on the Qwen2Tokenizer) are also available for download on Hugging Face.
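
Before running benchmarks, you can check that the server is serving requests. The sketch below assumes SGLang's default port (30000) and its standard /health and /get_model_info endpoints; adjust if you launched the server with different settings.

# Smoke test against the running SGLang server.
import requests

base = "http://127.0.0.1:30000"
print(requests.get(f"{base}/health").status_code)      # expect 200 when ready
print(requests.get(f"{base}/get_model_info").json())   # loaded model path, etc.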

Step 2: Run the benchmark

Open a new terminal and execute:

cd evaluation/<benchmark>
python3 bench_sglang_eagle.py

# Example: MT-Bench evaluation
cd evaluation/mt_bench
python3 bench_sglang_eagle.py \
  --question-file question.jsonl \
  --num-questions 80 \
  --temperature 0 \
  --max-gen-length 1024 \
  --answer-file <answer_file> \
  --result-file <result_file>
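
Assuming the answer file is JSONL like the question file, the snippet below is a generic inspection sketch; the file path and per-record field names are our assumptions, not something guaranteed by bench_sglang_eagle.py.

# Generic inspection of a JSONL answer file; adjust the path and field
# handling to the actual benchmark output.
import json

records = []
with open("answer.jsonl") as f:
    for line in f:
        if line.strip():
            records.append(json.loads(line))

print(f"{len(records)} records loaded")
if records:
    print("fields:", sorted(records[0].keys()))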

You can also send a request and stream the output:

import openai

# Port of the running SGLang server (30000 by default unless --port was set)
port = 30000
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

# Use stream=True for streaming responses
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=2048,
    stream=True,
)

# Handle the streaming output
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
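
If you prefer a single blocking response, set stream=False and read response.choices[0].message.content instead of iterating over chunks.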

Acknowledgments

Citation

If you find the resources in this repository useful, please cite our paper:

@article{cai2025fastmtp,
  title={FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction},
  author={Cai, Yuxuan and Liang, Xiaozhuan and Wang, Xinghua and Ma, Jin and Liang, Haijin and Luo, Jinwen and Zuo, Xinyu and Duan, Lisheng and Yin, Yuyang and Chen, Xi},
  journal={arXiv preprint arXiv:2509.18362},
  year={2025}
}
