
Conversation


@Qiaolin-Yu Qiaolin-Yu commented Oct 19, 2025

Motivation

Modifications

To enable this feature, use `--enable-beta-spec`.
This PR increases bs=1 performance from 250 to 280 token/s.
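For example, a minimal launch command with the flag enabled (the model path is a placeholder; the reviewer's full benchmark command appears below):

```shell
python3 -m sglang.launch_server --model-path <model-path> \
    --speculative-algorithm=EAGLE --enable-beta-spec
```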

After this PR, execution is still not fully overlapped (figure 1) because these three kernels are too slow (figure 2). This will be fixed in a future PR.

[Figure 1: profile trace showing the remaining non-overlapped gaps] [Figure 2: profile trace showing the three slow kernels]

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist

Summary of Changes

Hello @Qiaolin-Yu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates support for the overlap-spec-v2 mode into the trtllm_mla attention backend. This involves modifying the attention layer's forward pass and metadata initialization to correctly process requests under this new speculative decoding mode. Additionally, a cache clearing utility is introduced in the eagle_worker_v2 to manage resources more effectively.

Highlights

  • TRTLLM MLA Backend Update: The trtllm_mla attention backend now explicitly supports the overlap-spec-v2 mode by recognizing is_draft_extend_v2 in its forward pass logic.
  • KV Cache Management: The init_forward_metadata method in the trtllm_mla backend has been updated to correctly handle is_draft_extend_v2 to ensure proper KV cache initialization.
  • Speculative Decoding Worker Enhancement: A new clear_cache_pool method has been added to eagle_worker_v2 to facilitate clearing the request-to-token and token-to-KV cache pools, likely for better resource management in speculative decoding; a minimal sketch of these changes follows this list.
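For context, here is a minimal, hypothetical sketch of the two changes described above. Only the names `clear_cache_pool`, `req_to_token_pool`, `token_to_kv_pool_allocator`, `is_draft_extend_v2`, and the `DRAFT_EXTEND_V2` mode come from this PR; the class structure, accessor, and helper calls are illustrative assumptions, not SGLang's exact API.

```python
# Hypothetical sketch; not the PR's exact code.

class TRTLLMMLABackend:
    def init_forward_metadata(self, forward_batch):
        # Treat the new DRAFT_EXTEND_V2 forward mode like other extend
        # modes when preparing KV-cache metadata (assumed accessor name).
        if forward_batch.forward_mode.is_draft_extend_v2():
            self._init_extend_metadata(forward_batch)  # placeholder helper
        else:
            self._init_decode_metadata(forward_batch)  # placeholder helper


class EAGLEWorkerV2:
    def clear_cache_pool(self):
        # Reset the request-to-token and token-to-KV cache pools so the
        # worker starts from a clean state (e.g. between test runs).
        self.draft_worker.req_to_token_pool.clear()
        self.draft_worker.token_to_kv_pool_allocator.clear()
```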
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for overlap-spec-v2 with the trtllm_mla attention backend. The changes involve updating the attention backend to recognize the new DRAFT_EXTEND_V2 forward mode. Additionally, a utility method clear_cache_pool is added to EAGLEWorkerV2 for cleaning up memory pools, likely for testing purposes. My review includes a suggestion to simplify the implementation of this new utility method for better clarity and maintainability.

Comment on lines 701 to 702:

```python
self.draft_worker.req_to_token_pool.clear()
self.draft_worker.token_to_kv_pool_allocator.clear()
```

Severity: medium

The EAGLEWorkerV2 class already has direct access to req_to_token_pool and token_to_kv_pool_allocator. Accessing them through self.draft_worker is an unnecessary indirection. For better clarity and more direct code, you can call .clear() on the attributes of self directly.

Suggested change:

```diff
-self.draft_worker.req_to_token_pool.clear()
-self.draft_worker.token_to_kv_pool_allocator.clear()
+self.req_to_token_pool.clear()
+self.token_to_kv_pool_allocator.clear()
```

@hnyls2002 hnyls2002 mentioned this pull request Oct 19, 2025
@Qiaolin-Yu Qiaolin-Yu marked this pull request as draft October 21, 2025 03:07
@Qiaolin-Yu Qiaolin-Yu marked this pull request as ready for review October 21, 2025 06:38
@JustinTong0323 JustinTong0323 self-assigned this Oct 22, 2025
@JustinTong0323 JustinTong0323 left a comment

Launch server:

```shell
SGLANG_USE_CUTLASS_BACKEND_FOR_FP4_GEMM=1 TRTLLM_ENABLE_PDL=1 python3 -m sglang.launch_server \
    --model-path nvidia/DeepSeek-V3-0324-FP4 --trust-remote-code --quantization modelopt_fp4 \
    --tp 8 --speculative-algorithm=EAGLE --port 40020 --enable-beta-spec --kv-cache-dtype fp8_e4m3
```

Bench:

```shell
python3 -m sglang.bench_serving --backend sglang-oai --dataset-name random --random-input-len 1024 \
    --random-output-len 1024 --random-range-ratio 0.98 --num-prompts 5 --max-concurrency 1 \
    --output-file res.jsonl --port 40020
```

Before this PR (without --enable-beta-spec):

```
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     5
Benchmark duration (s):                  20.30
Total input tokens:                      5060
Total input text tokens:                 5060
Total input vision tokens:               0
Total generated tokens:                  5062
Total generated tokens (retokenized):    5022
Request throughput (req/s):              0.25
Input token throughput (tok/s):          249.32
Output token throughput (tok/s):         249.42
Total token throughput (tok/s):          498.74
Concurrency:                             1.00
Accept length:                           2.71
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4057.04
Median E2E Latency (ms):                 3903.28
---------------Time to First Token----------------
Mean TTFT (ms):                          103.08
Median TTFT (ms):                        99.68
P99 TTFT (ms):                           119.09
---------------Inter-Token Latency----------------
Mean ITL (ms):                           10.60
Median ITL (ms):                         10.63
P95 ITL (ms):                            11.47
P99 ITL (ms):                            11.94
Max ITL (ms):                            19.41
==================================================
```

After this PR:

```
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     5
Benchmark duration (s):                  17.30
Total input tokens:                      5060
Total input text tokens:                 5060
Total input vision tokens:               0
Total generated tokens:                  5062
Total generated tokens (retokenized):    5022
Request throughput (req/s):              0.29
Input token throughput (tok/s):          292.44
Output token throughput (tok/s):         292.56
Total token throughput (tok/s):          585.01
Concurrency:                             1.00
Accept length:                           2.74
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   3458.21
Median E2E Latency (ms):                 3434.07
---------------Time to First Token----------------
Mean TTFT (ms):                          208.59
Median TTFT (ms):                        117.10
P99 TTFT (ms):                           519.46
---------------Inter-Token Latency----------------
Mean ITL (ms):                           8.77
Median ITL (ms):                         8.72
P95 ITL (ms):                            9.29
P99 ITL (ms):                            9.75
Max ITL (ms):                            17.92
==================================================
```

@ispobock ispobock merged commit 36a4cad into sgl-project:main Oct 23, 2025
154 of 196 checks passed
@Qiaolin-Yu Qiaolin-Yu deleted the overlap branch October 25, 2025 23:13
