
Conversation


@Qiaolin-Yu Qiaolin-Yu commented Oct 19, 2025

Motivation

Modifications

To enable this feature, use `--enable-beta-spec`.
This PR increases bs=1 performance from 250 to 280 token/s.
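For example, a minimal launch command with the flag enabled (the model path is a placeholder; the reviewer's full benchmark command appears below):

```shell
python3 -m sglang.launch_server --model-path <model-path> \
    --speculative-algorithm=EAGLE --enable-beta-spec
```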

After this PR, execution is still not fully overlapped (figure 1) because these three kernels are too slow (figure 2). This will be fixed in a future PR.

[Figure 1: profile trace showing the remaining non-overlapped gaps] [Figure 2: profile trace showing the three slow kernels]

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist

Summary of Changes

Hello @Qiaolin-Yu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates support for the overlap-spec-v2 mode into the trtllm_mla attention backend. This involves modifying the attention layer's forward pass and metadata initialization to correctly process requests under this new speculative decoding mode. Additionally, a cache clearing utility is introduced in the eagle_worker_v2 to manage resources more effectively.

Highlights

  • TRTLLM MLA Backend Update: The trtllm_mla attention backend now explicitly supports the overlap-spec-v2 mode by recognizing is_draft_extend_v2 in its forward pass logic.
  • KV Cache Management: The init_forward_metadata method in the trtllm_mla backend has been updated to correctly handle is_draft_extend_v2 to ensure proper KV cache initialization.
  • Speculative Decoding Worker Enhancement: A new clear_cache_pool method has been added to eagle_worker_v2 to facilitate clearing the request-to-token and token-to-KV cache pools, likely for better resource management in speculative decoding; a minimal sketch of these changes follows this list.
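For context, here is a minimal, hypothetical sketch of the two changes described above. Only the names `clear_cache_pool`, `req_to_token_pool`, `token_to_kv_pool_allocator`, `is_draft_extend_v2`, and the `DRAFT_EXTEND_V2` mode come from this PR; the class structure, accessor, and helper calls are illustrative assumptions, not SGLang's exact API.

```python
# Hypothetical sketch; not the PR's exact code.

class TRTLLMMLABackend:
    def init_forward_metadata(self, forward_batch):
        # Treat the new DRAFT_EXTEND_V2 forward mode like other extend
        # modes when preparing KV-cache metadata (assumed accessor name).
        if forward_batch.forward_mode.is_draft_extend_v2():
            self._init_extend_metadata(forward_batch)  # placeholder helper
        else:
            self._init_decode_metadata(forward_batch)  # placeholder helper


class EAGLEWorkerV2:
    def clear_cache_pool(self):
        # Reset the request-to-token and token-to-KV cache pools so the
        # worker starts from a clean state (e.g. between test runs).
        self.draft_worker.req_to_token_pool.clear()
        self.draft_worker.token_to_kv_pool_allocator.clear()
```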
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for overlap-spec-v2 with the trtllm_mla attention backend. The changes involve updating the attention backend to recognize the new DRAFT_EXTEND_V2 forward mode. Additionally, a utility method clear_cache_pool is added to EAGLEWorkerV2 for cleaning up memory pools, likely for testing purposes. My review includes a suggestion to simplify the implementation of this new utility method for better clarity and maintainability.

Comment on lines 701 to 702:

```python
self.draft_worker.req_to_token_pool.clear()
self.draft_worker.token_to_kv_pool_allocator.clear()
```

Severity: medium

The EAGLEWorkerV2 class already has direct access to req_to_token_pool and token_to_kv_pool_allocator. Accessing them through self.draft_worker is an unnecessary indirection. For better clarity and more direct code, you can call .clear() on the attributes of self directly.

Suggested change:

```diff
-self.draft_worker.req_to_token_pool.clear()
-self.draft_worker.token_to_kv_pool_allocator.clear()
+self.req_to_token_pool.clear()
+self.token_to_kv_pool_allocator.clear()
```

@hnyls2002 hnyls2002 mentioned this pull request Oct 19, 2025
@Qiaolin-Yu Qiaolin-Yu marked this pull request as draft October 21, 2025 03:07
@Qiaolin-Yu Qiaolin-Yu marked this pull request as ready for review October 21, 2025 06:38
@JustinTong0323 JustinTong0323 self-assigned this Oct 22, 2025
@JustinTong0323 JustinTong0323 left a comment

Launch server:

```shell
SGLANG_USE_CUTLASS_BACKEND_FOR_FP4_GEMM=1 TRTLLM_ENABLE_PDL=1 python3 -m sglang.launch_server \
    --model-path nvidia/DeepSeek-V3-0324-FP4 --trust-remote-code --quantization modelopt_fp4 \
    --tp 8 --speculative-algorithm=EAGLE --port 40020 --enable-beta-spec --kv-cache-dtype fp8_e4m3
```

Bench:

```shell
python3 -m sglang.bench_serving --backend sglang-oai --dataset-name random --random-input-len 1024 \
    --random-output-len 1024 --random-range-ratio 0.98 --num-prompts 5 --max-concurrency 1 \
    --output-file res.jsonl --port 40020
```

Before this PR (without --enable-beta-spec):

```
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     5
Benchmark duration (s):                  20.30
Total input tokens:                      5060
Total input text tokens:                 5060
Total input vision tokens:               0
Total generated tokens:                  5062
Total generated tokens (retokenized):    5022
Request throughput (req/s):              0.25
Input token throughput (tok/s):          249.32
Output token throughput (tok/s):         249.42
Total token throughput (tok/s):          498.74
Concurrency:                             1.00
Accept length:                           2.71
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4057.04
Median E2E Latency (ms):                 3903.28
---------------Time to First Token----------------
Mean TTFT (ms):                          103.08
Median TTFT (ms):                        99.68
P99 TTFT (ms):                           119.09
---------------Inter-Token Latency----------------
Mean ITL (ms):                           10.60
Median ITL (ms):                         10.63
P95 ITL (ms):                            11.47
P99 ITL (ms):                            11.94
Max ITL (ms):                            19.41
==================================================
```

After this PR:

```
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     5
Benchmark duration (s):                  17.30
Total input tokens:                      5060
Total input text tokens:                 5060
Total input vision tokens:               0
Total generated tokens:                  5062
Total generated tokens (retokenized):    5022
Request throughput (req/s):              0.29
Input token throughput (tok/s):          292.44
Output token throughput (tok/s):         292.56
Total token throughput (tok/s):          585.01
Concurrency:                             1.00
Accept length:                           2.74
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   3458.21
Median E2E Latency (ms):                 3434.07
---------------Time to First Token----------------
Mean TTFT (ms):                          208.59
Median TTFT (ms):                        117.10
P99 TTFT (ms):                           519.46
---------------Inter-Token Latency----------------
Mean ITL (ms):                           8.77
Median ITL (ms):                         8.72
P95 ITL (ms):                            9.29
P99 ITL (ms):                            9.75
Max ITL (ms):                            17.92
==================================================
```

@ispobock ispobock merged commit 36a4cad into sgl-project:main Oct 23, 2025
154 of 196 checks passed
@Qiaolin-Yu Qiaolin-Yu deleted the overlap branch October 25, 2025 23:13
