Support overlap-spec-v2 with trtllm_mla attention backend #11821
Conversation
Code Review
This pull request adds support for overlap-spec-v2 with the trtllm_mla attention backend. The changes involve updating the attention backend to recognize the new DRAFT_EXTEND_V2 forward mode. Additionally, a utility method clear_cache_pool is added to EAGLEWorkerV2 for cleaning up memory pools, likely for testing purposes. My review includes a suggestion to simplify the implementation of this new utility method for better clarity and maintainability.
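To picture the backend change the review describes, here is a small, self-contained sketch. It is illustrative only: the enum is a stand-in for sglang's ForwardMode, and the helper function and its role in the trtllm_mla backend are assumptions, not the actual diff.

```python
from enum import Enum, auto

# Illustrative stand-in for sglang's ForwardMode enum; DRAFT_EXTEND_V2 is the
# new mode that overlap-spec-v2 uses for the draft model's extend pass.
class ForwardMode(Enum):
    EXTEND = auto()
    DECODE = auto()
    TARGET_VERIFY = auto()
    DRAFT_EXTEND = auto()
    DRAFT_EXTEND_V2 = auto()

def is_draft_extend(mode: ForwardMode) -> bool:
    # The backend treats the v2 mode the same way as the existing
    # draft-extend mode when preparing metadata and selecting kernels.
    return mode in (ForwardMode.DRAFT_EXTEND, ForwardMode.DRAFT_EXTEND_V2)

print(is_draft_extend(ForwardMode.DRAFT_EXTEND_V2))  # True
```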
```python
self.draft_worker.req_to_token_pool.clear()
self.draft_worker.token_to_kv_pool_allocator.clear()
```
The EAGLEWorkerV2 class already has direct access to req_to_token_pool and token_to_kv_pool_allocator. Accessing them through self.draft_worker is an unnecessary indirection. For better clarity and more direct code, you can call .clear() on the attributes of self directly.
```diff
- self.draft_worker.req_to_token_pool.clear()
- self.draft_worker.token_to_kv_pool_allocator.clear()
+ self.req_to_token_pool.clear()
+ self.token_to_kv_pool_allocator.clear()
```
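Put together, the method with this suggestion applied would look roughly like the following. This is a sketch, assuming (as the review states) that EAGLEWorkerV2 holds these two pools directly:

```python
class EAGLEWorkerV2:
    ...
    def clear_cache_pool(self):
        # Reset request->token mappings and free all KV-cache allocations,
        # returning the worker's memory pools to a clean state (e.g. between tests).
        self.req_to_token_pool.clear()
        self.token_to_kv_pool_allocator.clear()
```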
Launch server:

```bash
SGLANG_USE_CUTLASS_BACKEND_FOR_FP4_GEMM=1 TRTLLM_ENABLE_PDL=1 python3 -m sglang.launch_server --model-path nvidia/DeepSeek-V3-0324-FP4 --trust-remote-code --quantization modelopt_fp4 --tp 8 --speculative-algorithm=EAGLE --port 40020 --enable-beta-spec --kv-cache-dtype fp8_e4m3
```

Bench:

```bash
python3 -m sglang.bench_serving --backend sglang-oai --dataset-name random --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.98 --num-prompts 5 --max-concurrency 1 --output-file res.jsonl --port 40020
```

Before this PR (without --enable-beta-spec):
```
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 5
Benchmark duration (s): 20.30
Total input tokens: 5060
Total input text tokens: 5060
Total input vision tokens: 0
Total generated tokens: 5062
Total generated tokens (retokenized): 5022
Request throughput (req/s): 0.25
Input token throughput (tok/s): 249.32
Output token throughput (tok/s): 249.42
Total token throughput (tok/s): 498.74
Concurrency: 1.00
Accept length: 2.71
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 4057.04
Median E2E Latency (ms): 3903.28
---------------Time to First Token----------------
Mean TTFT (ms): 103.08
Median TTFT (ms): 99.68
P99 TTFT (ms): 119.09
---------------Inter-Token Latency----------------
Mean ITL (ms): 10.60
Median ITL (ms): 10.63
P95 ITL (ms): 11.47
P99 ITL (ms): 11.94
Max ITL (ms): 19.41
==================================================
```
After this PR:
```
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 5
Benchmark duration (s): 17.30
Total input tokens: 5060
Total input text tokens: 5060
Total input vision tokens: 0
Total generated tokens: 5062
Total generated tokens (retokenized): 5022
Request throughput (req/s): 0.29
Input token throughput (tok/s): 292.44
Output token throughput (tok/s): 292.56
Total token throughput (tok/s): 585.01
Concurrency: 1.00
Accept length: 2.74
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 3458.21
Median E2E Latency (ms): 3434.07
---------------Time to First Token----------------
Mean TTFT (ms): 208.59
Median TTFT (ms): 117.10
P99 TTFT (ms): 519.46
---------------Inter-Token Latency----------------
Mean ITL (ms): 8.77
Median ITL (ms): 8.72
P95 ITL (ms): 9.29
P99 ITL (ms): 9.75
Max ITL (ms): 17.92
==================================================
```
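Comparing the two runs (5 prompts, concurrency 1): output token throughput improves from 249.42 to 292.56 tok/s (about 17%), mean ITL drops from 10.60 ms to 8.77 ms, and mean E2E latency falls from 4057 ms to 3458 ms. Mean TTFT is higher after the change (208.59 ms vs. 103.08 ms), driven by a large P99 outlier (519.46 ms) in this small sample.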
Motivation
Modifications
To enable this feature, pass --enable-beta-spec.
This PR raises bs=1 performance from 250 to 280 tokens/s.
After this PR, execution is still not fully overlapped (Figure 1) because three kernels are too slow (Figure 2); we will address this in a future PR.
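For reference, the minimal form of the flag at server launch (model path and other flags elided here; see the full launch command in the benchmark section above):

```bash
python3 -m sglang.launch_server --model-path <model> --speculative-algorithm EAGLE --enable-beta-spec
```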
Accuracy Tests
Benchmarking and Profiling
Checklist