[Feature] Full Cuda Graph Support for Cutlass MLA and 6% E2E Throughput Improvement #22763
Conversation
Code Review

This pull request enables full CUDA graph support for Cutlass MLA in decode-only scenarios. The changes are minimal and correctly implemented by introducing a CutlassMLAMetadataBuilder that signals this capability. My review includes a suggestion to improve code style for better adherence to PEP 8.
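For context on how small the enablement is: in vLLM's v1 attention stack, a backend opts into full CUDA graph capture by advertising the capability on its metadata builder. A minimal sketch of the idea follows; the base classes mirror vLLM's MLA common backend, but the exact capability-flag name should be treated as an assumption that may differ by version.

```python
# Sketch only: base classes follow vllm/v1/attention/backends/mla/common.py,
# but the capability-flag name is an assumption and may vary across versions.
from typing import ClassVar

from vllm.v1.attention.backends.mla.common import (MLACommonMetadata,
                                                   MLACommonMetadataBuilder)


class CutlassMLAMetadataBuilder(MLACommonMetadataBuilder[MLACommonMetadata]):
    # Advertise full CUDA graph support; as with other MLA backends this
    # applies to pure-decode batches, while prefill keeps running piecewise.
    full_cudagraph_supported: ClassVar[bool] = True
```

The compilation machinery can then check this flag when deciding whether decode-only batches may run inside a captured graph, which is why the diff stays this small.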
LGTM! thanks for doing this!
😮 that is just about as clean as you can do it when it works out of the box >>
Are there any unit tests we have for full cudagraph attention backends? Just thinking about how we can test this over time.
Sounds good, just adding a new unit test for it:

pytest vllm/tests/compile/piecewise/test_full_cudagraph.py

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================== 2 passed, 22 skipped, 3 warnings in 119.77s (0:01:59) ====================================
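For reference, the core of such a consistency check can be sketched as comparing greedy generations with and without full CUDA graph compilation. This is only an illustration using the public vllm.LLM API, not the actual contents of test_full_cudagraph.py; the model name is an assumption, and a real test would also isolate the two engines in separate processes rather than building them back to back.

```python
# Illustrative consistency check, not the actual test_full_cudagraph.py code.
from vllm import LLM, SamplingParams

PROMPTS = ["Hello, my name is", "The capital of France is"]


def _greedy_generate(full_cuda_graph: bool) -> list[str]:
    # Greedy decoding keeps both runs deterministic and directly comparable.
    llm = LLM(
        model="deepseek-ai/DeepSeek-V2-Lite",  # assumed MLA model for the sketch
        trust_remote_code=True,
        compilation_config={"full_cuda_graph": full_cuda_graph},
    )
    params = SamplingParams(temperature=0.0, max_tokens=16)
    return [out.outputs[0].text for out in llm.generate(PROMPTS, params)]


def test_full_cudagraph_matches_piecewise():
    # Capturing the full graph must not change model outputs.
    assert _greedy_generate(True) == _greedy_generate(False)
```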
Purpose
Support full CUDA graph capture for Cutlass MLA (SM100), with a 6% E2E throughput improvement.
Thanks for the previous work enabling Cutlass MLA on SM100!
Test
vllm serve deepseek-ai/DeepSeek-V2-Lite --port 10256 --enable-expert-parallel --data-parallel-size 2 --trust_remote_code -O '{"full_cuda_graph": true}' --cuda-graph-sizes 16 32 64 128 256 512
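For a quick local sanity check without spinning up the server, roughly the same CUDA graph settings can be exercised through the offline API. A sketch under the assumption that compilation_config accepts these fields ("full_cuda_graph", "cudagraph_capture_sizes"); the expert-parallel and data-parallel flags from the serve command are omitted here, and field names should be verified against the vLLM version in use.

```python
# Sketch: offline counterpart of the CUDA-graph settings used above.
# Config field names are assumptions based on vLLM's CompilationConfig.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    trust_remote_code=True,
    compilation_config={
        "full_cuda_graph": True,
        "cudagraph_capture_sizes": [16, 32, 64, 128, 256, 512],
    },
)
out = llm.generate(["Hello"], SamplingParams(temperature=0.0, max_tokens=16))
print(out[0].outputs[0].text)
```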
Accuracy

Performance