[Model] Apply SharedFusedMoE to glm4_moe. #24849
Conversation
Code Review
This pull request refactors Glm4MoE to use SharedFusedMoE when shared experts are present, which is a good improvement for flexibility and performance optimization on Ascend hardware. However, I've found a few critical issues in the implementation that need to be addressed. There's a bug in the forward method that will cause a TypeError during runtime, and another issue in the __init__ method that leads to incorrect model behavior by ignoring the routed_scaling_factor. I've also suggested a refactoring to improve code clarity and maintainability by reducing duplication.
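For context on the routed_scaling_factor point, here is a minimal, hedged sketch with toy tensors and hypothetical names (routed_out, shared_out), not the code under review: it only illustrates why silently dropping the scaling factor changes model output even though no exception is raised.

```python
import torch


def combine_expert_outputs(
    routed_out: torch.Tensor,
    shared_out: torch.Tensor,
    routed_scaling_factor: float,
) -> torch.Tensor:
    # The routed-expert contribution is scaled before being added to the
    # shared-expert contribution; omitting the factor raises no error but
    # changes the hidden states.
    return shared_out + routed_scaling_factor * routed_out


routed = torch.randn(4, 8)
shared = torch.randn(4, 8)
with_factor = combine_expert_outputs(routed, shared, routed_scaling_factor=2.5)
without_factor = combine_expert_outputs(routed, shared, routed_scaling_factor=1.0)
print(torch.allclose(with_factor, without_factor))  # False: the factor matters
```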
I only added SharedFusedMoE support for deepseek and llama4 since adding it everywhere would have been more disruptive. There's no reason it can't be applied to other models that use shared experts.
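To make the suggested extension concrete, here is a hedged sketch of the general pattern using toy modules rather than vLLM's real FusedMoE/SharedFusedMoE classes (their actual constructor signatures are not reproduced here): a model with shared experts hands them to the fused layer, which returns both contributions; a model without them keeps the plain fused layer.

```python
import torch
import torch.nn as nn


class ToyFusedMoE(nn.Module):
    """Stand-in for a routed-experts-only MoE layer."""

    def __init__(self, hidden: int):
        super().__init__()
        self.routed = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.routed(x)


class ToySharedFusedMoE(ToyFusedMoE):
    """Stand-in for a SharedFusedMoE-style layer: it also owns the shared
    experts and returns both contributions so the layer (or its caller)
    decides how to combine and reduce them."""

    def __init__(self, hidden: int, shared_experts: nn.Module):
        super().__init__(hidden)
        self.shared_experts = shared_experts

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        return self.shared_experts(x), super().forward(x)


def build_moe(hidden: int, n_shared_experts: int) -> nn.Module:
    # Models with shared experts opt into the shared-fused variant;
    # models without them keep the plain fused layer.
    if n_shared_experts > 0:
        return ToySharedFusedMoE(hidden, shared_experts=nn.Linear(hidden, hidden))
    return ToyFusedMoE(hidden)


layer = build_moe(hidden=8, n_shared_experts=1)
shared_out, routed_out = layer(torch.randn(4, 8))
```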
LGTM!
Purpose
The class SharedFusedMoE was proposed by @bnellnm in PR #23273. The model glm4_moe has shared experts, but we don't use SharedFusedMoE for glm4_moe; I'm not sure why, please cc @bnellnm. Letting glm4_moe use SharedFusedMoE would solve two problems in vllm-ascend:
- In vllm-ascend we sometimes handle the all-reduce of shared experts inside maybe_all_reduce_tensor_model_parallel, and sometimes perform the all-reduce of shared experts independently. However, in the current glm4_moe modeling, whether the all-reduce of shared experts is performed independently is fixed at init time by must_reduce_shared_expert_outputs, which conflicts with our implementation (see the sketch below).
- … SharedFusedMoE.

In conclusion, I think we should apply SharedFusedMoE to glm4_moe. cc @bnellnm @robertgshaw2-redhat @wangxiyuan @LucasWilkinson @Yikun
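As an illustration of the first point, here is a hedged sketch (toy modules and a fake all-reduce, not vLLM's real glm4_moe or SharedFusedMoE code) contrasting a reduce decision that is baked in at construction time with one that the layer owning the shared experts can make per call.

```python
import torch
import torch.nn as nn


def fake_all_reduce(x: torch.Tensor) -> torch.Tensor:
    # Placeholder for a tensor-parallel all-reduce; a single-process no-op here.
    return x


class ReduceFixedAtInit(nn.Module):
    """glm4_moe-style today: whether shared-expert output is reduced
    independently is decided once in __init__."""

    def __init__(self, shared_experts: nn.Module, must_reduce_shared_expert_outputs: bool):
        super().__init__()
        self.shared_experts = shared_experts
        self.must_reduce = must_reduce_shared_expert_outputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shared_out = self.shared_experts(x)
        if self.must_reduce:  # fixed choice; cannot adapt per forward pass
            shared_out = fake_all_reduce(shared_out)
        return shared_out


class ReduceOwnedByFusedLayer(nn.Module):
    """SharedFusedMoE-style: the fused layer owns the shared experts and can
    choose at call time whether to reduce their output independently or
    together with the routed output."""

    def __init__(self, shared_experts: nn.Module):
        super().__init__()
        self.shared_experts = shared_experts

    def forward(self, x: torch.Tensor, reduce_independently: bool) -> torch.Tensor:
        shared_out = self.shared_experts(x)
        if reduce_independently:
            shared_out = fake_all_reduce(shared_out)
        return shared_out
```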
Test Plan
No need to add new tests.
Test Result
All tests should pass.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.