Describe the bug
When we run gpt-oss-20b PEFT training on a single GPU, the script fails with an AttributeError because n_routed_experts is never set on the expert module:
```
[rank0]:   File "/opt/Automodel/nemo_automodel/components/moe/layers.py", line 493, in forward
[rank0]:     assert self.n_routed_experts % self.ep_size == 0, (
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1964, in __getattr__
[rank0]:     raise AttributeError(
[rank0]: AttributeError: 'GroupedExpertsDeepEP' object has no attribute 'n_routed_experts'
```
When we disable DeepEP, the script runs as expected.
The difference between GroupedExperts and GroupedExpertsDeepEP is that in the DeepEP version n_routed_experts is set only by init_token_dispatcher, which appears to be called only in multi-GPU settings. In vanilla GroupedExperts it is set in the __init__ method, so the attribute always exists. See the sketch below.
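For illustration, here is a minimal, self-contained sketch of the pattern we believe causes the failure. The class and method names mirror nemo_automodel/components/moe/layers.py, but the bodies (and the _num_experts field) are simplified assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn

class GroupedExperts(nn.Module):
    def __init__(self, n_routed_experts: int):
        super().__init__()
        # Vanilla version: the attribute exists as soon as the module is built.
        self.n_routed_experts = n_routed_experts

class GroupedExpertsDeepEP(nn.Module):
    def __init__(self, n_routed_experts: int):
        super().__init__()
        self._num_experts = n_routed_experts  # hypothetical field name

    def init_token_dispatcher(self, ep_size: int):
        # DeepEP version: n_routed_experts is set only here, and this method
        # appears to be skipped in single-GPU runs.
        self.n_routed_experts = self._num_experts
        self.ep_size = ep_size

    def forward(self, x):
        # Mirrors the assert at layers.py:493. If init_token_dispatcher was
        # never called, nn.Module.__getattr__ raises the AttributeError above.
        assert self.n_routed_experts % self.ep_size == 0, (
            "n_routed_experts must be divisible by ep_size"
        )
        return x
```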
Steps/Code to reproduce bug
Run an MoE config with DeepEP enabled on a single GPU (a minimal stand-in repro is sketched below).
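To reproduce the error shape without the full training stack, the sketch above can be driven directly. This is hypothetical (the real entry point is the NeMo Automodel training script), but it trips the same code path:

```python
import torch

experts = GroupedExpertsDeepEP(n_routed_experts=32)
# In a multi-GPU run something like
#   experts.init_token_dispatcher(ep_size=world_size)
# would execute first. On a single GPU it never does, so this raises:
#   AttributeError: 'GroupedExpertsDeepEP' object has no attribute 'n_routed_experts'
experts(torch.randn(4, 8))
```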
Expected behavior
The script runs to completion on a single GPU with DeepEP enabled, just as it does with DeepEP disabled.