
gpt-oss-20b PEFT fails on 1 GPU with DeepEP #968

@torsli

Description

Describe the bug
When we run gpt-oss-20b PEFT training on a single GPU, the script fails with an AttributeError because n_routed_experts is never set on the GroupedExpertsDeepEP module.

[rank0]:   File "/opt/Automodel/nemo_automodel/components/moe/layers.py", line 493, in forward
[rank0]:     assert self.n_routed_experts % self.ep_size == 0, (
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1964, in __getattr__
[rank0]:     raise AttributeError(
[rank0]: AttributeError: 'GroupedExpertsDeepEP' object has no attribute 'n_routed_experts'

When we disable DeepEP, the script runs as expected.

The difference between GroupedExperts and GroupedExpertsDeepEP is that the DeepEP version only sets n_routed_experts in its init_token_dispatcher method, which appears to be called only in multi-GPU settings; vanilla GroupedExperts sets it in __init__. A sketch of one possible fix follows.
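
A minimal sketch of one possible fix, assuming the constructor already receives the expert count and EP size (the real nemo_automodel signature may differ): mirror vanilla GroupedExperts and assign n_routed_experts in __init__, so the attribute exists even when init_token_dispatcher is never called.

import torch.nn as nn

class GroupedExpertsDeepEP(nn.Module):
    # Hypothetical constructor; argument names are assumptions.
    def __init__(self, n_routed_experts: int, ep_size: int = 1):
        super().__init__()
        # Set here (as vanilla GroupedExperts does) instead of only in
        # init_token_dispatcher, which single-GPU runs never reach.
        self.n_routed_experts = n_routed_experts
        self.ep_size = ep_size

    def init_token_dispatcher(self, n_routed_experts: int, ep_size: int) -> None:
        # Multi-GPU path: overwrites the defaults, exactly as before.
        self.n_routed_experts = n_routed_experts
        self.ep_size = ep_size

With this change, the assert at layers.py line 493 sees a real attribute on a single GPU (ep_size == 1, so the divisibility check passes trivially) while the multi-GPU path is unchanged.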

Steps/Code to reproduce bug

Run an MoE config with DeepEP enabled on a single GPU. A minimal standalone reproduction of the failure mode is sketched below.
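
The class below is a stripped-down stand-in that only models the reported behavior, not the actual nemo_automodel implementation: nn.Module.__getattr__ raises AttributeError for any attribute that was never assigned, which is exactly what happens when init_token_dispatcher is skipped.

import torch
import torch.nn as nn

class GroupedExpertsDeepEP(nn.Module):
    """Stand-in that models the bug, not the real implementation."""

    def __init__(self):
        super().__init__()
        self.ep_size = 1
        # n_routed_experts is intentionally NOT set here.

    def init_token_dispatcher(self, n_routed_experts: int) -> None:
        # Only called in multi-GPU settings, so single-GPU runs skip it.
        self.n_routed_experts = n_routed_experts

    def forward(self, x):
        # Same shape as the assert at layers.py:493.
        assert self.n_routed_experts % self.ep_size == 0
        return x

experts = GroupedExpertsDeepEP()
try:
    experts(torch.zeros(1))
except AttributeError as e:
    print(e)  # 'GroupedExpertsDeepEP' object has no attribute 'n_routed_experts'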

Expected behavior

Script runs.

Labels: bug (Something isn't working)