Tutel as an MoE backend in Nanotron for Qwen3-MoE 15B (128 experts, top-k=8) #310

@hahahaahaa

Hello :)
I’d like to use Tutel as the MoE layer implementation in Nanotron to train a Qwen3-MoE 15B model from scratch with 128 experts and top-k = 8.

Cluster with SLURM: up to 256 nodes

GPUs: 4× A100 64 GB per node.

Goal: scale across 32–1,024 GPUs with EP/TP/DP/PP.

  1. Is a similar configuration (for example, Qwen3-30B-A3B) supported out of the box, or are patches required to enable a Tutel backend (e.g., a moe_config.backend: tutel switch)?

  2. What is the recommended parallelism layout (EP/TP/PP/DP) for 32–1,024 GPUs with 128 experts and top-k = 8? Is there any guidance on expert placement to minimize all-to-all traffic across nodes?
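
To make question 2 concrete, here is a minimal sketch of the layout search space I have in mind. It is not Nanotron or Tutel code: `candidate_layouts` is a hypothetical helper, and it assumes (a) TP stays inside one 4-GPU node, (b) the EP group is carved out of the DP ranks, so EP must divide DP, and (c) EP must divide the 128 experts evenly. Any of these assumptions may be wrong for Nanotron, which is part of what I am asking.

```python
GPUS_PER_NODE = 4    # from the cluster description above
NUM_EXPERTS = 128    # target Qwen3-MoE configuration


def candidate_layouts(world_size, num_experts=NUM_EXPERTS,
                      gpus_per_node=GPUS_PER_NODE, pp_choices=(1, 2, 4, 8)):
    """Hypothetical helper: enumerate (tp, pp, dp, ep) splits of world_size.

    Assumptions (not confirmed for Nanotron or Tutel):
      * tp is a power of two and never exceeds one node,
      * ep divides dp (expert-parallel groups reuse data-parallel ranks),
      * ep divides num_experts so every rank holds the same number of experts.
    """
    layouts = []
    tp = 1
    while tp <= gpus_per_node:
        for pp in pp_choices:
            if world_size % (tp * pp) != 0:
                continue  # tp * pp * dp must equal world_size exactly
            dp = world_size // (tp * pp)
            for ep in range(1, dp + 1):
                if dp % ep == 0 and num_experts % ep == 0:
                    layouts.append({
                        "tp": tp,
                        "pp": pp,
                        "dp": dp,
                        "ep": ep,
                        "experts_per_rank": num_experts // ep,
                    })
        tp *= 2
    return layouts


if __name__ == "__main__":
    # Show the 32-GPU layouts that spread experts over at least 8 ranks.
    for layout in candidate_layouts(32):
        if layout["ep"] >= 8:
            print(layout)
```

With EP = 4 the whole expert group fits in one node, so the all-to-all never leaves it; anything larger crosses nodes, which is why I am asking about placement.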

Many thanks!