
nmoe

   _ __   _ __ ___   ___   ___
  | '_ \ | '_ ` _ \ / _ \ / _ \
  | | | || | | | | | (_) |  __/
  |_| |_||_| |_| |_|\___/ \___|

No all-to-all. No tensor parallel. B200-only.

This repo is an opinionated Mixture-of-Experts trainer hard-targeted to NVIDIA Blackwell B200 (sm_100a). MoE expert parallelism is implemented via RDEP: direct dispatch/return using CUDA IPC (intra-node) and NVSHMEM (inter-node), instead of NCCL all-to-all collectives on the expert path.

Quick start

This repository is container-first. The supported way to build and run is via the Dockerfiles in docker/.

Boot a machine with B200 GPUs and run a minimal single-GPU smoke test (moonlet) inside the training image:

# Build base image (Dockerfile.train expects this tag)
docker build -f docker/Dockerfile.base -t xjdr/nmoe:base .

# Build training image
docker build -f docker/Dockerfile.train -t xjdr/nmoe_train:latest .

# Run single-GPU training (mount /data for datasets, checkpoints, metrics)
docker run --gpus all -v /data:/data xjdr/nmoe_train:latest \
  python -m nmoe.train configs/moonlet.toml
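
Optionally, confirm the container actually sees an sm_100-class device before launching. A minimal sanity check, assuming stock PyTorch (this script is not part of nmoe; B200 reports CUDA compute capability 10.0):

# Sanity check (sketch): verify the visible GPU is an sm_100-class device (B200).
import torch

assert torch.cuda.is_available(), "no CUDA device visible inside the container"
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"compute capability {major}.{minor}")
assert (major, minor) == (10, 0), "nmoe targets B200 (sm_100a) only"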

Multi-GPU and multi-node

Single-node (8×GPU) training:

torchrun --standalone --nproc_per_node=8 -m nmoe.train configs/moonlight.toml

Multi-node runs require NVSHMEM. Build the NVSHMEM-enabled image:

docker build -f docker/Dockerfile.dist -t xjdr/nmoe_dist:latest .

Kubernetes manifests in k8s/ are templates for training, NVIZ, and profiling; edit hostnames, images, and storage before deploying.
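
Under torchrun (and most launchers), each worker reads its placement from environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE. The sketch below shows that generic wiring with stock torch.distributed; it is illustrative only and is not nmoe's actual bootstrap, which additionally brings up NVSHMEM/IPC for the RDEP path.

# Generic torchrun wiring (illustrative sketch; not nmoe's bootstrap). torchrun exports
# RANK, LOCAL_RANK, and WORLD_SIZE to every worker it spawns.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")  # non-expert collectives; the MoE path uses RDEP, not all-to-all
print(f"rank {dist.get_rank()} / {dist.get_world_size()} -> cuda:{local_rank}")
dist.destroy_process_group()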

Configs

Config          Model        Experts         GPUs  Use Case
moonlet.toml    7B           64 (6 active)   1     Single-GPU research
moonlight.toml  16B          64 (6 active)   8     Single-node RDEP
dsv2.toml       DeepSeek-V2  160 (6 active)  8+    Multi-node
dsv3.toml       DeepSeek-V3  256 (8 active)  32+   Production
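
Configs are plain TOML and can be inspected with the standard library before a run (tomllib ships with Python 3.11+). This snippet is a convenience sketch, not part of nmoe:

# Inspect a config before launching (sketch; uses only the stdlib TOML parser).
import tomllib

with open("configs/moonlet.toml", "rb") as f:
    cfg = tomllib.load(f)

for key, value in sorted(cfg.items()):
    print(f"{key} = {value!r}")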

Why RDEP

Traditional MoE uses NCCL all-to-all: every GPU waits for every other GPU. RDEP replaces this with direct NVSHMEM puts—each GPU writes tokens directly into the expert owner's buffer. No collective. No barrier. No waiting.

Source rank                       Owner rank
───────────                       ──────────
tokens ──▶ dispatch ─────────────▶ symmetric buffer
              │                         │
              │   nvshmem_putmem        │
              │   + atomic slot         ▼
              │                    expert GEMM
              │                         │
output ◀── scatter ◀───────────── return
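
The same dataflow as a single-process PyTorch reference: group tokens by owning expert (dispatch), run the per-expert compute, then scatter results back to source order (return). In RDEP the grouping is replaced by NVSHMEM/IPC puts into the owner's symmetric buffer; this sketch mirrors only the logical ordering, not the communication.

# Single-process reference of the dispatch -> expert -> return dataflow (top-1 routing
# for brevity). Illustrative sketch only; the real kernels live in nmoe/csrc.
import torch

def rdep_dataflow_reference(tokens, expert_ids, experts):
    """tokens: [T, d]; expert_ids: [T] long tensor; experts: list of callables."""
    order = torch.argsort(expert_ids)                 # "dispatch": group tokens by expert
    grouped = tokens[order]
    counts = torch.bincount(expert_ids, minlength=len(experts)).tolist()
    out_grouped = torch.empty_like(grouped)
    start = 0
    for e, n in enumerate(counts):                    # per-expert compute ("expert GEMM")
        if n:
            out_grouped[start:start + n] = experts[e](grouped[start:start + n])
        start += n
    out = torch.empty_like(tokens)
    out[order] = out_grouped                          # "return": scatter back to source order
    return out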

Data

Training consumes pre-tokenized .npy shards.

Preprocess from HuggingFace:

python -m nmoe.data.cli prep \
    --source hf \
    --dataset HuggingFaceFW/fineweb-edu \
    --output /data/fineweb_edu \
    --name fineweb_edu
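
The resulting shards are plain .npy arrays of token ids, so they can be spot-checked directly with NumPy. A sketch (the shard filename and dtype below are assumptions; see nmoe/data/README.md for the actual layout):

# Spot-check a pre-tokenized shard (sketch; the filename is hypothetical).
import numpy as np

tokens = np.load("/data/fineweb_edu/shard_00000.npy", mmap_mode="r")
print("shape:", tokens.shape, "dtype:", tokens.dtype)
print("first 16 token ids:", np.asarray(tokens[:16]))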

Two workflows:

  • Direct shards (research): set data_path in config
  • Flows (production): set flow_mode, mixture_toml, flow_profiles_toml

See nmoe/data/README.md for the full data pipeline.

Metrics & NVIZ

Training writes:

  • Experiments → SQLite (/data/experiments.db)
  • Metrics → DuckDB (/data/metrics/{run_id}/rank_{rank}.duckdb)

NVIZ is the included dashboard. See nviz/README.md.
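
Outside NVIZ, the per-rank DuckDB files can also be queried ad hoc with the duckdb Python package. A sketch (the table and column names inside each file depend on the metrics writer and are not assumed here):

# Ad-hoc look at a per-rank metrics file (sketch; run_id is a placeholder).
import duckdb

con = duckdb.connect("/data/metrics/{run_id}/rank_0.duckdb", read_only=True)
print(con.sql("SHOW TABLES"))  # list whatever tables the metrics writer created
con.close()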

Architecture

nmoe/
├── train.py          # Training loop
├── model.py          # Transformer + MoE
├── moe.py            # Fused MoE autograd
├── rdep.py           # RDEP orchestration
├── checkpoint.py     # Split checkpoints
├── config.py         # TOML config
├── metrics.py        # DuckDB writer
├── csrc/             # CUDA kernels
├── data/             # Data pipeline, HYDRA
├── attention/        # MLA, DSA, SWA
└── eval/             # Evaluation hooks

What's Inside

RDEP Kernels — Fused dispatch/return using NVSHMEM (inter-node) and IPC (intra-node). BF16 and block-scaled (FP8/NVFP4) paths.

Grouped GEMMs — cuBLASLt with per-expert scaling. SM100-optimized via CuTe DSL.

Deterministic Resume — Checkpoint includes RNG state, shard cursor, config fingerprint.
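
The underlying pattern is to capture RNG state and the data cursor alongside the usual model/optimizer state. A generic sketch with stock PyTorch, illustrating the idea rather than nmoe's actual checkpoint format:

# Generic deterministic-resume pattern (sketch; not nmoe's checkpoint format).
import torch

def save_resume_state(path, step, shard_cursor, config_fingerprint):
    torch.save({
        "step": step,
        "shard_cursor": shard_cursor,              # where the data loader resumes
        "config_fingerprint": config_fingerprint,  # detect config drift on resume
        "cpu_rng": torch.get_rng_state(),
        "cuda_rng": torch.cuda.get_rng_state_all(),
    }, path)

def load_resume_state(path):
    state = torch.load(path, map_location="cpu")
    torch.set_rng_state(state["cpu_rng"])
    torch.cuda.set_rng_state_all(state["cuda_rng"])
    return state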

HYDRA — LLM-as-judge data quality pipeline. See nmoe/data/HYDRA.md. This repo includes nmoe/data/hydra_judge.pt (a small judge head state_dict); see nmoe/data/HYDRA_JUDGE_HEAD.md.

Tests

The project is primarily validated via end-to-end training runs. Some Triton kernel modules include optional pytest-guarded tests (e.g., nmoe/triton/nsa.py, nmoe/triton/swa.py).

Contributing

nmoe is intentionally narrow and opinionated: B200-only (sm_100a), RDEP expert parallelism, TOML configs, and no NCCL all-to-all on the MoE path. We prefer one clear way to do each supported job over many interchangeable stacks.

Acknowledgements

This codebase borrows ideas from and interoperates with upstream ecosystems including PyTorch, Triton, NVSHMEM, CUTLASS, and the DeepSeek family of MoE architectures. See THIRD_PARTY_NOTICES.md for license attributions.

Cite

@misc{nmoe,
  title = {nmoe: B200-targeted MoE training with RDEP},
  year = {2025},
  publisher = {GitHub}
}

Non-Goals

  • Tensor parallel (ever)
  • NCCL all-to-all for MoE (ever)
  • H100/A100 support
  • Fallback paths

One hardware target. One distribution strategy. B200 or bust.

Troubleshooting

Problem             Fix
sm_100a errors      You need B200. No workarounds.
NVSHMEM init fails  Use IPC mode for single-node, or check the bootstrap config.
OOM                 Reduce batch_size or seq_len.

License

Apache-2.0. See LICENSE, NOTICE, and THIRD_PARTY_NOTICES.md.
