NeMo DFM (Diffusion Foundation Models) is a library under NeMo Framework, focusing on diffusion models for Video, Image, and Text generation. It unifies cutting-edge diffusion-based architectures and training techniques, prioritizing efficiency and performance from research prototyping to production deployment.
Dual-Path Architecture: DFM provides two complementary training paths to maximize flexibility:
- Megatron Bridge Path: Built on NeMo Megatron Bridge, which leverages Megatron Core for maximum scalability with n-D parallelism (TP, PP, CP, EP, VPP, DP).
- AutoModel Path: Built on NeMo AutoModel for PyTorch DTensor-native SPMD training, offering easy experimentation and Day-0 support for Hugging Face models.
Choose the path that best fits your workflow, or use both for different stages of development!
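As a rough illustration of how the n-D parallel sizes compose on the Megatron Bridge path, the sketch below derives the data-parallel (DP) size from a GPU count and the chosen TP, PP, and CP sizes. It assumes the common Megatron-style layout in which the world size is the product of the tensor, pipeline, context, and data parallel sizes. The function name `data_parallel_size` is hypothetical and not part of DFM.

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return the data-parallel size implied by the other parallel sizes.

    Assumption: world_size == tp * pp * cp * dp.
    """
    if min(world_size, tp, pp, cp) < 1:
        raise ValueError("all sizes must be positive integers")
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # 64 GPUs with TP=2, PP=2, CP=2 leave 8 data-parallel replicas.
    print(data_parallel_size(64, tp=2, pp=2, cp=2))  # prints 8
```

A configuration whose product of TP, PP, and CP does not divide the GPU count is rejected, which is the kind of mismatch worth checking before launching a job.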
# Initialize all submodules (Megatron-Bridge, Automodel, and nested Megatron-LM)
git submodule update --init --recursive
# Build the container
docker build -f docker/Dockerfile.ci -t dfm:dev .
# Start an interactive shell in the container with the repository mounted
docker run --rm -it --gpus all \
--entrypoint bash \
-v $(pwd):/opt/DFM dfm:dev
You can find all predefined recipes under the recipes directory.
Note: You must use uv to run the recipes. Set --group to megatron-bridge.
uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
examples/megatron/recipes/wan/pretrain_wan.py \
--config-file examples/megatron/recipes/wan/conf/wan_1_3B.yaml \
--training-mode pretrain \
--mock
Train with PyTorch-native DTensor parallelism and direct Hugging Face integration:
You can find pre-configured recipes under the automodel/finetune and automodel/pretrain directories.
Note: AutoModel examples live under dfm/examples/automodel. Use uv with --group automodel. Configs are YAML-driven; pass -c <path> to override the default.
The fine-tune recipe sets up WAN 2.1 Text-to-Video training with Flow Matching using FSDP2 Hybrid Sharding.
It parallelizes heavy transformer blocks while keeping lightweight modules (e.g., VAE) unsharded for efficiency.
Adjust batch sizes, LR, and parallel sizes in dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml.
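To make the hybrid sharding idea above concrete, here is a minimal sketch of how ranks could be arranged into a (replicate, shard) grid. It is an illustration only: the function `hybrid_shard_mesh` is hypothetical, and the contiguous row-major rank layout is an assumption rather than a description of what the DFM recipe does internally.

```python
def hybrid_shard_mesh(world_size: int, shard_size: int) -> list[list[int]]:
    """Arrange ranks as a (replicate, shard) grid.

    Each row is one shard group: parameters are sharded across the ranks in a
    row. Ranks in the same column hold the same shard, so they form a
    replicate group.
    """
    if world_size < 1 or shard_size < 1:
        raise ValueError("world_size and shard_size must be positive")
    if world_size % shard_size != 0:
        raise ValueError("world_size must be a multiple of shard_size")
    return [
        list(range(start, start + shard_size))
        for start in range(0, world_size, shard_size)
    ]


if __name__ == "__main__":
    # Two nodes of 8 GPUs: shard within each node, replicate across nodes.
    for row in hybrid_shard_mesh(16, 8):
        print(row)
```

With 16 ranks and a shard size of 8, the grid has two rows, so each parameter shard exists in two copies, one per row.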
The generation script demonstrates distributed inference with AutoModel DTensor managers, producing an MP4 on rank 0. You can adjust the frame size, number of frames, sampling steps, and CFG through command-line flags.
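Assuming CFG here refers to classifier-free guidance, the sketch below shows the standard way a guidance scale combines conditional and unconditional model outputs. The function `apply_cfg` is hypothetical and plain Python lists stand in for tensors; the generation script's actual flag names and implementation are not shown in this document.

```python
def apply_cfg(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Combine conditional and unconditional predictions.

    Standard formulation: guided = uncond + scale * (cond - uncond).
    scale == 1.0 returns the conditional prediction unchanged.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    print(apply_cfg([1.0, 2.0], [0.0, 1.0], scale=5.0))  # prints [5.0, 6.0]
```

Larger scales push the output further along the direction from the unconditional to the conditional prediction, which typically trades diversity for prompt adherence.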
# Fine-tune WAN 2.1 T2V with FSDP2 (single node, 8 GPUs)
uv run --group automodel torchrun --nproc-per-node=8 \
dfm/examples/automodel/finetune/finetune.py \
-c dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml
# Generate videos with FSDP2 (distributed inference)
uv run --group automodel torchrun --nproc-per-node=8 \
dfm/examples/automodel/generate/wan_generate.py
Megatron Bridge delivers maximum throughput and scalability, with near-linear scaling to thousands of nodes. AutoModel provides an easy on-ramp for experimentation and research with PyTorch-native SPMD training.
- Multi-Modal Diffusion: Support for video, image, and text generation
- Advanced Samplers: EDM, Flow Matching, and custom diffusion schedules
- Flexible Architectures: DiT (Diffusion Transformers), WAN (World Action Networks)
- Efficient Data Loading: Data pipelines with sequence packing
- Distributed Checkpointing: SafeTensors-based sharded checkpoints
- Memory Optimization: Gradient checkpointing, mixed precision, efficient attention
- Hugging Face Integration: Seamless integration with the HF ecosystem
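For readers new to Flow Matching, which appears in the feature list above, the following minimal sketch builds one training pair under a linear interpolation path. It assumes the convention that t = 0 is pure noise and t = 1 is data, with a constant velocity target of data minus noise. Other conventions exist, and this is not taken from DFM's implementation. The helper names are hypothetical and plain lists stand in for tensors.

```python
import random


def flow_matching_pair(data, noise, t):
    """Return (x_t, target_velocity) for a linear path from noise to data.

    Assumed convention: x_t = (1 - t) * noise + t * data, velocity = data - noise.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    x_t = [(1.0 - t) * n + t * d for d, n in zip(data, noise)]
    target = [d - n for d, n in zip(data, noise)]
    return x_t, target


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [1.0, -2.0, 0.5]
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()
    x_t, target = flow_matching_pair(data, noise, t)
    # A model would be trained to predict `target` from (x_t, t);
    # a perfect prediction gives zero loss.
    print(mse(target, target))  # prints 0.0
```

In training, the model's velocity prediction replaces the first argument of `mse`, and t is sampled afresh for every example.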
DFM provides out-of-the-box support for state-of-the-art diffusion architectures:
| Model | Type | Megatron Bridge | AutoModel | Description |
|---|---|---|---|---|
| DiT | Image/Video | pretrain, inference | | Diffusion Transformers with scalable architecture |
| WAN 2.1 | Video | pretrain, finetune, inference | pretrain, finetune, inference | World Action Networks for video generation |
For detailed performance benchmarks, including throughput metrics across different GPU systems and model configurations, see the [Performance Summary](https://github.com/NVIDIA-NeMo/DFM/blob/main/docs/performance-summary.md) in our documentation.
DFM/
├── dfm/
│   └── src/
│       ├── megatron/                # Megatron Bridge path
│       │   ├── base/                # Base utilities for Megatron
│       │   ├── data/                # Data loaders and task encoders
│       │   │   ├── common/          # Shared data utilities
│       │   │   └── <model_name>/    # Model-specific data handling
│       │   ├── model/               # Model implementations
│       │   │   ├── common/          # Shared model components
│       │   │   └── <model_name>/    # Model-specific implementations
│       │   └── recipes/             # Training recipes
│       │       └── <model_name>/    # Model-specific training configs
│       ├── automodel/               # AutoModel path (DTensor-native)
│       │   ├── _diffusers/          # Diffusion pipeline integrations
│       │   ├── datasets/            # Dataset implementations
│       │   ├── distributed/         # Parallelization strategies
│       │   ├── flow_matching/       # Flow matching implementations
│       │   ├── recipes/             # Training scripts
│       │   └── utils/               # Utilities and validation
│       └── common/                  # Shared across both paths
│           ├── data/                # Common data utilities
│           └── utils/               # Batch ops, video utils, etc.
└── examples/                        # Example scripts and configs
We welcome contributions! Please see our Contributing Guide for details on:
- Setting up your development environment
- Code style and testing guidelines
- Submitting pull requests
- Reporting issues
For questions or discussions, please open an issue on GitHub.
NeMo DFM builds upon the excellent work of:
- Megatron Core - Advanced model parallelism
- Megatron Bridge - Hugging Face ↔ Megatron bridge
- NeMo AutoModel - PyTorch-native SPMD training
- PyTorch Distributed - Foundation for distributed training
- Diffusers - Diffusion model implementations