Stars
Allow torch tensor memory to be released and resumed later
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
The best workflows and configurations I've developed, having heavily used Claude Code since the day of its release. Workflows are based on applied learnings from our AI-native startup.
Training library for Megatron-based models
thunlp / Seq1F1B
Forked from NVIDIA/Megatron-LM. Sequence-level 1F1B schedule for LLMs.
slime is an LLM post-training framework for RL Scaling.
Distributed Compiler based on Triton for Parallel Systems
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Pipeline Parallelism Emulation and Visualization
Analyze computation-communication overlap in V3/R1.
DeepEP: an efficient expert-parallel communication library
A library to analyze PyTorch traces.
A PyTorch native platform for training generative AI models
A PyTorch Toolbox for Grouped GEMM in MoE Model Training
fanshiqing / grouped_gemm
Forked from tgale96/grouped_gemm. PyTorch bindings for CUTLASS grouped GEMM.
A project dedicated to making GPU partitioning on Windows easier!
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
dzhulgakov / llama-mistral
Forked from meta-llama/llama. Inference code for Mistral and Mixtral hacked into the original Llama implementation
chinalist for SwitchyOmega and SmartProxy
Scalable toolkit for efficient model alignment
Virtual whiteboard for sketching hand-drawn like diagrams