Stars
Allow torch tensor memory to be released and resumed later
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
The best workflows and configurations I've developed from heavy use of Claude Code since the day of its release. Workflows are based on applied learnings from our AI-native startup.
Training library for Megatron-based models
slime is an LLM post-training framework for RL Scaling.
Distributed Compiler based on Triton for Parallel Systems
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Pipeline Parallelism Emulation and Visualization
Analyze computation-communication overlap in V3/R1.
DeepEP: an efficient expert-parallel communication library
A library to analyze PyTorch traces.
A PyTorch native platform for training generative AI models
A PyTorch Toolbox for Grouped GEMM in MoE Model Training
A project dedicated to making GPU partitioning on Windows easier!
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
chinalist for SwitchyOmega and SmartProxy
Scalable toolkit for efficient model alignment
Virtual whiteboard for sketching hand-drawn like diagrams
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance…
Latency and Memory Analysis of Transformer Models for Training and Inference