Stars
Tensors and dynamic neural networks in Python with strong GPU acceleration
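This description matches PyTorch; assuming so, a minimal sketch of the two ideas it names — GPU-backed tensors and define-by-run autograd:

```python
import torch

# Tensors with GPU acceleration (falls back to CPU if no GPU is present)
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(3, 3, device=device, requires_grad=True)

# Dynamic (define-by-run) autograd: the graph is built as operations execute
y = (x * x).sum()
y.backward()
print(x.grad)  # dy/dx = 2x
```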
Ongoing research on training transformer models at scale
A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit and 4-bit floating-point (FP8 and FP4) precision on Hopper, Ada, and Blackwell GPUs, to provide better performance…
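This description matches NVIDIA's Transformer Engine; assuming so, a minimal FP8 sketch following its documented PyTorch quickstart (the dimensions and recipe arguments are illustrative, and an FP8-capable GPU such as Hopper is required):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Drop-in replacement for torch.nn.Linear with FP8 support
layer = te.Linear(768, 768, bias=True)
inp = torch.randn(16, 768, device="cuda")

# Delayed-scaling FP8 recipe; hyperparameters here are illustrative
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

# Run the forward pass under FP8 autocasting
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)
out.sum().backward()
```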
A PyTorch native platform for training generative AI models
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
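A sketch of that high-level Python API, following the project's published LLM-API quickstart; the model name is just an example, and the engine is built or loaded behind the scenes:

```python
from tensorrt_llm import LLM, SamplingParams

# High-level entry point; the checkpoint name is illustrative
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

params = SamplingParams(temperature=0.8, top_p=0.95)
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)
```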
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
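This description matches JAX; assuming so, the three composable transformations it names in one small, self-contained example:

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.sum((x @ w) ** 2)

grad_loss = jax.grad(loss)                        # differentiate
batched = jax.vmap(grad_loss, in_axes=(None, 0))  # vectorize over a batch of x
fast = jax.jit(batched)                           # JIT-compile to CPU/GPU/TPU

w = jnp.ones((4, 2))
xs = jnp.ones((8, 4))          # batch of 8 inputs
print(fast(w, xs).shape)       # (8, 4, 2): one gradient per batch element
```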
Development repository for the Triton language and compiler
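A minimal Triton kernel in the style of the project's first tutorial (an elementwise vector add); this sketch assumes a CUDA GPU:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(98432, device="cuda")
y = torch.rand(98432, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)   # one program instance per block
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```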
Training library for Megatron-based models
Virtual whiteboard for sketching hand-drawn-like diagrams
A scalable generative AI framework built for researchers and developers working on Large Language Models, multimodal models, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
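A minimal text-to-image sketch with the 🤗 Diffusers pipeline API; the checkpoint name is illustrative and the weights download on first use:

```python
import torch
from diffusers import DiffusionPipeline

# Load a pretrained pipeline (checkpoint name is just an example)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```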
Oh my tmux! My self-contained, pretty & versatile tmux configuration made with 💛🩷💙🖤❤️🤍
Making large AI models cheaper, faster, and more accessible
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
Open deep learning compiler stack for CPU, GPU, and specialized accelerators
cudnn_frontend provides a C++ wrapper for the cuDNN backend API and samples showing how to use it
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
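A minimal training-loop sketch around `deepspeed.initialize`, DeepSpeed's documented entry point; the config values (ZeRO stage 2, fp16) and the model are illustrative, and the script is meant to be launched with the `deepspeed` launcher:

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)
ds_config = {  # illustrative config: fp16 with ZeRO stage 2
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(8, 1024, device=engine.device, dtype=torch.float16)
loss = engine(x).float().pow(2).mean()
engine.backward(loss)  # handles loss scaling and gradient partitioning
engine.step()
```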
slime is an LLM post-training framework for RL scaling.
A validation and profiling tool for AI infrastructure
A library to analyze PyTorch traces.
Optimized primitives for collective multi-GPU communication
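This description matches NCCL, which itself exposes a C API (`ncclAllReduce` and friends); from Python it is usually reached through a framework. A sketch using PyTorch's NCCL backend, to be launched with `torchrun --nproc_per_node=<gpus>`:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # one process per GPU
rank = dist.get_rank()
torch.cuda.set_device(rank)

t = torch.full((4,), float(rank), device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # ncclAllReduce under the hood
print(f"rank {rank}: {t.tolist()}")

dist.destroy_process_group()
```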
DeepEP: an efficient expert-parallel communication library
The largest collection of PyTorch image encoders / backbones, including train, eval, inference, and export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), …
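This description matches timm; assuming so, creating one of those pretrained backbones via `timm.create_model` (the architecture name is an example):

```python
import timm
import torch

model = timm.create_model("resnet50", pretrained=True).eval()

x = torch.randn(1, 3, 224, 224)  # one 224x224 RGB image
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # (1, 1000): ImageNet class logits
```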
Fast and memory-efficient exact attention
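This description matches the flash-attn package; a sketch of its `flash_attn_func` entry point, which expects half-precision tensors on a CUDA device in (batch, seqlen, nheads, headdim) layout:

```python
import torch
from flash_attn import flash_attn_func

# (batch, seqlen, nheads, headdim); fp16/bf16 on GPU is required
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # exact attention, memory-efficient
print(out.shape)  # (2, 1024, 8, 64)
```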
Reference implementations of MLPerf® training benchmarks
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training