Stars
DGXC Benchmarking provides ready-to-use recipe templates for evaluating the performance of specific AI use cases across hardware and software combinations.
A library for exporting NeMo and Hugging Face models to optimized inference backends and deploying them for efficient querying.
An experimental implementation of compiler-driven automatic sharding of models across a given device mesh.
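To make the idea concrete, here is a minimal sketch of sharding a tensor across a device mesh using PyTorch's DTensor API; this illustrates the general mechanism, not the project's own interface, and the module paths assume a recent PyTorch release.

```python
# Illustrative sketch of device-mesh sharding with PyTorch DTensor (assumption:
# PyTorch >= 2.5, where torch.distributed.tensor is public). Not this project's API.
# Run with: torchrun --nproc-per-node=2 shard_demo.py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cpu", (2,))        # 1-D mesh over 2 ranks ("cuda" on GPUs)
weight = torch.randn(8, 4)
# Shard dim 0 across the mesh: each rank materializes a 4x4 slice.
dweight = distribute_tensor(weight, mesh, [Shard(0)])
print(dweight.to_local().shape)             # torch.Size([4, 4]) on each rank
```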
FlagGems is an operator library for large language models implemented in the Triton Language.
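For a flavor of what such an operator looks like, here is a minimal elementwise-add kernel in the standard Triton tutorial style; it is an illustrative sketch, not code taken from FlagGems.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the inputs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the ragged tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Inputs must be contiguous CUDA tensors of equal shape.
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```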
Ship correct and fast LLM kernels to PyTorch
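One documented way to expose a kernel to PyTorch is torch.library.custom_op (PyTorch 2.4+); the sketch below registers a toy op under a made-up "demo" namespace and is illustrative, not this project's specific mechanism.

```python
import torch

# Register a toy custom operator (assumption: PyTorch >= 2.4, which provides
# torch.library.custom_op; the "demo" namespace and op are hypothetical).
@torch.library.custom_op("demo::scale", mutates_args=())
def scale(x: torch.Tensor, s: float) -> torch.Tensor:
    return x * s

# A "fake" (meta) implementation so the op can be traced, e.g. by torch.compile.
@scale.register_fake
def _(x, s):
    return torch.empty_like(x)

print(scale(torch.arange(3.0), 2.0))  # tensor([0., 2., 4.])
```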
PyTorch Distributed-native training library for LLMs/VLMs with out-of-the-box Hugging Face support
Scalable toolkit for efficient model reinforcement learning
Distributed Compiler based on Triton for Parallel Systems
The NVIDIA NeMo Agent toolkit is an open-source library for efficiently connecting and optimizing teams of AI agents.
CUDA Templates and Python DSLs for High-Performance Linear Algebra
TorchOpt is an efficient library for differentiable optimization built upon PyTorch.
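A minimal sketch of the functional, Optax-style init/update/apply loop that TorchOpt provides; the call names follow its documented pattern, but treat exact signatures as assumptions for your installed version.

```python
import torch
import torchopt

# Minimize ||w - 1||^2 with TorchOpt's functional Adam (Optax-style API).
w = torch.zeros(3, requires_grad=True)

optimizer = torchopt.adam(lr=0.1)
opt_state = optimizer.init((w,))                 # optimizer state for the params

for _ in range(200):
    loss = ((w - 1.0) ** 2).sum()
    grads = torch.autograd.grad(loss, (w,))
    updates, opt_state = optimizer.update(grads, opt_state)
    (w,) = torchopt.apply_updates((w,), updates)

print(w)  # approaches tensor([1., 1., 1.])
```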
Run PyTorch LLMs locally on servers, desktop and mobile
verl: Volcano Engine Reinforcement Learning for LLMs
Ray is an AI compute engine. It consists of a core distributed runtime and a set of AI libraries for accelerating ML workloads.
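A minimal example of Ray's core task API, following its documented remote-function pattern:

```python
import ray

ray.init()  # start (or attach to) a local Ray runtime

@ray.remote
def square(x: int) -> int:
    # Executes as a task in a Ray worker process, possibly on another node.
    return x * x

# .remote() returns futures (object refs); ray.get() blocks for the results.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```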
This tool facilitates debugging convergence issues and testing new algorithms and recipes for training LLMs with NVIDIA libraries such as Transformer Engine, Megatron-LM, and NeMo.
Safe code refactoring for modern Python.
Custom recipes for post-collection analysis of NVIDIA Nsight Systems profiles.
A lightweight library for PyTorch training tools and utilities
Simple and efficient PyTorch-native transformer text generation in under 1000 lines of Python.
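At its core, this style of PyTorch-native generation is a plain autoregressive decode loop; the generic sketch below assumes a hypothetical `model` callable that maps token ids to logits and is not code from the repository.

```python
import torch

@torch.no_grad()
def greedy_generate(model, tokens: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    # tokens: (1, seq_len) prompt ids; model(tokens) -> logits (1, seq_len, vocab).
    for _ in range(max_new_tokens):
        logits = model(tokens)                     # forward over the whole prefix
        next_id = logits[:, -1, :].argmax(dim=-1)  # greedy choice of the next token
        tokens = torch.cat([tokens, next_id[:, None]], dim=1)
    return tokens
```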
[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
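For reference, DoRA reparameterizes a frozen weight W0 as a learnable magnitude m times a LoRA-updated, renormalized direction, W' = m · (W0 + BA)/‖W0 + BA‖. The sketch below is a hedged reading of that update rule (norms taken per output row here, and DoRALinear is a hypothetical wrapper), not the official implementation.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Hypothetical sketch of DoRA's update rule; not the official code."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.weight = nn.Parameter(base.weight.detach().clone(), requires_grad=False)
        self.bias = base.bias
        out_f, in_f = self.weight.shape
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)   # low-rank factors
        self.B = nn.Parameter(torch.zeros(out_f, r))         # B = 0, so BA starts at 0
        # Learnable magnitude, initialized to the per-row norm of the frozen weight.
        self.m = nn.Parameter(self.weight.norm(dim=1, keepdim=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight + self.B @ self.A                    # direction update: W0 + BA
        w = self.m * (w / w.norm(dim=1, keepdim=True))       # renormalize, rescale by m
        return nn.functional.linear(x, w, self.bias)
```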
pytest plugin for distributed testing and loop-on-failures testing modes.
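Usage requires no changes to the tests themselves; the CLI flags in the comments are pytest-xdist's documented options.

```python
# test_math.py: ordinary pytest tests; pytest-xdist parallelizes them unchanged.
import pytest

@pytest.mark.parametrize("n", range(8))
def test_square_nonnegative(n):
    assert n * n >= 0

# Run on 4 worker processes:        pytest -n 4 test_math.py
# Auto-size the worker count:       pytest -n auto test_math.py
# Loop on failures as files change: pytest --looponfail test_math.py
```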