Stars
AI agents running research on single-GPU nanochat training automatically
Nsight Python is a Python kernel profiling interface based on NVIDIA Nsight Tools
CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA te…
Helpful kernel tutorials and examples for tile-based GPU programming
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
Helpful tools and examples for working with flex-attention
Parrot is a C++ library for fused array operations using CUDA/Thrust. It provides efficient GPU-accelerated operations with lazy evaluation semantics, allowing for chaining of operations without un…
Customized matrix multiplication kernels
Simple, portable, and self-contained stacktrace library for C++11 and newer
Header-only C++/Python library for fast approximate nearest neighbors
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
FlashInfer: Kernel Library for LLM Serving
SGLang is a high-performance serving framework for large language models and multimodal models.
Convert PDF to markdown + JSON quickly with high accuracy
A community-maintained Python framework for creating mathematical animations.
Generate audiobooks from e-books, voice cloning & 1158+ languages!
A collection of inspiring lists, manuals, cheatsheets, blogs, hacks, one-liners, cli/web tools and more.
NVIDIA Math Libraries for the Python Ecosystem
llama3 implementation, one matrix multiplication at a time
PyTorch compiler that accelerates training and inference. Get built-in optimizations for performance, memory, parallelism, and easily write your own.
🔥Highlighting the top ML papers every week.
Zero Bubble Pipeline Parallelism
GPU programming-related news and material links
A minimal programming example for a chat server