Starred repositories
Region-level profiling for CUDA kernels with trace, NVBit, CUPTI, NSys, and an interactive Explorer.
A pure-Python implementation of the Nvidia CuTe layout algebra intended to be approachable and easy to learn.
A plug-and-play compiler that delivers free-lunch optimizations for both inference and training.
stdgpu: Efficient STL-like Data Structures on the GPU
An MLIR-based compiler that takes GPU kernels and compiles them to real hardware instructions. Interactive web visualizer included.
A review of automated kernel generation in the era of LLMs
GPU-accelerated Schulze voting method in Python, Numba, CUDA, and Mojo 🔥, using ideas from Algebraic Graph Theory
Nvidia Instruction Set Specification Generator
A collection of study materials for AI compilers and systems.
A chronologically sorted list of influential papers on compiler optimization, from the seminal works of 1952 through the advanced techniques of 1994
A concise explanation of Rust types and memory layout.
🚴 Call stack profiler for Python. Shows you why your code is slow!
Minimal and annotated implementations of key ideas from modern deep learning research.
Distributed Compiler based on Triton for Parallel Systems
A simple calculation for LLM MFU.
Low-Level Programming Roadmap and Resources
Tutorial on building a GPU compiler backend in LLVM
Compiling useful links, papers, benchmarks, ideas, etc.
This is the homepage of a new book entitled "Mathematical Foundations of Reinforcement Learning."
Awesome Reasoning LLM Tutorial/Survey/Guide
This is an online course where you can learn and master the skill of low-level performance analysis and tuning.
Analyze computation-communication overlap in V3/R1.