Starred repositories
Instant neural graphics primitives: lightning fast NeRF and more
The project is an official implementation of our CVPR2019 paper "Deep High-Resolution Representation Learning for Human Pose Estimation"
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.
Tile primitives for speedy kernels
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
GPU Accelerated t-SNE for CUDA with Python bindings
Fully Convolutional Instance-aware Semantic Segmentation
Deformable ConvNets V2 (DCNv2) in PyTorch
Efficient GPU kernels for block-sparse matrix multiplication and convolution
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch
Reference implementation of real-time autoregressive wavenet inference
Distribution-Aware Coordinate Representation for Human Pose Estimation
A GPU implementation of Convolutional Neural Nets in C++
PopSift is an implementation of the SIFT algorithm in CUDA.
A Universal Music Translation Network: a method for translating music across musical instruments and styles.
PyTorch implementation of Deformable Convolution
Official PyTorch code for the CVPR2019 paper "Fast Human Pose Estimation" https://arxiv.org/abs/1811.05419
GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.
Chamfer Distance in PyTorch with f-score
Approximate nearest neighbor search with product quantization on GPU in PyTorch and CUDA
Implementation of fused cosine similarity attention in the same style as Flash Attention
CUDA Matrix Factorization Library with Alternating Least Squares (ALS)
GGNN: State of the Art Graph-based GPU Nearest Neighbor Search