Starred repositories
📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA. 🎉
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
FlashInfer: Kernel Library for LLM Serving
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
How to optimize various algorithms in CUDA.
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
Flash Attention in ~100 lines of CUDA (forward pass only)
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
cuVS - a library for vector search and clustering on the GPU
A static, suckless, single-batch, CUDA-only Qwen3-0.6B mini inference engine.
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Approximate nearest neighbor search with product quantization on GPU, in PyTorch and CUDA.
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…
CUDA implementation of Hierarchical Navigable Small World Graph algorithm
GGNN: State of the Art Graph-based GPU Nearest Neighbor Search
Benchmark code for the "Online normalizer calculation for softmax" paper
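The online-normalizer trick from the "Online normalizer calculation for softmax" paper referenced above computes the max and the normalizing sum in a single pass, rescaling the running sum whenever a new maximum appears. A minimal Python sketch (names are illustrative, not from the benchmark repo):

```python
import math

def online_softmax(xs):
    # Single pass over the input: maintain the running max m and the
    # running normalizer d, rescaling d by exp(m - m_new) whenever the
    # max increases, so no separate max-reduction pass is needed.
    m, d = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Final normalization uses the max for numerical stability.
    return [math.exp(x - m) / d for x in xs]
```

Fusing the two reductions this way is what lets FlashAttention-style kernels stream over keys without materializing the full score row.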
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
ademeure / DeeperGEMM
Forked from deepseek-ai/DeepGEMM. DeeperGEMM: a crazily optimized version.
A High-Throughput Multi-GPU System for Graph-Based Approximate Nearest Neighbor Search
High-Throughput, Cost-Effective Billion-Scale Vector Search with a Single GPU [to appear in SIGMOD'26]
A cross-modal vector index with fast construction on heterogeneous CPU-GPU environment. Published on DaMoN@SIGMOD 2025.
Adamas: Hadamard Sparse Attention for Efficient Long-context Inference