- Hilbert Space
(UTC +08:00) - in/zhwangcs
Starred repositories
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.
How to optimize some algorithms in CUDA.
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
Fast CUDA matrix multiplication from scratch
Flash Attention in ~100 lines of CUDA (forward pass only)
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
cuVS - a library for vector search and clustering on the GPU
Static suckless single batch CUDA-only qwen3-0.6B mini inference engine
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…
Approximate nearest neighbor search with product quantization on GPU in PyTorch and CUDA
CUDA implementation of Hierarchical Navigable Small World Graph algorithm
GGNN: State of the Art Graph-based GPU Nearest Neighbor Search
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
Benchmark code for the "Online normalizer calculation for softmax" paper
ademeure / DeeperGEMM
Forked from deepseek-ai/DeepGEMM
DeeperGEMM: crazy optimized version
High-Throughput, Cost-Effective Billion-Scale Vector Search with a Single GPU [to appear in SIGMOD'26]
A High-Throughput Multi-GPU System for Graph-Based Approximate Nearest Neighbor Search
A cross-modal vector index with fast construction in heterogeneous CPU-GPU environments. Published at DaMoN@SIGMOD 2025.