deepseek-ai / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
cuVS - a library for vector search and clustering on the GPU
Lightning fast differentiable SSIM.
Tile primitives for speedy kernels
Graphics Processing Units Molecular Dynamics
CUDA Kernel Benchmarking Library
LLM training in simple, raw C/CUDA
How to optimize some algorithms in CUDA.
GPU accelerated decision optimization
DeepEP: an efficient expert-parallel communication library
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.