Highlights
- Pro
Stars
4
stars
written in Cuda
Clear filter
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…