Stars
📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA. 🎉
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling (see the group-wise scaling sketch after this list)
FlashInfer: Kernel Library for LLM Serving
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention achieving a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
How to optimize common algorithms in CUDA.
Learn CUDA Programming, published by Packt
Flash Attention in ~100 lines of CUDA (forward pass only)
Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
A simple, high-performance CUDA GEMM implementation.
Step-by-step optimization of CUDA SGEMM (see the tiling sketch after this list)
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
A set of hands-on tutorials for CUDA programming
Distributed MoE in a Single Kernel [NeurIPS '25]
Benchmark code for the "Online normalizer calculation for softmax" paper (see the online-softmax sketch after this list)
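
DeepGEMM's "fine-grained scaling" keeps one scale factor per small group of quantized values instead of one per tensor, which bounds quantization error per group. A minimal sketch of that idea, assuming a group size of 128 and using int8 as a stand-in for FP8; both choices are illustrative and this is not DeepGEMM's actual kernel code:

#include <cstdio>
#include <cstdint>
#include <cmath>
#include <cuda_runtime.h>

constexpr int GROUP = 128;  // illustrative group size, one scale per group

// Dequantize: each GROUP consecutive int8 values share one float scale.
__global__ void dequant_groupwise(const int8_t* q, const float* scales,
                                  float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = q[i] * scales[i / GROUP];
}

int main() {
    const int n = 4 * GROUP;
    int8_t* q; float *scales, *out;
    cudaMallocManaged(&q, n);
    cudaMallocManaged(&scales, (n / GROUP) * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));

    // Quantize a toy signal on the host with one max-abs scale per group.
    float src[n];
    for (int i = 0; i < n; ++i) src[i] = sinf(0.01f * i) * (1 + i / GROUP);
    for (int g = 0; g < n / GROUP; ++g) {
        float amax = 0.0f;
        for (int i = 0; i < GROUP; ++i)
            amax = fmaxf(amax, fabsf(src[g * GROUP + i]));
        scales[g] = amax / 127.0f;
        for (int i = 0; i < GROUP; ++i)
            q[g * GROUP + i] = (int8_t)roundf(src[g * GROUP + i] / scales[g]);
    }

    dequant_groupwise<<<(n + 255) / 256, 256>>>(q, scales, out, n);
    cudaDeviceSynchronize();
    printf("src[5]=%.4f dequant[5]=%.4f\n", src[5], out[5]);
    cudaFree(q); cudaFree(scales); cudaFree(out);
    return 0;
}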
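
The SGEMM optimization walkthroughs above typically start from a naive kernel and move to shared-memory tiling. A minimal sketch of that tiling step, assuming TILE = 16, row-major layout, and toy sizes (none taken from these repos):

#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 16;

// C = A * B for row-major M x K and K x N matrices.
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // March tiles of A and B along K; each thread loads one element of
    // each tile, so global memory traffic drops by a factor of TILE.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

int main() {
    const int M = 256, N = 256, K = 256;
    float *A, *B, *C;
    cudaMallocManaged(&A, M * K * sizeof(float));
    cudaMallocManaged(&B, K * N * sizeof(float));
    cudaMallocManaged(&C, M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) A[i] = 1.0f;
    for (int i = 0; i < K * N; ++i) B[i] = 2.0f;

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    sgemm_tiled<<<grid, block>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();
    printf("C[0] = %.1f (expect %.1f)\n", C[0], 2.0f * K);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}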
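
The online-normalizer trick from that softmax paper (also at the heart of Flash Attention's forward pass) computes the row max and the normalizer in a single pass by rescaling the running sum whenever a new max appears. A minimal sketch, assuming one thread per row for clarity; production kernels parallelize the scan with warp shuffles:

#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void softmax_online(const float* x, float* y, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    const float* xr = x + row * cols;

    // Single pass: running max m and running denominator d; when a new
    // max is found, rescale d by exp(old_max - new_max).
    float m = -INFINITY, d = 0.0f;
    for (int i = 0; i < cols; ++i) {
        float m_new = fmaxf(m, xr[i]);
        d = d * expf(m - m_new) + expf(xr[i] - m_new);
        m = m_new;
    }
    for (int i = 0; i < cols; ++i)
        y[row * cols + i] = expf(xr[i] - m) / d;
}

int main() {
    const int rows = 2, cols = 4;
    float *x, *y;
    cudaMallocManaged(&x, rows * cols * sizeof(float));
    cudaMallocManaged(&y, rows * cols * sizeof(float));
    float host[rows * cols] = {1, 2, 3, 4, -1, 0, 1, 2};
    for (int i = 0; i < rows * cols; ++i) x[i] = host[i];

    softmax_online<<<1, 32>>>(x, y, rows, cols);
    cudaDeviceSynchronize();
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) printf("%.4f ", y[r * cols + c]);
        printf("\n");  // each row should sum to 1
    }
    cudaFree(x); cudaFree(y);
    return 0;
}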