- Shanghai Jiao Tong University
- Ann Arbor, MI
- https://risc-lt.github.io/
- @letianruan
Stars
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
FlashInfer: Kernel Library for LLM Serving
How to optimize some algorithms in CUDA.
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
[NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive