PhD student in Computer Science at TSAIL Group, Tsinghua University, @thu-ml.
Interested in pretraining, optimization, and theory for LLMs.
- @thu-ml, Tsinghua University
- Beijing, China
- https://bingrui-li.github.io/
- @bingruili_
- @bingruil.bsky.social
Stars
7 starred repositories written in CUDA
- DeepEP: an efficient expert-parallel communication library
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
- This package contains the original 2012 AlexNet code.
- [ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
- Flash Attention in ~100 lines of CUDA (forward pass only)
- [ICML2025] SpargeAttention: a training-free sparse attention that accelerates inference for any model.
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS