- Tsinghua University
- Beijing, China
- https://jason-huang03.github.io/
Stars
Accelerating MoE with IO- and Tile-aware Optimizations
Boosting 4-bit inference kernels with 2:4 Sparsity
BitBLAS is a library for mixed-precision matrix multiplications, especially for quantized LLM deployment.
A Triton-only attention backend for vLLM
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
Helpful kernel tutorials and examples for tile-based GPU programming
FFPA: extends FlashAttention-2 with Split-D for ~O(1) SRAM complexity at large head dimensions, 1.8x~3x faster than SDPA EA.
torchcomms: a modern PyTorch communications API
Efficient Triton implementation of Native Sparse Attention.
Efficient implementations of Native Sparse Attention
[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models, including LLMs, VLMs, and video generation models.
Development repository for the Triton language and compiler
Proposed solutions to the exercises from Terence Tao's textbooks Analysis I & II. Mirrored from https://gitlab.com/f-santos/taoanalysissolutions
ComputeEval: a framework for generating and evaluating CUDA code from Large Language Models.
NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer
MSCCL++: A GPU-driven communication stack for scalable AI applications
slime is an LLM post-training framework for RL Scaling.
[NeurIPS 2025] An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL
Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.