Stars
Microbenchmarking hyperparameter tuning for JAX functions.
GPUGrants - a list of GPU grants that I can think of
Automated High-Performance GPU Kernel Generation
SGLang is a high-performance serving framework for large language models and multimodal models.
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven).
A PyTorch-native inference engine with cache, parallelism, quantization, and CPU offload for DiTs.
DFlash: Block Diffusion for Flash Speculative Decoding
Artifact for "Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs" [arXiv '25]
A curriculum for learning about GPU performance engineering, from scratch to what the frontier AI labs do.
How to optimize some algorithms in CUDA.
System Intelligence Benchmark
Efficient Long-context Language Model Training by Core Attention Disaggregation
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
Accelerate inference without tears
TPU inference for vLLM, with unified JAX and PyTorch support.
🤘 TT-NN operator library, and TT-Metalium low-level kernel programming model.
Universal LLM Deployment Engine with ML Compilation