Stars
DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference
A library of GPU kernels for sparse matrix operations.
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
Helpful kernel tutorials and examples for tile-based GPU programming
FlashRAG: A Python Toolkit for Efficient RAG Research (WWW 2025 Resource)
Accelerating MoE with IO and Tile-aware Optimizations
[ASPLOS'26] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
Trainable fast and memory-efficient sparse attention
Efficient implementations of Native Sparse Attention
fanshiqing / grouped_gemm
Forked from tgale96/grouped_gemm. PyTorch bindings for CUTLASS grouped GEMM.
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
Official PyTorch Implementation of "Diffusion Transformers with Representation Autoencoders"
StreamingVLM: Real-Time Understanding for Infinite Video Streams
A framework for serving and evaluating LLM routers - save LLM costs without compromising quality
Modeling, training, eval, and inference code for OLMo
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
[CVPR 2025 Oral] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
A lightweight Inference Engine built for block diffusion models
VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction
Implementation of FP8/INT8 rollout for RL training without performance drop.
Paper reading and discussion notes, covering AI frameworks, distributed systems, cluster management, etc.
Research prototype of PRISM, a cost-efficient multi-LLM serving system with flexible time- and space-based GPU sharing.