Stars
Winner 🏆 (agent-only) of the MLSys 2026 FlashInfer AI Kernel Generation Contest, DeepSeek Sparse Attention (DSA) track, with an average speedup of 34.93x
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
Official implementation of “Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding”.
DeepEP: an efficient expert-parallel communication library
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
slime is an LLM post-training framework for RL Scaling.
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
RLinf: Reinforcement Learning Infrastructure for Embodied and Agentic AI
mKernel: fast multi-node, multi-GPU fused kernels
The RL Bridge for LLM-based Agent Applications. Made Simple & Flexible.
Distributed Compiler based on Triton for Parallel Systems
CUDA Templates and Python DSLs for High-Performance Linear Algebra
A framework for efficient model inference with omni-modality models
Open-source continuous inference benchmark research platform: Kimi K2.6, DeepSeekv4, GLM5 on GB200 NVL72 vs MI355X vs B200 vs GB300 NVL72, and soon™ TPUv6e/v7/Trainium2/3
Official code base for LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
Lightweight coding agent that runs in your terminal
DeepSeek 4 Flash and PRO local inference engine for Metal, CUDA and ROCm
TokenSpeed is a speed-of-light LLM inference engine.
Turn any website into a CLI and let an AI agent use your logged-in browser.
Fast LLM speculative inference server for consumer hardware.
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
High-performance linear attention kernel library built on TileLang
Fully featured mobile and web client for Codex and Claude Code, with realtime voice and encryption