Inferact
- SF
- https://yifanqiao.com
Stars
Machine Learning Engineering Open Book
FlashSampling: Fast and Memory-Efficient Exact Sampling (https://huggingface.co/papers/2603.15854)
Sardeenz is a proof-of-concept application for loading multiple models on a single GPU, adding models until the GPU is fully utilized.
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
Helpful kernel tutorials and examples for tile-based GPU programming
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
A framework for efficient model inference with omni-modality models
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
Advancing the frontier of efficient AI
RouterArena: An open framework for evaluating LLM routers, with standardized datasets, metrics, an automated evaluation pipeline, and a live leaderboard.
Context7 Platform: Up-to-date code documentation for LLMs and AI code editors
[ICML 2025, NeurIPS 2025 Spotlight] Sparse VideoGen 1 & 2: Accelerating Video Diffusion Transformers with Sparse Attention
The simplest, fastest repository for training/finetuning medium-sized GPTs.
TPU inference for vLLM, with unified JAX and PyTorch support.
A scheduling framework for multitasking over diverse XPUs, including GPUs, NPUs, ASICs, and FPGAs
Fast and memory-efficient exact kmeans
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores >74% on SWE-bench verified!
The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Puzzles for learning Triton
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
[MLSys 2026] RAG on Everything with LEANN. Enjoy 97% storage savings while running a fast, accurate, and 100% private RAG application on your personal device.
Research prototype of PRISM — a cost-efficient multi-LLM serving system with flexible time- and space-based GPU sharing.
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
SkyRL: A Modular Full-stack RL Library for LLMs
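The "exact kmeans" entry above refers to clustering computed without approximation, i.e. Lloyd's algorithm run to convergence. As a generic illustration only (this is a minimal sketch of the standard algorithm, not the repo's fast, memory-efficient implementation; all names here are hypothetical):

```python
# Minimal pure-Python sketch of Lloyd's exact k-means.
# Naive O(n*k) assignment per iteration; real implementations
# vectorize and batch this for speed and memory efficiency.

def kmeans(points, k, iters=100):
    """Cluster 2-D points into k groups; returns (centroids, labels)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2
                            + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                cx = sum(x for x, _ in members) / len(members)
                cy = sum(y for _, y in members) / len(members)
                new_centroids.append([cx, cy])
            else:
                new_centroids.append(centroids[c])  # keep empty cluster fixed
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
cents, labs = kmeans(pts, 2)
```

Here the two tight groups of points separate into two clusters, and each centroid lands on the mean of its group.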