- Redmond, WA
Stars
FlashMLA: Efficient Multi-head Latent Attention Kernels
FlashInfer Bench @ MLSys 2026: Building AI agents to write high-performance GPU kernels
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
Anthropic's original performance take-home, now open for you to try!
Our first fully AI-generated deep learning system
Fast and memory-efficient exact attention
Autonomous GPU Kernel Generation & Optimization via Deep Agents
CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA te…
Building the Virtuous Cycle for AI-driven LLM Systems
KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)
A Datacenter Scale Distributed Inference Serving Framework
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
kernelboard is the webapp for https://www.gpumode.com
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS - GB200 NVL72 vs MI355X vs B200 vs GB300 NVL72 vs H100 & soon™ TPUv6e/v7/Trainium2/3
ROCm / flashinfer
Forked from flashinfer-ai/flashinfer
FlashInfer+ROCm: ROCm port of FlashInfer
Terraform module for scalable GitHub action runners on AWS
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…
Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.
Distributed Compiler based on Triton for Parallel Systems
An implementation of a deep learning recommendation model (DLRM)
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
verl: Volcano Engine Reinforcement Learning for LLMs