University of Washington
Seattle, WA
Stars
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA te…
SkyRL: A Modular Full-stack RL Library for LLMs
DeepEP: an efficient expert-parallel communication library
Trace Anything: Representing Any Video in 4D via Trajectory Fields
Post-training with Tinker
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
Lightweight coding agent that runs in your terminal
Optimized primitives for collective multi-GPU communication
[NSDI25] AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
MSCCL++: A GPU-driven communication stack for scalable AI applications
slime is an LLM post-training framework for RL Scaling.
Code repo for efficient quantized MoE inference with mixture of low-rank compensators
A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.
verl: Volcano Engine Reinforcement Learning for LLMs
Fast and memory-efficient exact attention
Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
A minimal demo of PyTorch distributed extension functionality for collectives (a sketch of such a collective call follows this list).
A high-throughput and memory-efficient inference and serving engine for LLMs
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Fully open reproduction of DeepSeek-R1
Robust Speech Recognition via Large-Scale Weak Supervision
Transformer: PyTorch Implementation of "Attention Is All You Need"
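The sketch below illustrates the kind of collective demo referenced in the PyTorch distributed entry above: a two-process all-reduce using `torch.distributed` with the `gloo` backend. It is a minimal, self-contained example assuming a single CPU-only machine; the worker function name and the master address/port values are illustrative, not taken from that repository.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run_worker(rank: int, world_size: int) -> None:
    # Each worker joins the same process group; gloo works on CPU-only machines.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Every rank contributes a tensor; all_reduce sums them in place on all ranks.
    t = torch.ones(4) * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```

Running the script prints `[3.0, 3.0, 3.0, 3.0]` on both ranks, since rank 0 contributes ones and rank 1 contributes twos; swapping `"gloo"` for `"nccl"` and moving the tensor to a GPU gives the multi-GPU version of the same pattern.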