Starred repositories
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
FlashInfer: Kernel Library for LLM Serving
Supercharge Your LLM with the Fastest KV Cache Layer
Quilt is a serverless optimizer that automatically merges workflows consisting of many functions (possibly in different languages) into one process, thereby avoiding high invocation latency, commu…
This is a public version of LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
Achieve state-of-the-art inference performance with modern accelerators on Kubernetes
TokenSim is a tool for simulating the behavior of large language models (LLMs) in a distributed environment.
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024
Serverless LLM Serving for Everyone.
Large Language Model Text Generation Inference
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long)
Artifact for "Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving" [SOSP '24]
EE-LLM is a framework for large-scale training and inference of early-exit (EE) large language models (LLMs).
A large-scale simulation framework for LLM inference
A high-throughput and memory-efficient inference and serving engine for LLMs