Stars
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
Accelerate inference without tears
TPU inference for vLLM, with unified JAX and PyTorch support.
🤘 TT-NN operator library and TT-Metalium low-level kernel programming model.
Universal LLM Deployment Engine with ML Compilation
Efficient Triton Kernels for LLM Training
Open-source search and retrieval database for AI applications.
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM gateway, observability, optimization, evaluation, and experimentation.
Efficient platform for inference and serving local LLMs, including an OpenAI-compatible API server (see the client sketch after this list).
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
A minimal tensor processing unit (TPU), inspired by Google's TPU v1 and v2
A Datacenter-Scale Distributed Inference Serving Framework
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
The easiest way to serve AI apps and models: build model inference APIs, job queues, LLM apps, multi-model pipelines, and more!
A high-throughput and memory-efficient inference and serving engine for LLMs
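Several engines in this list (vLLM, SGLang, LMDeploy) offer an offline Python API alongside their servers. Below is a minimal sketch using vLLM's documented LLM/SamplingParams interface; the model name is a placeholder, and any HF-format causal LM would work:

```python
# Minimal offline-inference sketch with vLLM's Python API.
# The model name is a placeholder; weights are downloaded on first run.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() takes a list of prompts and returns one RequestOutput per prompt.
outputs = llm.generate(["The key idea behind paged attention is"], params)
for out in outputs:
    print(out.outputs[0].text)
```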
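Many of these servers, including the local-LLM platform above, can expose an OpenAI-compatible endpoint, so the stock openai client works against them unchanged. A sketch assuming a server is already listening on localhost:8000; the base URL, model name, and API key are placeholders:

```python
# Talk to any OpenAI-compatible local server (vLLM, LMDeploy, etc.).
# Assumes a server is already running on localhost:8000; the model name
# and API key are placeholders that local servers typically ignore.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Summarize paged attention in one line."}],
)
print(resp.choices[0].message.content)
```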