Inference engines
FlexGen: Running large language models on a single GPU for throughput-oriented scenarios.
CTranslate2: Fast inference engine for Transformer models.
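A minimal sketch of CTranslate2's Python generation API, assuming this entry refers to OpenNMT's CTranslate2; the model choice and output directory are illustrative, and the checkpoint must first be converted with the `ct2-transformers-converter` tool.

```python
# Convert a Hugging Face checkpoint first (illustrative model choice):
#   ct2-transformers-converter --model facebook/opt-125m --output_dir opt-125m-ct2
import ctranslate2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
generator = ctranslate2.Generator("opt-125m-ct2", device="cpu")

# CTranslate2 operates on token strings, not raw text, so tokenize externally.
prompt_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello, my name is"))
results = generator.generate_batch([prompt_tokens], max_length=32)
print(tokenizer.decode(results[0].sequences_ids[0]))
```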
Hugging Face Optimum: 🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy-to-use hardware optimization tools.
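As a sketch of what those optimization tools look like in practice, assuming this refers to the `optimum` package: the snippet below exports a Transformers model to ONNX through Optimum's ONNX Runtime integration. The model name is an illustrative choice.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative model

# export=True converts the PyTorch checkpoint to ONNX on the fly,
# so inference runs through ONNX Runtime instead of PyTorch.
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Exporting to ONNX was painless."))
```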
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
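A minimal sketch of that Python API, based on TensorRT-LLM's high-level LLM interface; the model name is illustrative and an NVIDIA GPU is assumed.

```python
from tensorrt_llm import LLM, SamplingParams

# The LLM class builds/loads a TensorRT engine for the model under the hood.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # illustrative model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)
```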
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs.
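A minimal offline-inference sketch with vLLM's LLM class; the model name is an illustrative small checkpoint.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative small model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches the prompts and serves them from vLLM's paged KV cache,
# which is where the throughput and memory efficiency come from.
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)
```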
SGLang is a fast serving framework for large language models and vision language models.
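A sketch of serving a model with SGLang and querying its OpenAI-compatible endpoint; the model name and port are illustrative.

```python
# Start the server first (per SGLang's docs):
#   python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize what a serving framework does."}],
)
print(resp.choices[0].message.content)
```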
Ollama: Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.
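Getting up and running typically means pulling a model with the Ollama CLI and then calling the local server; a sketch using the official `ollama` Python client, with an illustrative model tag.

```python
# Prerequisite (shell): ollama pull gemma3
import ollama

resp = ollama.chat(
    model="gemma3",  # illustrative tag; any pulled model works
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(resp["message"]["content"])
```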
Checkpoint-engine is simple middleware to update model weights in LLM inference engines.