A high-throughput and memory-efficient inference and serving engine for LLMs
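vLLM's memory efficiency comes largely from PagedAttention, which stores the KV cache in fixed-size blocks allocated on demand rather than reserving one contiguous region per sequence at maximum length. A minimal toy allocator illustrating the idea (block size, class names, and pool size are illustrative, not vLLM's actual code):

```python
# Toy sketch of paged KV-cache allocation (illustrative only, not vLLM code).
# Each sequence maps logical block indices to physical fixed-size blocks,
# so memory grows with the tokens actually generated.

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class ToyBlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class ToySequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = ToyBlockAllocator(num_blocks=8)
seq = ToySequence(allocator)
for _ in range(20):               # 20 tokens -> ceil(20/16) = 2 blocks
    seq.append_token()
print(len(seq.block_table))       # 2
print(len(allocator.free))        # 6 blocks still free for other sequences
```

Because blocks are only claimed as tokens arrive, many sequences share one physical pool, which is what lets a paged design batch far more requests than contiguous per-sequence reservations would allow.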
SGLang is a fast serving framework for large language models and vision language models.
TensorRT LLM provides an easy-to-use Python API for defining Large Language Models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also contains components for building Python and C++ runtimes that orchestrate inference execution performantly.
Parallax is a distributed model serving framework that lets you build your own AI cluster anywhere
Cross-platform FlashAttention-2 Triton implementation for Turing+ with custom configuration mode
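FlashAttention-2's core trick is processing attention scores in tiles while maintaining a running maximum and running normalizer, so the full score matrix never has to materialize in memory. A pure-Python sketch of that online softmax recurrence (tile size and function name are illustrative; a real Triton kernel would do this per-tile on GPU):

```python
import math

def online_softmax(scores, tile=4):
    """Numerically stable softmax computed tile-by-tile, keeping only a
    running max (m) and running normalizer (s), as in FlashAttention."""
    m = float("-inf")  # running max over tiles seen so far
    s = 0.0            # running sum of exp(score - m)
    for i in range(0, len(scores), tile):
        block = scores[i:i + tile]
        m_new = max(m, max(block))
        # Rescale the old sum to the new max, then add this tile's terms.
        s = s * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in block)
        m = m_new
    # One final pass normalizes with the converged statistics.
    return [math.exp(x - m) / s for x in scores]

probs = online_softmax([0.1, 2.0, -1.0, 3.0, 0.5, 1.5])
print(round(sum(probs), 6))  # 1.0
```

The rescaling step (`s * exp(m - m_new)`) is what makes the tiled result exactly match a one-shot softmax, which is why FlashAttention is a memory optimization rather than an approximation.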
🚀 Build and explore OpenAI's GPT-OSS model from scratch in Python, unlocking the mechanics of large language models.