Stars
A high-throughput and memory-efficient inference and serving engine for LLMs
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
SGLang is a high-performance serving framework for large language models and multimodal models.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
My learning notes for MLSys (machine learning systems).
A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels.
FlashInfer: Kernel Library for LLM Serving
Performance-optimized AI inference on your GPUs. Unlock superior throughput by selecting and tuning engines like vLLM or SGLang.
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
A step-by-step analysis of the LevelDB source code.
Genai-bench is a powerful benchmarking tool designed for comprehensive, token-level performance evaluation of large language model (LLM) serving systems.
[ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
This repo releases the detailed benchmark code and results of Sea Labs AI.