Starred repositories
DeepEP: an efficient expert-parallel communication library
mKernel: fast multi-node, multi-GPU fused kernels
Tile primitives for speedy kernels
A curated list of the best CUDA programming books
Machine Learning Engineering Open Book
Module, Model, and Tensor Serialization/Deserialization
📚 A curated list of awesome LLM/VLM inference papers with code: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc. 🎉
SkyRL: A Modular Full-stack RL Library for LLMs
LLMPerf is a library for validating and benchmarking LLMs
Manages Unified Access to Generative AI Services built on Envoy Gateway
Inference server benchmarking tool
Fluid, elastic data abstraction and acceleration for BigData/AI applications in the cloud. (Project under CNCF)
Using CRDs to manage GPU resources in Kubernetes.
A GPU cluster manager that configures and orchestrates inference engines like vLLM and SGLang for high-performance AI model deployment.
AI on GKE is a collection of examples, best practices, and prebuilt solutions to help build, deploy, and scale AI platforms on Google Kubernetes Engine.
Cloud Native Benchmarking of Foundation Models
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
Open source AI coding agent. Designed for large projects and real world tasks.
A CLI inspector for the Model Context Protocol
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs