andyxning
🎯
Focusing
  • Beijing, China

Organizations

@nsqio @kubernetes

Starred repositories

DeepEP: an efficient expert-parallel communication library

Cuda 9,730 1,284 Updated Jun 11, 2026

mKernel: fast multi-node, multi-GPU fused kernels

Cuda 231 22 Updated Jun 8, 2026

Tile primitives for speedy kernels

Cuda 3,429 295 Updated May 27, 2026

A curated list of the best CUDA programming books

911 29 Updated May 19, 2026

Machine Learning Engineering Open Book

Python 18,113 1,150 Updated May 18, 2026

Module, Model, and Tensor Serialization/Deserialization

Python 313 52 Updated Apr 30, 2026

Benchmark suite for LLMs from Fireworks.ai

Python 105 39 Updated Jun 11, 2026

High Performance LLM Inference Operator Library

C++ 935 96 Updated Jun 11, 2026

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

Python 5,314 395 Updated Apr 20, 2026

SkyRL: A Modular Full-stack RL Library for LLMs

Python 1,996 351 Updated Jun 14, 2026

LLMPerf is a library for validating and benchmarking LLMs

Python 1,119 205 Updated Dec 9, 2024

Manages Unified Access to Generative AI Services built on Envoy Gateway

Go 1,749 273 Updated Jun 14, 2026

C++ 543 45 Updated Apr 1, 2026

Inference server benchmarking tool

Rust 162 32 Updated Jun 9, 2026

Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)

Go 1,935 1,250 Updated Jun 14, 2026

Using CRDs to manage GPU resources in Kubernetes.

Go 211 29 Updated Nov 21, 2022

A GPU cluster manager that configures and orchestrates inference engines like vLLM and SGLang for high-performance AI model deployment.

Python 5,155 546 Updated Jun 14, 2026

Heterogeneous GPU Sharing on Kubernetes

Go 3,564 576 Updated Jun 12, 2026

AI on GKE is a collection of examples, best-practices, and prebuilt solutions to help build, deploy, and scale AI Platforms on Google Kubernetes Engine

Jupyter Notebook 328 243 Updated Jun 23, 2025

Cloud Native Benchmarking of Foundation Models

Python 45 20 Updated Jul 31, 2025

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

C 1,385 189 Updated Jun 13, 2026

LLM KV cache compression made easy

Python 1,114 153 Updated Jun 10, 2026

Open source AI coding agent. Designed for large projects and real world tasks.

Go 15,453 1,142 Updated Oct 3, 2025

A CLI inspector for the Model Context Protocol

JavaScript 439 38 Updated Jun 8, 2026

Serving multiple LoRA finetuned LLM as one

Python 1,159 63 Updated May 8, 2024

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Python 1,912 124 Updated Jan 21, 2024

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Python 4,119 333 Updated Jun 14, 2026

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Python 3,795 319 Updated May 28, 2026