Stars
CLI proxy that reduces LLM token consumption by 60-90% on common dev commands. Single Rust binary, zero dependencies.
Open Source AI Platform - AI Chat with advanced features that works with every LLM
LMCache: Supercharge Your LLM with the Fastest KV Cache Layer
DSPy: The framework for programming—not prompting—language models
ARIS ⚔️ (Auto-Research-In-Sleep) — Lightweight Markdown-only skills for autonomous ML research: cross-model review loops, idea discovery, and experiment automation. No framework, no lock-in — works…
Offline optimization of your disaggregated Dynamo graph
A Datacenter Scale Distributed Inference Serving Framework
Distributed MoE in a Single Kernel [NeurIPS '25]
High-Performance KV Cache Storage Engine on CXL Shared Memory for LLM Inference
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
A collection of prompts, system prompts and LLM instructions
Extracted system prompts from Anthropic - Claude Fable 5, Opus 4.8, Claude Code, Claude Design. OpenAI - ChatGPT 5.5 Thinking, GPT 5.5 Instant, Codex. Google - Gemini 3.5 Flash, 3.1 Pro, Antigravit…
Causal depthwise conv1d in CUDA, with a PyTorch interface
🚀 Efficient implementations for emerging model architectures
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Async RL)
TPU inference for vLLM, with unified JAX and PyTorch support.
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)
A curriculum for learning about gpu performance engineering, from scratch to what the frontier AI labs do
DeepEP: an efficient expert-parallel communication library
An Open-Source Asynchronous Coding Agent
LLM-powered multiagent persona simulation for imagination enhancement and business insights.
ToolOrchestra is an end-to-end RL training framework for orchestrating tools and agentic workflows.
ArcticInference: vLLM plugin for high-throughput, low-latency inference
Load Balancer Implementation for Kubernetes in Bare-Metal, Edge, and Virtualization
AWS Neuron Deep Learning Containers (DLCs) are a set of Docker images for training and serving models on AWS Trainium and Inferentia instances using AWS Neuron SDK.