Stars
Early-stage Rust drop-in alternative frontend for vLLM
FlashInfer: Kernel Library for LLM Serving
Notes on AI infrastructure, inference systems, and engineering trade-offs.
An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
Bridge local AI coding agents (Claude Code, Cursor, Gemini CLI, Codex) to messaging platforms (Feishu/Lark, DingTalk, Slack, Telegram, Discord, LINE, WeChat Work). Chat with your AI dev assistant f…
Agent2Agent (A2A) is an open protocol enabling communication and interoperability between opaque agentic applications.
Synchronizing Claude Code conversations across machines
A smarter cd command. Supports all major shells.
Ming-omni-tts: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control
vLLM Model plugin for the encoder-decoder BART model
MOSS-TTS Family is an open-source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high-fidelity, high-expressiveness, and complex real-world scenario…
Qwen-Image is a powerful image generation foundation model capable of complex text rendering and precise image editing.
DFlash: Block Diffusion for Flash Speculative Decoding
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
A debugging and profiling tool that can trace and visualize python code execution
Community maintained hardware plugin for vLLM on Apple Silicon
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
A high-performance, lightweight router for large-scale vLLM deployments
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
A PyTorch-native inference engine with caching, parallelism, and quantization for Diffusion Transformers.
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
A framework for efficient inference with omni-modal models
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
An early research stage expert-parallel load balancer for MoE models based on linear programming.