Stars
A plug-and-play compiler that delivers free-lunch optimizations for both inference and training.
A pure-Python implementation of the Nvidia CuTe layout algebra intended to be approachable and easy to learn.
Large Language Model Text Generation Inference
Learning notes and hands-on experiments for understanding modern Machine Learning Systems.
Free, Local CharacterAI with inference on Apple Silicon and ESP32 WebSocket transport
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
Qwen3-TTS with nano vLLM-style optimizations for fast text-to-speech generation. Achieves 3x faster generation.
SGLang Omni: High-Performance Multi-Stage Pipeline Framework for Omni Models
Claude Code skills that turn any codebase into an interactive knowledge graph you can explore, search, and ask questions about (multi-platform: e.g., Codex is supported).
An LLM inference engine that runs on consumer hardware
Complete solutions to Programming Massively Parallel Processors, 4th Edition
Dashboard for InferenceX™, Open Source Continuous Inference
A clean, single-file PyTorch implementation of Attention Residuals (Kimi Team, MoonshotAI, 2026), integrated with Grouped Query Attention (GQA), SwiGLU feed-forward networks, and Rotary Position Em…
A fast, helpful, and open-source document parser
Give your agents the power of the Hugging Face ecosystem
A high-performance inference system for large language models, designed for production environments.
MimikaStudio - A local-first application for macOS (Apple Silicon) + Agentic MCP Support
Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancing, cluster mode, guardrails, support for 1000+ models & <100 µs overhead at 5k RPS.
Community maintained hardware plugin for vLLM on Intel Gaudi
A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM
Fully autonomous & self-evolving research from idea to paper. Chat an Idea. Get a Paper. 🦞
Our clone of Orca, used for experimentation
A lightweight, efficient transformer inference engine written in Rust. MiniLLM provides a clean, well-documented implementation of GPT-2 style transformer models with support for text generation.