A TurboQuant LLM inference server.
Most LLM serving stacks force a trade-off between features and resource usage. inferrs targets both:
| | inferrs | vLLM | llama.cpp |
|---|---|---|---|
| Language | Rust | Python/C++ | C/C++ |
| Streaming (SSE) | ✓ | ✓ | ✓ |
| KV cache management | TurboQuant, Per-context alloc, PagedAttention | PagedAttention | Per-context alloc |
| Memory friendly | ✓ — lightweight | ✗ — pre-allocates most GPU memory by default | ✓ — lightweight |
| Binary footprint | Single binary | Python environment + deps | Single binary |
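Paged KV-cache allocation (the idea behind PagedAttention, listed in the table above) splits each sequence's key/value cache into fixed-size blocks drawn from a shared pool, so memory is claimed on demand instead of reserved per context. A conceptual sketch in Python — the class, block size, and method names are illustrative assumptions, not inferrs internals:

```python
# Conceptual sketch of paged KV-cache block allocation.
# A shared free list of fixed-size physical blocks serves all sequences;
# each sequence maps logical block indices to physical block ids lazily.

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks

    def append_token(self, seq_id: int, pos: int) -> int:
        """Return the physical block holding token `pos`, allocating on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        logical = pos // BLOCK_SIZE
        if logical == len(table):            # first token of a new block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        return table[logical]

    def release(self, seq_id: int) -> None:
        """Sequence finished: return its blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=4)
for pos in range(20):                        # 20 tokens span 2 blocks of 16
    cache.append_token(seq_id=0, pos=pos)
print(len(cache.block_tables[0]))            # → 2
cache.release(0)
print(len(cache.free_blocks))                # → 4
```

The per-context alternative in the table reserves the full context length up front; paging trades that reservation for a small block table per sequence.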
- OpenAI-compatible API — `/v1/completions`, `/v1/chat/completions`, `/v1/models`, `/health`
- Anthropic-compatible API — `/v1/messages` (streaming and non-streaming)
- Ollama-compatible API — `/api/generate`, `/api/chat`, `/api/tags`, `/api/ps`, `/api/show`, `/api/version`
- Hardware backends — CUDA, ROCm, Metal, Hexagon, OpenVINO, MUSA, CANN, Vulkan, and CPU
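Because the endpoints follow the OpenAI wire format, any OpenAI-style client can talk to the server by pointing its base URL at it. A hedged sketch of the request body for `/v1/chat/completions` — the port and host are assumptions, and the model id is the one used in the examples below:

```python
import json

# Build an OpenAI-style chat completion request body.
# The endpoint path comes from the API list above; host/port are assumptions.
payload = {
    "model": "google/gemma-4-E2B-it",
    "messages": [
        {"role": "user", "content": "Write a haiku about Rust."},
    ],
    "stream": True,  # request SSE streaming, which the server supports
}

body = json.dumps(payload)
# POST `body` to http://localhost:8080/v1/chat/completions with
# Content-Type: application/json. With "stream": true, the response
# arrives as SSE "data: {...}" lines terminated by "data: [DONE]".
print(json.loads(body)["model"])  # → google/gemma-4-E2B-it
```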
macOS / Linux:

```shell
brew tap ericcurtin/inferrs
brew install inferrs
```

Windows:

```shell
scoop bucket add inferrs https://github.com/ericcurtin/scoop-inferrs
scoop install inferrs
```

Run a model:

```shell
inferrs run google/gemma-4-E2B-it
```

Serve a model:

```shell
inferrs serve google/gemma-4-E2B-it
```

Serve a model with PagedAttention KV-cache management:

```shell
inferrs serve --paged-attention google/gemma-4-E2B-it
```

Start the server without specifying a model:

```shell
inferrs serve
```
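With `inferrs serve` running, streamed responses use standard SSE framing as in the OpenAI API: each event is a `data: ` line carrying a JSON chunk, and the stream ends with `data: [DONE]`. A minimal client-side parser sketch — the chunk shape shown is the standard OpenAI delta format, assumed here to match what the server emits:

```python
import json

def collect_stream(lines):
    """Accumulate assistant text from OpenAI-style SSE 'data:' lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip blank keep-alives / comments
        data = line[len("data: "):]
        if data == "[DONE]":              # end-of-stream sentinel
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Simulated SSE lines, shaped like an OpenAI streaming response:
events = [
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":", world"}}]}',
    "data: [DONE]",
]
print(collect_stream(events))  # → Hello, world
```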