A TurboQuant LLM inference server.
Most LLM serving stacks force a trade-off between features and resource usage. inferrs targets both:
| | inferrs | vLLM | llama.cpp |
|---|---|---|---|
| Language | Rust | Python/C++ | C/C++ |
| Streaming (SSE) | ✓ | ✓ | ✓ |
| KV cache management | TurboQuant, Per-context alloc, PagedAttention | PagedAttention | Per-context alloc |
| Memory friendly | ✓ — lightweight | ✗ — pre-allocates most GPU memory by default | ✓ — lightweight |
| Binary footprint | Single binary | Python environment + deps | Single binary |
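Paged KV-cache allocation (the idea behind PagedAttention, listed in the table above) splits each sequence's key/value cache into fixed-size blocks drawn from a shared pool, so memory is claimed on demand instead of reserved per context. A conceptual sketch in Python — the class, block size, and method names are illustrative assumptions, not inferrs internals:

```python
# Conceptual sketch of paged KV-cache block allocation.
# A shared free list of fixed-size physical blocks serves all sequences;
# each sequence maps logical block indices to physical block ids lazily.

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks

    def append_token(self, seq_id: int, pos: int) -> int:
        """Return the physical block holding token `pos`, allocating on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        logical = pos // BLOCK_SIZE
        if logical == len(table):            # first token of a new block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        return table[logical]

    def release(self, seq_id: int) -> None:
        """Sequence finished: return its blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=4)
for pos in range(20):                        # 20 tokens span 2 blocks of 16
    cache.append_token(seq_id=0, pos=pos)
print(len(cache.block_tables[0]))            # → 2
cache.release(0)
print(len(cache.free_blocks))                # → 4
```

The per-context alternative in the table reserves the full context length up front; paging trades that reservation for a small block table per sequence.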
- OpenAI-compatible API — `/v1/completions`, `/v1/chat/completions`, `/v1/models`, `/health`
- Anthropic-compatible API — `/v1/messages` (streaming and non-streaming)
- Ollama-compatible API — `/api/generate`, `/api/chat`, `/api/tags`, `/api/ps`, `/api/show`, `/api/version`
- Hardware backends — CUDA, ROCm, Metal, Hexagon, OpenVINO, MUSA, CANN, Vulkan, and CPU
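Because the endpoints follow the OpenAI wire format, any OpenAI-style client can talk to the server by pointing its base URL at it. A hedged sketch of the request body for `/v1/chat/completions` — the port and host are assumptions, and the model id is the one used in the examples below:

```python
import json

# Build an OpenAI-style chat completion request body.
# The endpoint path comes from the API list above; host/port are assumptions.
payload = {
    "model": "google/gemma-4-E2B-it",
    "messages": [
        {"role": "user", "content": "Write a haiku about Rust."},
    ],
    "stream": True,  # request SSE streaming, which the server supports
}

body = json.dumps(payload)
# POST `body` to http://localhost:8080/v1/chat/completions with
# Content-Type: application/json. With "stream": true, the response
# arrives as SSE "data: {...}" lines terminated by "data: [DONE]".
print(json.loads(body)["model"])  # → google/gemma-4-E2B-it
```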
macOS / Linux:

```shell
brew tap ericcurtin/inferrs
brew install inferrs
```

Windows:

```shell
scoop bucket add inferrs https://github.com/ericcurtin/scoop-inferrs
scoop install inferrs
```

Run a model:

```shell
inferrs run google/gemma-4-E2B-it
```

Serve a model:

```shell
inferrs serve google/gemma-4-E2B-it
```

Serve a model with PagedAttention KV-cache management:

```shell
inferrs serve --paged-attention google/gemma-4-E2B-it
```

Start the server without specifying a model:

```shell
inferrs serve
```
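With `inferrs serve` running, streamed responses use standard SSE framing as in the OpenAI API: each event is a `data: ` line carrying a JSON chunk, and the stream ends with `data: [DONE]`. A minimal client-side parser sketch — the chunk shape shown is the standard OpenAI delta format, assumed here to match what the server emits:

```python
import json

def collect_stream(lines):
    """Accumulate assistant text from OpenAI-style SSE 'data:' lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip blank keep-alives / comments
        data = line[len("data: "):]
        if data == "[DONE]":              # end-of-stream sentinel
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Simulated SSE lines, shaped like an OpenAI streaming response:
events = [
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":", world"}}]}',
    "data: [DONE]",
]
print(collect_stream(events))  # → Hello, world
```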