A local LLM inference engine written from scratch in Rust, with hand-written SIMD kernels and GPU compute shaders. No GGML, no llama.cpp — every matrix multiply, attention kernel, and quantization routine is implemented directly.
| Model | Params | Tested |
|---|---|---|
| Qwen/Qwen3-0.6B | 0.6B | yes |
| Qwen/Qwen3-1.7B | 1.7B | yes |
| Qwen/Qwen3-4B | 4B | yes |
| Qwen/Qwen3-8B | 8B | yes |
| Qwen/Qwen3-14B | 14B | yes |
| Qwen/Qwen3-32B | 32B | yes |
| Model | Params | Active | Tested |
|---|---|---|---|
| Qwen/Qwen3-30B-A3B | 30B | 3B | yes |
| Qwen/Qwen3-235B-A22B | 235B | 22B | - |
| Model | Params | Tested |
|---|---|---|
| Qwen/Qwen3-VL-2B-Instruct | 2B | yes |
| Qwen/Qwen3-VL-4B-Instruct | 4B | yes |
| Qwen/Qwen3-VL-8B-Instruct | 8B | yes |
| Qwen/Qwen3-VL-32B-Instruct | 32B | yes |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 30B (MoE, 3B active) | yes |
| Qwen/Qwen3-VL-235B-A22B-Instruct | 235B (MoE, 22B active) | - |
| Model | Params | Tested |
|---|---|---|
| mistralai/Ministral-3-3B-Instruct-2512-BF16 | 3B | yes |
| mistralai/Ministral-3-8B-Instruct-2512-BF16 | 8B | yes |
| mistralai/Ministral-3-14B-Instruct-2512-BF16 | 14B | yes |
| mistralai/Mistral-Small-3.2-24B-Instruct-2506 | 24B | yes |
| mistralai/Magistral-Small-2509 | 24B | yes |
| mistralai/Devstral-Small-2-24B-Instruct-2512 | 24B | yes |
Note: Some Mistral models ship with `tekken.json` instead of `tokenizer.json`. The BF16 variants (recommended) include `tokenizer.json` directly.
Models are hosted on HuggingFace. Use the hf CLI to download:
```bash
# Install the HuggingFace CLI
pip install huggingface_hub

# Download a model (example: Qwen3-4B)
hf download Qwen/Qwen3-4B

# Download a vision-language model
hf download Qwen/Qwen3-VL-8B-Instruct

# Download a Mistral model (use the BF16 variant for tokenizer.json)
hf download mistralai/Ministral-3-8B-Instruct-2512-BF16
```

Models are cached in `~/.cache/huggingface/hub/`. Pass the snapshot path to `--model`:

```bash
herbert-cli --model ~/.cache/huggingface/hub/models--Qwen--Qwen3-4B/snapshots/<hash>
```

| Format | Weights | KV Cache | Notes |
|---|---|---|---|
| BF16 | 2 bytes/param | — | Full precision baseline |
| INT8 | 1 byte/param | 1 byte/elem | Per-channel symmetric, VNNI acceleration |
| Q4 | 0.5 bytes/param | — | Per-group (group_size=32), pre-interleaved nibbles |
Mixed precision is supported: Q4, INT8, and BF16 weights can be combined in the same model.
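To make the table concrete, here is a minimal sketch of per-channel symmetric INT8 quantization (one scale per output channel, as listed above). Function names are illustrative; Herbert's real routines are hand-written kernels, not this code.

```rust
// Per-channel symmetric INT8 quantization sketch: for each channel,
// scale = max(|w|) / 127 and q = round(w / scale), so values map to [-127, 127].
// Illustrative only; not Herbert's actual implementation.
fn quantize_channel(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = weights.iter().map(|&w| (w / scale).round() as i8).collect();
    (quantized, scale)
}

// Dequantization is a single multiply per element by the channel scale.
fn dequantize_channel(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

The symmetric (zero-point-free) form keeps the inner product a pure integer dot product times one scale, which is what makes VNNI acceleration straightforward.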
- BF16 scalar — pure f32, no SIMD (reference/verification)
- BF16 AVX-512 — `vdpbf16ps` native BF16 dot product (Zen4, Sapphire Rapids)
- INT8 AVX-512 — `VPDPBUSD` VNNI with fused QKV and gate+up projections
- Q4 AVX-512 — pre-interleaved nibble layout, huge pages, fused tile-local scales
All critical inner loops are hand-written x86-64 assembly (.S files).
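The Q4 storage math (0.5 bytes/param, one scale per group of 32) can be sketched as below. This shows only the packing format, not the pre-interleaved layout the assembly kernels consume; all names are illustrative.

```rust
// Q4 per-group quantization sketch (group_size = 32): one f32 scale per group,
// two signed 4-bit values packed per byte. Illustrative only; Herbert's real
// kernels use a pre-interleaved nibble layout not reproduced here.
const GROUP_SIZE: usize = 32;

fn quantize_group(group: &[f32; GROUP_SIZE]) -> ([u8; GROUP_SIZE / 2], f32) {
    let max_abs = group.iter().fold(0f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let mut packed = [0u8; GROUP_SIZE / 2];
    for i in 0..GROUP_SIZE / 2 {
        // map each weight to a signed 4-bit value in [-8, 7], then bias to [0, 15]
        let lo = ((group[2 * i] / scale).round() as i32).clamp(-8, 7) + 8;
        let hi = ((group[2 * i + 1] / scale).round() as i32).clamp(-8, 7) + 8;
        packed[i] = (lo as u8) | ((hi as u8) << 4);
    }
    (packed, scale)
}
```

Per group this stores 16 bytes of nibbles plus one 4-byte scale, which is where the ~0.5 bytes/param figure in the format table comes from.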
- Metal (macOS) — 92 compute shaders covering Q4/INT8/BF16 matvec, matmul, flash attention, MoE, vision encoding, KV cache management
- Vulkan (Linux) — 31 GLSL compute shaders with cooperative matrix support, portable across AMD/NVIDIA/Intel. Use `--gpu list` to enumerate devices, `--gpu N` to select (0 = first discrete, 1000+ = global index for iGPU)
- Streaming token generation with UTF-8 multi-byte handling
- KV cache quantization (BF16, INT8)
- Thinking mode control for reasoning models (`--nothink`, `--think-budget`)
- Repetition loop detection
- Token sampling: temperature, top-k, top-p, greedy
- Mixture-of-Experts with batched expert dispatch
- Vision-Language support with multi-image input
Interactive chat or single-shot inference.
```bash
# Single-shot
herbert-cli --model <path> --prompt "What is 2+2?"

# Interactive chat
herbert-cli --model <path>

# With a system prompt
herbert-cli --model <path> --system "You are a helpful assistant."

# Vision (Qwen3-VL or Pixtral models)
herbert-cli --model <path> --image photo.jpg --prompt "Describe this image."

# Greedy decoding (temperature=0)
herbert-cli --model <path> --temperature 0 --prompt "Hello"

# Show stats after generation
herbert-cli --model <path> --prompt "Hello" --verbose

# Choose a specific backend
herbert-cli --model <path> --backend metal-q4
herbert-cli --model <path> --backend help   # list available backends

# GPU selection (Vulkan)
herbert-cli --gpu list                                       # list available GPUs
herbert-cli --gpu 0 --model <path> --backend vulkan-bf16     # first discrete GPU (default)
herbert-cli --gpu 1000 --model <path> --backend vulkan-bf16  # iGPU (global index)

# Tool calling
herbert-cli --model <path> --tools
```

Sampling options: `--temperature` (default 0.4), `--top-k` (default 40), `--top-p` (default 0.9), `--max-tokens` (default 2048)
Chat commands: `/help`, `/config`, `/temp`, `/topk`, `/topp`, `/think`, `/nothink`, `/tools`, `/image`, `/stats`, `/arch`, `/clear`, `/quit`
Built-in tools (with `--tools`): `get_datetime`, `calculate`, `list_directory`, `read_file`
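The sampling chain behind those options can be sketched as temperature softmax, then top-k, then top-p, matching the defaults above. This is an illustrative sketch with invented names, not Herbert's actual sampler.

```rust
// Sketch of a temperature -> top-k -> top-p sampling filter. Returns the
// surviving (token_index, probability) candidates, renormalized.
fn sample_filter(logits: &[f32], temperature: f32, top_k: usize, top_p: f32) -> Vec<(usize, f32)> {
    // temperature == 0 selects greedy decoding: keep only the argmax token
    if temperature == 0.0 {
        let argmax = logits.iter().enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap();
        return vec![(argmax, 1.0)];
    }
    // softmax over temperature-scaled logits (max-subtracted for stability)
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut probs: Vec<(usize, f32)> = logits.iter().enumerate()
        .map(|(i, &l)| (i, ((l - max) / temperature).exp()))
        .collect();
    let sum: f32 = probs.iter().map(|p| p.1).sum();
    for p in probs.iter_mut() {
        p.1 /= sum;
    }
    // top-k: keep the k most probable candidates
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(top_k);
    // top-p: keep the smallest prefix whose cumulative mass reaches p
    let mut cum = 0.0;
    let mut keep = probs.len();
    for (i, p) in probs.iter().enumerate() {
        cum += p.1;
        if cum >= top_p {
            keep = i + 1;
            break;
        }
    }
    probs.truncate(keep);
    // renormalize the survivors so they form a distribution to sample from
    let total: f32 = probs.iter().map(|p| p.1).sum();
    for p in probs.iter_mut() {
        p.1 /= total;
    }
    probs
}
```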
Anthropic Messages API compatible server with SSE streaming.
```bash
herbert-server --model <path> --addr 0.0.0.0:3000

# With API-key authentication
herbert-server --model <path> --addr 0.0.0.0:3000 --api-key mysecretkey
```

Endpoints:

- `POST /v1/messages` — chat completion (streaming SSE or JSON)
- `POST /v1/messages/count_tokens` — token counting
- `POST /v1/tokenize` — tokenization
- `GET /v1/metrics` — performance metrics
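Assuming the Anthropic Messages request schema (the stated compatibility target), a minimal chat request body might look like the following; the `model` value here is illustrative, since the server serves whichever model it loaded:

```json
{
  "model": "local",
  "max_tokens": 256,
  "stream": true,
  "messages": [
    { "role": "user", "content": "What is 2+2?" }
  ]
}
```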
Native desktop application built with Tauri 2 and Vue 3.
- Multi-conversation chat with sidebar
- Streaming with markdown rendering and syntax highlighting
- Image support with drag & drop and encoding progress
- Per-message performance stats
- Model loading with progress feedback
- Settings panel for sampling parameters and backend selection
Most inference engines optimize for prefill throughput (batched GEMM). Herbert takes a different approach: it is built around the assumption that inference performance is limited by memory bandwidth, not compute. It prioritizes decode speed, which is what determines the user experience in interactive use.
At decode time, the bottleneck is memory bandwidth — each generated token requires reading the full KV cache. Herbert addresses this with an INT8 KV cache that halves the bandwidth requirement compared to FP16, using hand-written VNNI kernels that avoid the dequantization overhead seen in other implementations.
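The bandwidth argument reduces to simple arithmetic: each generated token reads the entire KV cache once, so bytes read grow linearly with context length and with bytes per element. A back-of-envelope sketch (the architecture numbers used below are illustrative, not a specific model's):

```rust
// Bytes of KV cache read per generated token. K and V each hold
// layers * kv_heads * head_dim elements per cached token, and every cached
// token is read once per generated token.
fn kv_bytes_per_token(layers: usize, kv_heads: usize, head_dim: usize,
                      bytes_per_elem: usize, context_len: usize) -> usize {
    2 * layers * kv_heads * head_dim * bytes_per_elem * context_len
}
```

For example, with 36 layers, 8 KV heads, and head_dim 128 (illustrative), a 2-byte cache reads 144 KB per cached token for every generated token; a 1-byte INT8 cache halves that, which is why the advantage grows with context length.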
In practice:
- Prefill: not the primary optimization target yet (batched matmul is in progress). Currently 1.5-2x behind llama.cpp on dense models
- Decode (short context): on par with llama.cpp
- Decode (long context): performance improves as context grows, because KV cache bandwidth becomes the dominant cost — and Herbert reduces that cost
On Mixture-of-Experts models, Herbert's expert batching also improves prefill, leading to better performance across the board.
These results are consistent across all four benchmarked models and architecture types (dense, dense VL, MoE).
Every kernel optimization is validated empirically using:
- Hardware performance counters (AMD Zen4 PMC via `perf_event_open` + `rdpmc` fast path) — cycle-precise, core-pinned, multi-pass measurement of L1/L2/L3 cache behavior, retired instructions, and branch mispredictions
- Memory bandwidth sweeps — working set sizes from L1 (48KB) through L2 (1.25MB) to DRAM (64MB+) to establish theoretical bandwidth ceilings
- Wall-clock throughput — end-to-end prefill and decode measurements with controlled cooldown periods between runs
CPU-only benchmarks on an AMD Ryzen 9 7900 (12C/24T, AVX-512, 96 GB DDR5), comparing Herbert with llama.cpp, HF Transformers, vLLM-CPU, and ONNX Runtime.
Herbert's KV cache quantization (INT8 by default) gives it an increasing advantage over llama.cpp as context length grows. On short contexts, llama.cpp is slightly faster; on longer contexts (1K+ tokens), Herbert pulls ahead.
Qwen3-0.6B — Q4 decode (t/s)
| Context | Herbert Q4 | llama.cpp Q4 | vLLM BF16 | HF Transformers |
|---|---|---|---|---|
| ~100 | 110 | 110 | 30 | 27 |
| ~1000 | 101 | 97 | 29 | 22 |
| ~5000 | 70 | 56 | 25 | 11 |
| ~10000 | 52 | 35 | 18 | 7 |
| ~16000 | 29 | 17 | 14 | 3 |
Qwen3-VL-30B-A3B (MoE) — Q4 decode (t/s)
| Context | Herbert Q4 | llama.cpp Q4 | vLLM BF16 | HF Transformers |
|---|---|---|---|---|
| ~300 | 27 | 25 | 6 | 5 |
| ~1300 | 25 | 23 | 6 | 5 |
| ~3200 | 21 | 19 | 6 | 4 |
| ~6400 | 17 | 15 | — | — |
On MoE models, Herbert also wins on prefill thanks to its expert batching optimizations (moe-v6).
llama.cpp has faster prefill on dense models (~1.5-2x) due to its batched GEMM. vLLM-CPU has the best prefill overall thanks to chunked prefill + torch matmul.
Qwen3-0.6B — prefill at ~1000 tokens
| Engine | t/s |
|---|---|
| vLLM-CPU BF16 | 1989 |
| HF Transformers BF16 | 1292 |
| llama.cpp BF16 | 1057 |
| Herbert Q4 | 751 |
Full benchmarks with all prompt sizes, quantizations, and KV cache configurations are in docs/benchmarks/:
- Qwen3-0.6B — dense 0.6B, 5 engines including ONNX Runtime
- Qwen3-VL-4B — dense VL 4B
- Ministral-3 3B — dense text 3B (Mistral)
- Qwen3-VL-30B-A3B — MoE 30B (3B active)
```bash
cargo build --release
```

The build auto-detects available CPU features (AVX-512, VNNI, AVX-512 BF16) and compiles the appropriate assembly kernels. Metal shaders are compiled on macOS, Vulkan shaders on Linux.
| Platform | CPU Backends | GPU Backend |
|---|---|---|
| macOS (Apple Silicon) | BF16 scalar/NEON | Metal |
| Linux (x86-64) | BF16, BF16-AVX2/512, INT8-AVX2/512, Q4-AVX2/512 | Vulkan |
MIT — see LICENSE