LLM inference engine for AMD RDNA4 GPUs. Pure Rust + Vulkan compute shaders. ~14 MB static binary; no runtime dependencies beyond the system Vulkan loader.
First engine to do native FP8 WMMA over Vulkan on consumer AMD hardware (`V_WMMA_F32_16X16X16_FP8_FP8` via Mesa 26.1+ `shaderFloat8CooperativeMatrix`).
This project builds on the foundational work of oldnordic. Without his original ROCmForge implementation — the model loader, the CPU inference path, the GGUF parser, and the overall architecture — none of the WMMA matrix-core optimisations, the multi-model support, or the interactive chat CLI would have been possible. Thank you for making this project a reality.
- Wins decode on every direct comparison on RDNA4 — beats llama.cpp (Vulkan + ROCm) on Q4_K_M 8B, beats vLLM 0.20.1 ROCm on FP8 single-user decode (1.3–2× ahead).
- Native FP8 E4M3 loader that ingests HuggingFace SafeTensors directly, no FP16 round-trip on disk.
- All three FP8 scaling strategies auto-detected: per-tensor, per-channel, block-wise [128, 128].
- Native FP8 WMMA on Mesa 26.1+ (auto-enabled when the driver advertises `shaderFloat8CooperativeMatrix`): +45–58 % FP8 prefill across all three sub-types.
- CPU `lm_head` offload (v0.3.10): Q6_K weights in CPU RAM, hand-tuned AVX-512 GEMV (Zen 4). Frees ~970 MB of VRAM, and on 14B FP8 it is 32 % faster than the GPU baseline (17.8 vs 13.5 tok/s).
- 2× better power efficiency (tok/s/W) on decode vs llama.cpp.
- Llama-3, Qwen2.5, Qwen3, Mistral, DeepSeek-R1-Distill, Gemma-4 model families covered (the Gemma-4 SafeTensors path produces coherent English with full Markdown structure as of v0.3.14; see docs/MODELS.md and the v0.3.14 entry in CHANGELOG.md for the 8-bug coherence fix-up plus the `forward.rs` refactor that ships alongside it).
- `forward.rs` refactor (v0.3.14): the 7816-LOC dispatch file splits into 13 sibling modules with a `LayerStep` enum + two `LayerExecutor` impls. The Sprint 43F bug class ("added a per-layer step in decode but forgot prefill") becomes a compile error in both executors until the new variant is handled; see the sketch after this list.
- 104 / 105 coherent on the deterministic 15-prompt suite across all production configurations: six 8B+ paths plus Gemma-4-E2B SafeTensors (14/15; the one miss is a Gemma-tokenizer emoji surrogate). See Quality below.
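A minimal sketch of that enum-plus-executors pattern (illustrative names only, not the actual `forward.rs` modules):

```rust
// Illustrative sketch: every per-layer step is a LayerStep variant, and
// both executors match on it exhaustively, so forgetting to handle a new
// step in one of them is a compile error rather than a silent skew
// between the prefill and decode paths.
enum LayerStep {
    Attention,
    Mlp,
    // A new variant added here forces BOTH impls below to handle it.
}

trait LayerExecutor {
    fn run(&mut self, step: &LayerStep);
}

struct PrefillExecutor;
struct DecodeExecutor;

impl LayerExecutor for PrefillExecutor {
    fn run(&mut self, step: &LayerStep) {
        match step {
            LayerStep::Attention => { /* batched attention dispatch */ }
            LayerStep::Mlp => { /* batched MLP dispatch */ }
        } // a non-exhaustive match here fails to compile
    }
}

impl LayerExecutor for DecodeExecutor {
    fn run(&mut self, step: &LayerStep) {
        match step {
            LayerStep::Attention => { /* single-token attention dispatch */ }
            LayerStep::Mlp => { /* single-token MLP dispatch */ }
        }
    }
}
```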
```bash
# GGUF — no flag needed, default-everywhere path (Mesa 26.1+ recommended)
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf

# FP8 SafeTensors — one flag, no --tokenizer-from needed (v0.3.13)
# Auto-detects: FP8 model (config.json), native WMMA (Mesa 26.1+),
# AVX-512 (host), model size → CPU lm_head offload (≥ 12 B).
# Auto-loads: tokenizer.json + chat_template from the model dir.
VF_FP8=auto vulkanforge chat --model ~/models/Qwen3-8B-FP8/

# 14 B FP8 with auto CPU lm_head — saves 970 MB VRAM, +9 % decode
VF_FP8=auto vulkanforge chat --model ~/models/Qwen2.5-14B-Instruct-FP8/
```

The legacy v0.3.10 flags (`VULKANFORGE_ENABLE_FP8=1`, `VF_CPU_LM_HEAD=1`) and `--tokenizer-from <gguf>` still work as explicit overrides; they are handy when you want CPU `lm_head` on an 8B model for VRAM headroom, or want to force a specific tokenizer source for a regression check.
Native FP8 WMMA was a flag in v0.3.10–v0.3.15 (`VF_FP8_NATIVE_WMMA=1`); v0.3.16 (Sprint 47B) removes the flag and makes the routing capability-driven: VulkanForge picks the native FP8 path automatically whenever the driver advertises `shaderFloat8CooperativeMatrix`. Check with:
```bash
vulkaninfo 2>/dev/null | grep shaderFloat8CooperativeMatrix
```
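In code, the same capability can be queried at device init. A minimal sketch, assuming an `ash` release generated from Vulkan headers that include `VK_EXT_shader_float8` (struct and field names follow the Vulkan spec; this is not VulkanForge's actual detection code):

```rust
use ash::vk;

/// Sketch: true when the device advertises FP8 cooperative-matrix support.
fn has_native_fp8_wmma(instance: &ash::Instance, phys: vk::PhysicalDevice) -> bool {
    // Chain the EXT feature struct into PhysicalDeviceFeatures2 and query it.
    let mut fp8 = vk::PhysicalDeviceShaderFloat8FeaturesEXT::default();
    let mut features2 = vk::PhysicalDeviceFeatures2::default().push_next(&mut fp8);
    unsafe { instance.get_physical_device_features2(phys, &mut features2) };
    fp8.shader_float8_cooperative_matrix == vk::TRUE
}
```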
Build from source:

```bash
cargo build --release   # Rust 1.85+, Vulkan headers required
```

All numbers on AMD Radeon RX 9070 XT (gfx1201, RDNA4), Mesa 26.1-rc3 RADV unless noted. Full tables with power data and methodology in docs/BENCHMARKS.md.
| Engine | Model | Format | Backend | Decode tok/s | tok/s/W |
|---|---|---|---|---|---|
| VF v0.3.9 | 8B Llama | Q4_K_M | Vulkan | 121 | 0.58 |
| llama.cpp | 8B Llama | Q4_K_M | Vulkan | 114 | 0.37 |
| llama.cpp | 8B Llama | Q4_K_M | ROCm | 94 | 0.30 |
The VF v0.3.9 row above is `vulkanforge bench tg128` (1-token prompt, constant-low KV). The 15-prompt suite is a real-workload mix (smoke / code / prose / reasoning / context-stress / numerics / tokenizer) with generations up to 1024 tokens; KV grows during decode, so steady-state numbers are below tg128. Measured on Mesa 26.1.0 RADV.
| Model | Prefill avg | Decode avg | Avg W | tok/s/W | Quality |
|---|---|---|---|---|---|
| Qwen3-8B Q4_K_M | 701 t/s ² | 104.0 t/s | 258 W | 0.40 | 15/15 ✓ |
| Llama-3.1-8B Q4_K_M | 824 t/s | 110.0 t/s | 271 W | 0.41 | 15/15 ✓ |
| Qwen3-8B FP8 | 559 t/s | 60.7 t/s | 193 W | 0.32 | 15/15 ✓ |
| Gemma-4-E2B-it (FP32 SafeTensors) | 96 t/s ¹ | 33.7 t/s | 64 W | 0.53 | 15/15 ✓ |
| Gemma-4-E2B-it (Q4_K on-load³) | 106 t/s | 52.0 t/s | 37 W | 1.39 | 15/15 ✓ |
¹ v0.3.15 lifted the v0.3.14 `force_per_token_prefill` workaround (33 → 89 → 96 t/s on v0.3.14 / v0.3.15 / v0.3.16). The batch path is bit-identical to the per-token reference (`VULKANFORGE_FORCE_PER_TOKEN=1` keeps the v0.3.14 path available as a bisect fallback). Decode is on par with the larger models on a tok/s/W basis (the best in the test) thanks to the 2 B parameter count keeping power draw at 64 W.
² v0.3.16 closes the v0.3.15 Sprint-46H barrier regression on Owner-only models (Qwen3, Llama). The Q-side barriers are now gated on the Gemma-4 subscriber predicate; Qwen3-Q4_K_M prefill recovers from 638 to 701 t/s (+9.9 %).
³ v0.3.17 adds on-the-fly Q4_K quantization at model load (`VF_QUANTIZE_ON_LOAD=1`). Gemma-4 SafeTensors weights are quantized FP32 → Q4_K_M on the CPU (rayon-parallelized; the load completes in 13.2 s) and routed through the existing Q4_K shader pipeline. Decode +54 %, power −41 %, tok/s/W 1.39, the best in the suite. VRAM 8.51 → 2.49 GiB (7.1× compression on the quantized tensors; norms / embeddings stay FP32). Coherence is identical to the FP32 baseline.
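A hedged sketch of the load-time shape of that path: a rayon fan-out over tensors feeding a simplified absmax 4-bit block packer (the real Q4_K_M super-block layout with hierarchical scales is more involved, and the names here are illustrative):

```rust
use rayon::prelude::*;

/// Simplified absmax 4-bit block quantizer. NOT the real Q4_K super-block
/// format; it only shows the FP32 -> packed-nibbles-plus-scale shape.
fn quantize_block_4bit(block: &[f32]) -> (Vec<u8>, f32) {
    let absmax = block.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let scale = if absmax > 0.0 { absmax / 7.0 } else { 1.0 };
    let q = |v: f32| ((v / scale).round().clamp(-8.0, 7.0) as i8 + 8) as u8;
    let nibbles = block
        .chunks(2)
        .map(|pair| q(pair[0]) | (q(*pair.get(1).unwrap_or(&0.0)) << 4))
        .collect();
    (nibbles, scale)
}

/// Quantize every (hypothetical) flattened weight tensor in parallel at
/// load time; norms / embeddings would be skipped and stay FP32.
fn quantize_on_load(tensors: Vec<(String, Vec<f32>)>) -> Vec<(String, Vec<(Vec<u8>, f32)>)> {
    tensors
        .into_par_iter()
        .map(|(name, w)| (name, w.chunks(32).map(quantize_block_4bit).collect()))
        .collect()
}
```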
| Model | Scale type | VulkanForge prefill (tok/s) | vLLM 0.20.1 ROCm* prefill (tok/s) |
|---|---|---|---|
| Llama-3.1-8B FP8 | per-tensor | 1130 | 14757 |
| Qwen2.5-14B FP8 | per-channel | 428 | (n/a) |
| Qwen3-8B FP8 | block-wise [128,128] | 1118 | 2776 |
* vLLM 0.20.1 is not optimized for gfx1201; model load logs `Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal!`. Per-tensor uses `ROCmFP8ScaledMMLinearKernel` (specialized); block-wise uses `TritonFp8BlockScaledMMKernel` (untuned). Run with `VLLM_ROCM_USE_AITER=0` and `--enforce-eager` (the only working configuration on RDNA4).
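The three scale layouts in the table above differ only in how the dequant scale is looked up per weight; a minimal sketch with illustrative names (not VulkanForge's actual types):

```rust
// Illustrative only: how a per-tensor, per-channel, or block-wise [128,128]
// weight_scale tensor is indexed during FP8 dequantization.
enum Fp8Scale {
    PerTensor(f32),
    PerChannel(Vec<f32>),   // one scale per output row
    BlockWise { scales: Vec<f32>, block: (usize, usize), blocks_per_row: usize },
}

fn weight_scale(s: &Fp8Scale, row: usize, col: usize) -> f32 {
    match s {
        Fp8Scale::PerTensor(v) => *v,
        Fp8Scale::PerChannel(rows) => rows[row],
        Fp8Scale::BlockWise { scales, block, blocks_per_row } => {
            // Row-major grid of [128, 128] tiles, one scale per tile.
            scales[(row / block.0) * blocks_per_row + col / block.1]
        }
    }
}
```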
| Model | VulkanForge (tok/s @ Avg W) | vLLM 0.20.1 ROCm (tok/s @ Avg W) | VF tok/s/W gain |
|---|---|---|---|
| Llama-3.1-8B FP8 | 70 t/s @ 166 W | 53 t/s @ 159 W | +27 % |
| Qwen3-8B FP8 | 62 t/s @ 125 W | 22 t/s @ 167 W | +267 % |
VF wins decode 1.3–2×; vLLM wins prefill 2.5–12×. Pick the engine that fits the workload — single-user chat is VulkanForge, batch serving is vLLM.
`VF_CPU_LM_HEAD=1` moves the vocabulary projection onto the CPU as Q6_K, freeing ~970 MB of VRAM. A hand-tuned AVX-512 kernel does the GEMV (Zen 4 / Ice Lake+ detected at runtime; scalar fallback otherwise).
| Model | GPU `lm_head` | CPU `lm_head` (AVX-512) | VRAM saved | Verdict |
|---|---|---|---|---|
| Llama-3.1-8B-FP8 | 70 tok/s | 47.6 tok/s | −970 MB | use for VRAM, not speed |
| Qwen2.5-14B-FP8 | 13.5 tok/s | 17.8 tok/s (+32 %) | −970 MB | CPU wins both axes |
The 14B win is structural: the GPU `lm_head` GEMV is bandwidth-bound on the 644 GB/s VRAM, and offloading it lets DDR5 (32 threads, L3 → DDR5-5600 at 76 GB/s) carry that work in parallel while the rest of the GPU pipeline is freed up. Combined with the 970 MB VRAM saving, the flag is a default-on candidate for 14B FP8 on Zen 4.
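The runtime-dispatch pattern is the standard one; here is a minimal sketch with a plain f32 dot product standing in for the hand-tuned Q6_K GEMV (AVX-512 intrinsics require a recent stable Rust):

```rust
// Sketch only: runtime AVX-512 detection with a scalar fallback, the same
// shape as the CPU lm_head path but without the Q6_K dequantization.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn dot_avx512(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let mut acc = _mm512_setzero_ps();
    let chunks = a.len() / 16;
    for i in 0..chunks {
        let va = _mm512_loadu_ps(a.as_ptr().add(i * 16));
        let vb = _mm512_loadu_ps(b.as_ptr().add(i * 16));
        acc = _mm512_fmadd_ps(va, vb, acc); // 16 fused multiply-adds per step
    }
    let mut lanes = [0.0f32; 16];
    _mm512_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    for i in chunks * 16..a.len() {
        sum += a[i] * b[i]; // scalar tail
    }
    sum
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") {
            return unsafe { dot_avx512(a, b) };
        }
    }
    a.iter().zip(b).map(|(x, y)| x * y).sum() // scalar fallback
}
```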
| Feature | Flag | Requires | Effect |
|---|---|---|---|
| FP8 model loading | `VULKANFORGE_ENABLE_FP8=1` | Mesa 26.1+ (or 26.0.6 BF16 path) | Load HuggingFace FP8 SafeTensors |
| Native FP8 WMMA | (auto) | `shaderFloat8CooperativeMatrix` (Mesa 26.1+) | +45–58 % FP8 prefill |
| CPU `lm_head` offload | `VF_CPU_LM_HEAD=1` | AVX-512F + BW + VL (Zen 4 / Ice Lake+) | −970 MB VRAM, 14B +32 % decode |
| On-the-fly Q4_K | `VF_QUANTIZE_ON_LOAD=1` | SafeTensors model with FP32 / BF16 weights | Quantize 2D weights to Q4_K_M at load; ~7× VRAM compression on quantized tensors, routes through the Q4_K shader pipeline. Gemma-4-E2B: decode +54 %, power −41 %, tok/s/W 1.39 |
| FP8 KV-cache | `VULKANFORGE_KV_FP8=1` | Mesa 26.1+ (heterogeneous `head_dim` auto-handled) | −50 % KV-cache VRAM. Gemma-4-26B-A4B: 880 → 440 MB |
| Expert-grouped MoE prefill (v0.4.5) | `VF_MOE_GROUPED=1` | MoE model (Gemma-4-26B-A4B et al.) | +43 % prefill on Gemma-4-26B-A4B Q3_K_M (65 → 93 t/s). Per-MoE-layer dispatches ~800 → ~450 |
| Batched MoE decode (v0.4.5) | `VF_MOE_BATCHED_DECODE=1` | MoE model | +4.8 % decode on Gemma-4-26B-A4B Q3_K_M (27.3 → 28.6 t/s). 800 → 450 MoE dispatches/token |
| VRAM budget probe (v0.4.5) | (auto) | Linux sysfs `/sys/class/drm/card*/device/mem_info_vram_*` | Diagnostic. Warns when free VRAM < `VF_VRAM_HEADROOM_GIB` (default 1.0) |
| Tensor-load progress bar (v0.4.5) | (auto, suppress with `VF_NO_LOAD_PROGRESS=1`) | — | `\r`-overwritten stderr progress bar during GGUF / SafeTensors upload |
All features are opt-in. Without flags, VulkanForge runs GGUF models on Mesa 26.1+ with no special configuration. `VF_FP8=auto` picks the right FP8 path based on what the driver actually advertises.
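Illustratively, the auto cascade boils down to a small decision over the model format and the driver capability (sketch only, not VulkanForge's actual routing code):

```rust
// Illustrative-only cascade for the VF_FP8=auto case.
enum Fp8Path {
    NativeWmma,   // Mesa 26.1+: shaderFloat8CooperativeMatrix advertised
    Bf16Fallback, // Mesa 26.0.6: FP8 weights through the BF16 conversion path
    Disabled,     // GGUF default path, no FP8 involved
}

fn pick_fp8_path(is_fp8_model: bool, driver_has_fp8_coopmat: bool) -> Fp8Path {
    match (is_fp8_model, driver_has_fp8_coopmat) {
        (false, _) => Fp8Path::Disabled,
        (true, true) => Fp8Path::NativeWmma,
        (true, false) => Fp8Path::Bf16Fallback,
    }
}
```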
The deterministic 15-prompt suite (greedy decoding, temperature = 0), run on every production configuration (six 8B+ paths plus Gemma-4-E2B SafeTensors):
| Configuration | Coherent | Median decode |
|---|---|---|
| Qwen3-8B Q4_K_M GGUF | 15/15 | 107 tok/s |
| Llama-3.1-8B Q4_K_M GGUF | 15/15 | 112 tok/s |
| Qwen3-8B-FP8 native WMMA + activation quant | 15/15 | 62 tok/s |
| Qwen2.5-14B-FP8 native WMMA + CPU lm_head | 15/15 | 17 tok/s |
| Llama-3.1-8B-FP8 native WMMA | 15/15 | 70 tok/s |
| Llama-3.1-8B-FP8 native WMMA + CPU lm_head | 15/15 | 46 tok/s |
| Gemma-4-E2B-it SafeTensors (v0.3.14, new) | 14/15 | 34 tok/s |
104 / 105 prompts (99 %) coherent across the full suite. v0.3.14 adds the Gemma-4-E2B-it SafeTensors path at 14/15 coherent; the single miss is the emoji-identification prompt, where the Gemma-4 tokenizer's surrogate-pair handling drops the input emojis before the model sees them.
v0.3.11 closes the v0.3.10 Llama-FP8 per-tensor edge case (2/15 code-gen prompts collapsing to `!`) by porting the Sprint 39 per-block activation-absmax + rescale pattern to the per-tensor WMMA path. The fix costs ~5 % on prefill (1197 → 1130 tok/s on 8B-FP8 pp=512), an unavoidable trade for keeping post-RMS-norm activations inside the FP8 E4M3 ±448 envelope.
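The rescale itself is small; a hedged scalar sketch of the pattern (the FP8 cast is omitted and the real work happens in the WMMA shader, so treat the names as illustrative):

```rust
// Per-block activation absmax + rescale so post-RMS-norm activations fit
// inside the FP8 E4M3 representable range of roughly ±448.
const E4M3_MAX: f32 = 448.0;

/// Returns the rescaled block (ready for the FP8 cast) and the per-block
/// scale that the matmul epilogue multiplies back in.
fn rescale_block_for_e4m3(block: &[f32]) -> (Vec<f32>, f32) {
    let absmax = block.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let scale = if absmax > 0.0 { absmax / E4M3_MAX } else { 1.0 };
    let rescaled = block
        .iter()
        .map(|&v| (v / scale).clamp(-E4M3_MAX, E4M3_MAX))
        .collect();
    (rescaled, scale)
}
```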
| Mesa version | Capabilities |
|---|---|
| 26.1+ | Default. Native FP8 WMMA via `shaderFloat8CooperativeMatrix` |
| 26.0.6 | Legacy. GGUF + FP8 SafeTensors via the BF16 conversion path (no native FP8 WMMA) |
For 14B+ models, set `amdgpu.lockup_timeout=10000,10000` on the kernel command line; the default 2 s compute timeout is too short for long prefill submits. Setup details and troubleshooting in docs/INSTALLATION.md.
- docs/BENCHMARKS.md — full VulkanForge vs llama.cpp vs vLLM tables, power data, methodology
- docs/INSTALLATION.md — Mesa setup, kernel parameter, environment variables, troubleshooting
- docs/MODELS.md — supported GGUF / FP8 formats, model architectures, FP8 scaling strategies
- CHANGELOG.md — release history with per-sprint performance deltas
```bash
vulkanforge chat --model <PATH> [--tokenizer-from <GGUF>] ...
vulkanforge bench --model <PATH> [--tokenizer-from <GGUF>] [--runs N]
vulkanforge serve --model <PATH> [--host 127.0.0.1] [--port 8080] [--cors]
```

`vulkanforge chat --help` lists every flag (sampling, max-tokens, think-filter, max-context). The chat REPL accepts `/help`, `/quit`, `/reset` (clear KV cache + history without reloading the model), and a single-shot mode via `VF_PROMPT="..."`.
OpenAI-compatible HTTP server. Drop-in backend for Open WebUI, SillyTavern, Continue.dev, the OpenAI Python SDK, LangChain, and any other client that speaks Chat Completions.
```bash
vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --port 8080
```
| Path | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat (streaming via SSE + sync JSON) |
| `/v1/models` | GET | List loaded model |
| `/health` | GET | Liveness + KV-cache status |

The non-prefixed aliases `/chat/completions` and `/models` are also routed for clients that omit the `/v1/` prefix.
```bash
# Non-streaming
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-q4_k_m",
    "messages": [{"role": "user", "content": "What is a mutex?"}],
    "max_tokens": 100
  }'

# Streaming (SSE)
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-q4_k_m",
    "messages": [{"role": "user", "content": "Hi"}],
    "stream": true,
    "stream_options": {"include_usage": true},
    "max_tokens": 50
  }'
```

| Flag | Default | Notes |
|---|---|---|
| `--host` | `127.0.0.1` | Use `0.0.0.0` for remote/Docker (no auth in v0.4) |
| `--port` | `8080` | TCP listen port |
| `--cors` | off | Enable CORS (browser UIs on different ports) |
| `--ctx-size` | `2048` | KV-cache capacity in tokens |
| `--served-model-name` | basename | Override the model id reported by `/v1/models` |
| `--tokenizer-from` | — | Reserved for SafeTensors serve (v0.4.1) |
- In: Streaming + non-streaming chat, `frequency_penalty` → repetition-penalty mapping, `stream_options.include_usage`, `developer` role alias for `system`, `chat_template_kwargs.enable_thinking` toggle for `<think>` filtering.
- Out (v0.4): Multi-turn history (system + user only), tool calling, vision content, embeddings, `/v1/completions`, auth, SafeTensors directory models (use `vulkanforge chat` for those).
- Single-stream only: no batch inference, no concurrent sessions on one `Forward` instance.
- Decode at 0.80–1.06× llama.cpp Vulkan (model-dependent); coopmat is prefill-only on this codebase.
- FP8 prefill structurally behind ROCm-specialized kernels (vLLM's `ROCmFP8ScaledMMLinearKernel` is in a different class).
- `vulkanforge bench` accepts only Q4_K_M GGUF; Q8_0 chat works but does not bench.
- Mistral / Llama-2 SPM tokenizer not yet wired for FP8 SafeTensors (only the `gpt2` tokenizer family).
For the full architectural notes and the v0.2.x optimization audit (nine falsified hypotheses against the residual gap to llama.cpp), see CHANGELOG.md.
VulkanForge is licensed under the GNU General Public License v3.0.