LLM inference engine for AMD RDNA4 GPUs. Pure Rust + Vulkan compute shaders. ~14 MB static binary; no runtime dependencies beyond the system Vulkan loader.
First engine to do native FP8 WMMA over Vulkan on consumer AMD hardware (`V_WMMA_F32_16X16X16_FP8_FP8` via Mesa 26.1+ `shaderFloat8CooperativeMatrix`).
This project builds on the foundational work of oldnordic. Without his original ROCmForge implementation — the model loader, the CPU inference path, the GGUF parser, and the overall architecture — none of the WMMA matrix-core optimisations, the multi-model support, or the interactive chat CLI would have been possible. Thank you for making this project a reality.
- Wins decode on every direct comparison on RDNA4 — beats llama.cpp (Vulkan + ROCm) on Q4_K_M 8B, beats vLLM 0.20.1 ROCm on FP8 single-user decode (1.3–2× ahead).
- Native FP8 E4M3 loader that ingests HuggingFace SafeTensors directly, no FP16 round-trip on disk.
- All three FP8 scaling strategies auto-detected: per-tensor, per-channel, block-wise [128, 128].
- Native FP8 WMMA on Mesa 26.1+ (auto-enabled when the driver advertises `shaderFloat8CooperativeMatrix`): +45–58 % FP8 prefill across all three sub-types.
- CPU `lm_head` offload (v0.3.10): Q6_K weights in CPU RAM, hand-tuned AVX-512 GEMV (Zen 4). Frees ~970 MB of VRAM, and on 14B FP8 it is 32 % faster than the GPU baseline (17.8 vs 13.5 tok/s).
- 2× better power efficiency (tok/s/W) on decode vs llama.cpp.
- Llama-3, Qwen2.5, Qwen3, Mistral, DeepSeek-R1-Distill, Gemma-4 model families covered (the Gemma-4 SafeTensors path produces coherent English with full Markdown structure as of v0.3.14; see docs/MODELS.md and the v0.3.14 entry in CHANGELOG.md for the 8-bug coherence fix-up plus the `forward.rs` refactor that ships alongside it).
- `forward.rs` refactor (v0.3.14): the 7816-LOC dispatch file splits into 13 sibling modules with a `LayerStep` enum + two `LayerExecutor` impls. The Sprint 43F bug class ("added a per-layer step in decode but forgot prefill") becomes a compile error in both executors until the new variant is handled; see the sketch after this list.
- 104 / 105 coherent on the deterministic 15-prompt suite across all production configurations: six 8B+ paths plus Gemma-4-E2B SafeTensors (14/15; the one miss is a Gemma-tokenizer emoji surrogate). See Quality below.
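A minimal sketch of that enum-plus-executors pattern (illustrative names only, not the actual `forward.rs` modules):

```rust
// Illustrative sketch: every per-layer step is a LayerStep variant, and
// both executors match on it exhaustively, so forgetting to handle a new
// step in one of them is a compile error rather than a silent skew
// between the prefill and decode paths.
enum LayerStep {
    Attention,
    Mlp,
    // A new variant added here forces BOTH impls below to handle it.
}

trait LayerExecutor {
    fn run(&mut self, step: &LayerStep);
}

struct PrefillExecutor;
struct DecodeExecutor;

impl LayerExecutor for PrefillExecutor {
    fn run(&mut self, step: &LayerStep) {
        match step {
            LayerStep::Attention => { /* batched attention dispatch */ }
            LayerStep::Mlp => { /* batched MLP dispatch */ }
        } // a non-exhaustive match here fails to compile
    }
}

impl LayerExecutor for DecodeExecutor {
    fn run(&mut self, step: &LayerStep) {
        match step {
            LayerStep::Attention => { /* single-token attention dispatch */ }
            LayerStep::Mlp => { /* single-token MLP dispatch */ }
        }
    }
}
```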
```bash
# GGUF — no flag needed, default-everywhere path (Mesa 26.1+ recommended)
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf

# FP8 SafeTensors — one flag, no --tokenizer-from needed (v0.3.13)
# Auto-detects: FP8 model (config.json), native WMMA (Mesa 26.1+),
# AVX-512 (host), model size → CPU lm_head offload (≥ 12 B).
# Auto-loads: tokenizer.json + chat_template from the model dir.
VF_FP8=auto vulkanforge chat --model ~/models/Qwen3-8B-FP8/

# 14 B FP8 with auto CPU lm_head — saves 970 MB VRAM, +9 % decode
VF_FP8=auto vulkanforge chat --model ~/models/Qwen2.5-14B-Instruct-FP8/
```

The legacy v0.3.10 flags (`VULKANFORGE_ENABLE_FP8=1`, `VF_CPU_LM_HEAD=1`) and `--tokenizer-from <gguf>` still work as explicit overrides; they are handy when you want CPU `lm_head` on an 8B model for VRAM headroom, or want to force a specific tokenizer source for a regression check.
Native FP8 WMMA was a flag in v0.3.10–v0.3.15 (`VF_FP8_NATIVE_WMMA=1`); v0.3.16 (Sprint 47B) removes the flag and makes the routing capability-driven: VulkanForge picks the native FP8 path automatically whenever the driver advertises `shaderFloat8CooperativeMatrix`. Check with:
```bash
vulkaninfo 2>/dev/null | grep shaderFloat8CooperativeMatrix
```
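In code, the same capability can be queried at device init. A minimal sketch, assuming an `ash` release generated from Vulkan headers that include `VK_EXT_shader_float8` (struct and field names follow the Vulkan spec; this is not VulkanForge's actual detection code):

```rust
use ash::vk;

/// Sketch: true when the device advertises FP8 cooperative-matrix support.
fn has_native_fp8_wmma(instance: &ash::Instance, phys: vk::PhysicalDevice) -> bool {
    // Chain the EXT feature struct into PhysicalDeviceFeatures2 and query it.
    let mut fp8 = vk::PhysicalDeviceShaderFloat8FeaturesEXT::default();
    let mut features2 = vk::PhysicalDeviceFeatures2::default().push_next(&mut fp8);
    unsafe { instance.get_physical_device_features2(phys, &mut features2) };
    fp8.shader_float8_cooperative_matrix == vk::TRUE
}
```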
Build from source:

```bash
cargo build --release   # Rust 1.85+, Vulkan headers required
```

All numbers on AMD Radeon RX 9070 XT (gfx1201, RDNA4), Mesa 26.1-rc3 RADV unless noted. Full tables with power data and methodology in docs/BENCHMARKS.md.
| Engine | Model | Format | Backend | Decode tok/s | tok/s/W |
|---|---|---|---|---|---|
| VF v0.3.9 | 8B Llama | Q4_K_M | Vulkan | 121 | 0.58 |
| llama.cpp | 8B Llama | Q4_K_M | Vulkan | 114 | 0.37 |
| llama.cpp | 8B Llama | Q4_K_M | ROCm | 94 | 0.30 |
The VF v0.3.9 row above is `vulkanforge bench tg128` (1-token prompt, constant-low KV). The 15-prompt suite is a real-workload mix (smoke / code / prose / reasoning / context-stress / numerics / tokenizer) with generations up to 1024 tokens; KV grows during decode, so steady-state numbers are below tg128. Measured on Mesa 26.1.0 RADV.
| Model | Prefill avg | Decode avg | Avg W | tok/s/W | Quality |
|---|---|---|---|---|---|
| Qwen3-8B Q4_K_M | 701 t/s ² | 104.0 t/s | 258 W | 0.40 | 15/15 ✓ |
| Llama-3.1-8B Q4_K_M | 824 t/s | 110.0 t/s | 271 W | 0.41 | 15/15 ✓ |
| Qwen3-8B FP8 | 559 t/s | 60.7 t/s | 193 W | 0.32 | 15/15 ✓ |
| Gemma-4-E2B-it (FP32 SafeTensors) | 96 t/s ¹ | 33.7 t/s | 64 W | 0.53 | 15/15 ✓ |
| Gemma-4-E2B-it (Q4_K on-load³) | 106 t/s | 52.0 t/s | 37 W | 1.39 | 15/15 ✓ |
¹ v0.3.15 lifted the v0.3.14 `force_per_token_prefill` workaround (33 → 89 → 96 t/s on v0.3.14 / v0.3.15 / v0.3.16). The batch path is bit-identical to the per-token reference (`VULKANFORGE_FORCE_PER_TOKEN=1` keeps the v0.3.14 path available as a bisect fallback). Decode is on par with the larger models on a tok/s/W basis (the best in the test) thanks to the 2 B parameter count keeping power draw at 64 W.
² v0.3.16 closes the v0.3.15 Sprint-46H barrier regression on Owner-only models (Qwen3, Llama). The Q-side barriers are now gated on the Gemma-4 subscriber predicate; Qwen3-Q4_K_M prefill recovers from 638 to 701 t/s (+9.9 %).
³ v0.3.17 adds on-the-fly Q4_K quantization at model load (`VF_QUANTIZE_ON_LOAD=1`). Gemma-4 SafeTensors weights are quantized FP32 → Q4_K_M on the CPU (rayon-parallelized; the load completes in 13.2 s) and routed through the existing Q4_K shader pipeline. Decode +54 %, power −41 %, tok/s/W 1.39, the best in the suite. VRAM 8.51 → 2.49 GiB (7.1× compression on the quantized tensors; norms / embeddings stay FP32). Coherence is identical to the FP32 baseline.
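A hedged sketch of the load-time shape of that path: a rayon fan-out over tensors feeding a simplified absmax 4-bit block packer (the real Q4_K_M super-block layout with hierarchical scales is more involved, and the names here are illustrative):

```rust
use rayon::prelude::*;

/// Simplified absmax 4-bit block quantizer. NOT the real Q4_K super-block
/// format; it only shows the FP32 -> packed-nibbles-plus-scale shape.
fn quantize_block_4bit(block: &[f32]) -> (Vec<u8>, f32) {
    let absmax = block.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let scale = if absmax > 0.0 { absmax / 7.0 } else { 1.0 };
    let q = |v: f32| ((v / scale).round().clamp(-8.0, 7.0) as i8 + 8) as u8;
    let nibbles = block
        .chunks(2)
        .map(|pair| q(pair[0]) | (q(*pair.get(1).unwrap_or(&0.0)) << 4))
        .collect();
    (nibbles, scale)
}

/// Quantize every (hypothetical) flattened weight tensor in parallel at
/// load time; norms / embeddings would be skipped and stay FP32.
fn quantize_on_load(tensors: Vec<(String, Vec<f32>)>) -> Vec<(String, Vec<(Vec<u8>, f32)>)> {
    tensors
        .into_par_iter()
        .map(|(name, w)| (name, w.chunks(32).map(quantize_block_4bit).collect()))
        .collect()
}
```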
| Model | Scale type | VulkanForge prefill (tok/s) | vLLM 0.20.1 ROCm* prefill (tok/s) |
|---|---|---|---|
| Llama-3.1-8B FP8 | per-tensor | 1130 | 14757 |
| Qwen2.5-14B FP8 | per-channel | 428 | (n/a) |
| Qwen3-8B FP8 | block-wise [128,128] | 1118 | 2776 |
* vLLM 0.20.1 is not optimized for gfx1201; model load logs `Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal!`. Per-tensor uses `ROCmFP8ScaledMMLinearKernel` (specialized); block-wise uses `TritonFp8BlockScaledMMKernel` (untuned). Run with `VLLM_ROCM_USE_AITER=0` and `--enforce-eager` (the only working configuration on RDNA4).
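The three scale layouts in the table above differ only in how the dequant scale is looked up per weight; a minimal sketch with illustrative names (not VulkanForge's actual types):

```rust
// Illustrative only: how a per-tensor, per-channel, or block-wise [128,128]
// weight_scale tensor is indexed during FP8 dequantization.
enum Fp8Scale {
    PerTensor(f32),
    PerChannel(Vec<f32>),   // one scale per output row
    BlockWise { scales: Vec<f32>, block: (usize, usize), blocks_per_row: usize },
}

fn weight_scale(s: &Fp8Scale, row: usize, col: usize) -> f32 {
    match s {
        Fp8Scale::PerTensor(v) => *v,
        Fp8Scale::PerChannel(rows) => rows[row],
        Fp8Scale::BlockWise { scales, block, blocks_per_row } => {
            // Row-major grid of [128, 128] tiles, one scale per tile.
            scales[(row / block.0) * blocks_per_row + col / block.1]
        }
    }
}
```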
| Model | VulkanForge (tok/s @ Avg W) | vLLM 0.20.1 ROCm (tok/s @ Avg W) | VF tok/s/W gain |
|---|---|---|---|
| Llama-3.1-8B FP8 | 70 t/s @ 166 W | 53 t/s @ 159 W | +27 % |
| Qwen3-8B FP8 | 62 t/s @ 125 W | 22 t/s @ 167 W | +267 % |
VF wins decode 1.3–2×; vLLM wins prefill 2.5–12×. Pick the engine that fits the workload — single-user chat is VulkanForge, batch serving is vLLM.
`VF_CPU_LM_HEAD=1` moves the vocabulary projection onto the CPU as Q6_K, freeing ~970 MB of VRAM. A hand-tuned AVX-512 kernel does the GEMV (Zen 4 / Ice Lake+ detected at runtime; scalar fallback otherwise).
| Model | GPU `lm_head` | CPU `lm_head` (AVX-512) | VRAM saved | Verdict |
|---|---|---|---|---|
| Llama-3.1-8B-FP8 | 70 tok/s | 47.6 tok/s | −970 MB | use for VRAM, not speed |
| Qwen2.5-14B-FP8 | 13.5 tok/s | 17.8 tok/s (+32 %) | −970 MB | CPU wins both axes |
The 14B win is structural: the GPU `lm_head` GEMV is bandwidth-bound on the 644 GB/s VRAM, and offloading it lets DDR5 (32 threads, L3 → DDR5-5600 at 76 GB/s) carry that work in parallel while the rest of the GPU pipeline is freed up. Combined with the 970 MB VRAM saving, the flag is a default-on candidate for 14B FP8 on Zen 4.
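The runtime-dispatch pattern is the standard one; here is a minimal sketch with a plain f32 dot product standing in for the hand-tuned Q6_K GEMV (AVX-512 intrinsics require a recent stable Rust):

```rust
// Sketch only: runtime AVX-512 detection with a scalar fallback, the same
// shape as the CPU lm_head path but without the Q6_K dequantization.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn dot_avx512(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let mut acc = _mm512_setzero_ps();
    let chunks = a.len() / 16;
    for i in 0..chunks {
        let va = _mm512_loadu_ps(a.as_ptr().add(i * 16));
        let vb = _mm512_loadu_ps(b.as_ptr().add(i * 16));
        acc = _mm512_fmadd_ps(va, vb, acc); // 16 fused multiply-adds per step
    }
    let mut lanes = [0.0f32; 16];
    _mm512_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    for i in chunks * 16..a.len() {
        sum += a[i] * b[i]; // scalar tail
    }
    sum
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") {
            return unsafe { dot_avx512(a, b) };
        }
    }
    a.iter().zip(b).map(|(x, y)| x * y).sum() // scalar fallback
}
```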
| Feature | Flag | Requires | Effect |
|---|---|---|---|
| FP8 model loading | `VULKANFORGE_ENABLE_FP8=1` | Mesa 26.1+ (or 26.0.6 BF16 path) | Load HuggingFace FP8 SafeTensors |
| Native FP8 WMMA | (auto) | `shaderFloat8CooperativeMatrix` (Mesa 26.1+) | +45–58 % FP8 prefill |
| CPU `lm_head` offload | `VF_CPU_LM_HEAD=1` | AVX-512F + BW + VL (Zen 4 / Ice Lake+) | −970 MB VRAM, 14B +32 % decode |
| On-the-fly Q4_K | `VF_QUANTIZE_ON_LOAD=1` | SafeTensors model with FP32 / BF16 weights | Quantize 2D weights to Q4_K_M at load; ~7× VRAM compression on quantized tensors, routes through the Q4_K shader pipeline. Gemma-4-E2B: decode +54 %, power −41 %, tok/s/W 1.39 |
| FP8 KV-cache | `VULKANFORGE_KV_FP8=1` | Mesa 26.1+ (heterogeneous `head_dim` auto-handled) | −50 % KV-cache VRAM. Gemma-4-26B-A4B: 880 → 440 MB |
| Expert-grouped MoE prefill (v0.4.5) | `VF_MOE_GROUPED=1` | MoE model (Gemma-4-26B-A4B et al.) | +43 % prefill on Gemma-4-26B-A4B Q3_K_M (65 → 93 t/s). Per-MoE-layer dispatches ~800 → ~450 |
| Batched MoE decode (v0.4.5) | `VF_MOE_BATCHED_DECODE=1` | MoE model | +4.8 % decode on Gemma-4-26B-A4B Q3_K_M (27.3 → 28.6 t/s). 800 → 450 MoE dispatches/token |
| VRAM budget probe (v0.4.5) | (auto) | Linux sysfs `/sys/class/drm/card*/device/mem_info_vram_*` | Diagnostic. Warns when free VRAM < `VF_VRAM_HEADROOM_GIB` (default 1.0) |
| Tensor-load progress bar (v0.4.5) | (auto, suppress with `VF_NO_LOAD_PROGRESS=1`) | — | `\r`-overwritten stderr progress bar during GGUF / SafeTensors upload |
All features are opt-in. Without flags, VulkanForge runs GGUF models on Mesa 26.1+ with no special configuration. `VF_FP8=auto` picks the right FP8 path based on what the driver actually advertises.
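Illustratively, the auto cascade boils down to a small decision over the model format and the driver capability (sketch only, not VulkanForge's actual routing code):

```rust
// Illustrative-only cascade for the VF_FP8=auto case.
enum Fp8Path {
    NativeWmma,   // Mesa 26.1+: shaderFloat8CooperativeMatrix advertised
    Bf16Fallback, // Mesa 26.0.6: FP8 weights through the BF16 conversion path
    Disabled,     // GGUF default path, no FP8 involved
}

fn pick_fp8_path(is_fp8_model: bool, driver_has_fp8_coopmat: bool) -> Fp8Path {
    match (is_fp8_model, driver_has_fp8_coopmat) {
        (false, _) => Fp8Path::Disabled,
        (true, true) => Fp8Path::NativeWmma,
        (true, false) => Fp8Path::Bf16Fallback,
    }
}
```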
The deterministic 15-prompt suite (greedy decoding, temperature = 0), run on every production configuration (six 8B+ paths plus Gemma-4-E2B SafeTensors):
| Configuration | Coherent | Median decode |
|---|---|---|
| Qwen3-8B Q4_K_M GGUF | 15/15 | 107 tok/s |
| Llama-3.1-8B Q4_K_M GGUF | 15/15 | 112 tok/s |
| Qwen3-8B-FP8 native WMMA + activation quant | 15/15 | 62 tok/s |
| Qwen2.5-14B-FP8 native WMMA + CPU lm_head | 15/15 | 17 tok/s |
| Llama-3.1-8B-FP8 native WMMA | 15/15 | 70 tok/s |
| Llama-3.1-8B-FP8 native WMMA + CPU lm_head | 15/15 | 46 tok/s |
| Gemma-4-E2B-it SafeTensors (v0.3.14, new) | 14/15 | 34 tok/s |
104 / 105 prompts (99 %) coherent across the full suite. v0.3.14 adds the Gemma-4-E2B-it SafeTensors path at 14/15 coherent; the single miss is the emoji-identification prompt, where the Gemma-4 tokenizer's surrogate-pair handling drops the input emojis before the model sees them.
v0.3.11 closes the v0.3.10 Llama-FP8 per-tensor edge case (2/15 code-gen prompts collapsing to `!`) by porting the Sprint 39 per-block activation-absmax + rescale pattern to the per-tensor WMMA path. The fix costs ~5 % on prefill (1197 → 1130 tok/s on 8B-FP8 pp=512), an unavoidable trade for keeping post-RMS-norm activations inside the FP8 E4M3 ±448 envelope.
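The rescale itself is small; a hedged scalar sketch of the pattern (the FP8 cast is omitted and the real work happens in the WMMA shader, so treat the names as illustrative):

```rust
// Per-block activation absmax + rescale so post-RMS-norm activations fit
// inside the FP8 E4M3 representable range of roughly ±448.
const E4M3_MAX: f32 = 448.0;

/// Returns the rescaled block (ready for the FP8 cast) and the per-block
/// scale that the matmul epilogue multiplies back in.
fn rescale_block_for_e4m3(block: &[f32]) -> (Vec<f32>, f32) {
    let absmax = block.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let scale = if absmax > 0.0 { absmax / E4M3_MAX } else { 1.0 };
    let rescaled = block
        .iter()
        .map(|&v| (v / scale).clamp(-E4M3_MAX, E4M3_MAX))
        .collect();
    (rescaled, scale)
}
```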
| Mesa version | Capabilities |
|---|---|
| 26.1+ | Default. Native FP8 WMMA via `shaderFloat8CooperativeMatrix` |
| 26.0.6 | Legacy. GGUF + FP8 SafeTensors via the BF16 conversion path (no native FP8 WMMA) |
For 14B+ models, set `amdgpu.lockup_timeout=10000,10000` on the kernel command line; the default 2 s compute timeout is too short for long prefill submits. Setup details and troubleshooting in docs/INSTALLATION.md.
- docs/BENCHMARKS.md — full VulkanForge vs llama.cpp vs vLLM tables, power data, methodology
- docs/INSTALLATION.md — Mesa setup, kernel parameter, environment variables, troubleshooting
- docs/MODELS.md — supported GGUF / FP8 formats, model architectures, FP8 scaling strategies
- CHANGELOG.md — release history with per-sprint performance deltas
```bash
vulkanforge chat --model <PATH> [--tokenizer-from <GGUF>] ...
vulkanforge bench --model <PATH> [--tokenizer-from <GGUF>] [--runs N]
vulkanforge serve --model <PATH> [--host 127.0.0.1] [--port 8080] [--cors]
```

`vulkanforge chat --help` lists every flag (sampling, max-tokens, think-filter, max-context). The chat REPL accepts `/help`, `/quit`, `/reset` (clear KV cache + history without reloading the model), and a single-shot mode via `VF_PROMPT="..."`.
OpenAI-compatible HTTP server. Drop-in backend for Open WebUI, SillyTavern, Continue.dev, the OpenAI Python SDK, LangChain, and any other client that speaks Chat Completions.
```bash
vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --port 8080
```
| Path | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat (streaming via SSE + sync JSON) |
| `/v1/models` | GET | List loaded model |
| `/health` | GET | Liveness + KV-cache status |

The non-prefixed aliases `/chat/completions` and `/models` are also routed for clients that omit the `/v1/` prefix.
```bash
# Non-streaming
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-q4_k_m",
    "messages": [{"role": "user", "content": "What is a mutex?"}],
    "max_tokens": 100
  }'

# Streaming (SSE)
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-q4_k_m",
    "messages": [{"role": "user", "content": "Hi"}],
    "stream": true,
    "stream_options": {"include_usage": true},
    "max_tokens": 50
  }'
```

| Flag | Default | Notes |
|---|---|---|
| `--host` | `127.0.0.1` | Use `0.0.0.0` for remote/Docker (no auth in v0.4) |
| `--port` | `8080` | TCP listen port |
| `--cors` | off | Enable CORS (browser UIs on different ports) |
| `--ctx-size` | `2048` | KV-cache capacity in tokens |
| `--served-model-name` | basename | Override the model id reported by `/v1/models` |
| `--tokenizer-from` | — | Reserved for SafeTensors serve (v0.4.1) |
- In: Streaming + non-streaming chat, `frequency_penalty` → repetition-penalty mapping, `stream_options.include_usage`, `developer` role alias for `system`, `chat_template_kwargs.enable_thinking` toggle for `<think>` filtering.
- Out (v0.4): Multi-turn history (system + user only), tool calling, vision content, embeddings, `/v1/completions`, auth, SafeTensors directory models (use `vulkanforge chat` for those).
- Single-stream only: no batch inference, no concurrent sessions on one `Forward` instance.
- Decode at 0.80–1.06× llama.cpp Vulkan (model-dependent); coopmat is prefill-only on this codebase.
- FP8 prefill structurally behind ROCm-specialized kernels (vLLM's `ROCmFP8ScaledMMLinearKernel` is in a different class).
- `vulkanforge bench` accepts only Q4_K_M GGUF; Q8_0 chat works but does not bench.
- Mistral / Llama-2 SPM tokenizer not yet wired for FP8 SafeTensors (only the `gpt2` tokenizer family).
For the full architectural notes and the v0.2.x optimization audit (nine falsified hypotheses against the residual gap to llama.cpp), see CHANGELOG.md.
VulkanForge is licensed under the GNU General Public License v3.0.