Local LLM inference on consumer GPUs and Apple Silicon — no ROCm, no MLX, one binary.
35B parameter model running locally — Zig + Vulkan/Metal, no ROCm, no MLX
| Platform | GPU | Backend | Status |
|---|---|---|---|
| Linux | AMD RDNA4 (RX 9070, AI PRO R9700) | Vulkan | Primary — hand-tuned shaders |
| Linux | AMD RDNA3 (RX 7900 XTX, etc.) | Vulkan | Supported |
| Linux | Intel Arc Xe2 / Battlemage | Vulkan | Experimental bring-up |
| macOS | Apple Silicon (M1, M2, M3, M4, M5) | Metal | Supported — native MSL shaders |
ZINC focuses on current local-inference models people are actively running: Qwen 3.5/3.6 and Gemma 4 today, with a managed catalog that stays narrow on purpose. Older Llama/Mistral/Gemma generations may work eventually, but broad legacy-model coverage is not the main optimization target.
Latest checked-in benchmark artifact, same machine, same weights, same prompt:
| Platform | Compared models | Decode vs llama.cpp | Prefill vs llama.cpp | Read this as |
|---|---|---|---|---|
| AMD RDNA4 / Vulkan | 5 | 99% avg, 2 model wins | 24% avg | Decode is close; prefill is the active gap |
| Apple Silicon / Metal | 5 | 62% avg, 1 model win | 58% avg, 2 model wins | Mixed by model; tuning is still uneven |
| Intel Arc / Vulkan | 1 | 84% | 152% | Experimental, not full-catalog coverage yet |
Full per-model numbers are in Benchmarks and on the public dashboard: zolotukhin.ai/zinc/benchmarks.
Works the same on Linux (AMD GPU) and macOS (Apple Silicon):
git clone https://github.com/zolotukhin/zinc.git
cd zinc
zig build -Doptimize=ReleaseFast
# On RDNA4 Linux, enable cooperative matrix
export RADV_PERFTEST=coop_matrix # skip on macOS
# Verify GPU, shaders, and runtime
./zig-out/bin/zinc --check
# See which models fit this machine
./zig-out/bin/zinc model list
# Download a model
./zig-out/bin/zinc model pull qwen35-9b-q4k-m
# Run a prompt (--chat applies the model's chat template for instruct models)
./zig-out/bin/zinc --model-id qwen35-9b-q4k-m --prompt "Hello" --chat
# Or open the chat UI in your browser
./zig-out/bin/zinc chatThe server exposes the built-in chat UI at / and an OpenAI-compatible API at /v1.
ZINC is usable today as a local, single-user inference engine for the validated models listed below.
| Area | What you can do today |
|---|---|
| Run models | Use the CLI for single-stream inference on supported GGUF models |
| Chat | Start the built-in browser UI with zinc chat, including streaming and thinking-mode display |
| API | Serve OpenAI-compatible /v1 endpoints with streaming responses |
| Models | Manage catalog models with list, pull, use, active, and rm |
| AMD GPUs | Run the Vulkan backend with RDNA-tuned wave64, cooperative-matrix, and fused-op shaders |
| Apple Silicon | Run the native Metal backend with MSL shaders, zero-copy mmap, and simdgroup ops |
| Setup | Let ZINC select the available backend at build time: Vulkan on Linux, Metal on macOS |
- Continuous batching and multi-tenant serving are still roadmap work
- The supported-model list is intentionally narrow
- Apple Silicon performance tuning is ongoing (RDNA4 path is more mature)
Consumer GPUs have the hardware for fast LLM inference — bandwidth, compute, VRAM — but the software doesn't use it:
- AMD RDNA3/RDNA4: ROCm doesn't support them. vLLM requires ROCm. llama.cpp's Vulkan path has no RDNA-specific tuning. These $500–1500 cards sit idle.
- Apple Silicon: MLX and llama.cpp Metal work, but leave performance on the table. No engine is built from scratch around Metal's strengths (unified memory, simdgroup ops, zero-copy mmap).
ZINC builds an inference engine tuned for the hardware you actually have.
Hand-tuned shaders for each platform. On AMD: wave64, cooperative matrix, architecture-aware tiling via Vulkan compute. On Apple Silicon: native MSL kernels with simdgroup reductions, zero-copy model loading, and Metal pipeline tuning. Not a generic backend that happens to run — built to extract real performance from each GPU.
One binary, no driver stack. No ROCm, no CUDA, no Python. Build with Zig, point at a GGUF, run inference. The right backend (Vulkan or Metal) is selected automatically at build time.
Drop-in compatible. OpenAI-compatible API, built-in chat UI, managed model catalog. Point your existing client at it and it works.
The list below matches the current managed model catalog, not a broader wishlist.
-
Qwen 3.5 9B Q4_K_M — supported on AMD RDNA4 16/32 GB, Apple Silicon, and Intel Arc
-
Qwen3.6 35B-A3B UD Q4_K_XL — supported on AMD RDNA4 32 GB and Apple Silicon
-
Qwen3.6 27B Dense Q4_K_M — experimental on AMD RDNA4 32 GB and Apple Silicon
-
Gemma 4 31B Q4_K_M — supported on AMD RDNA4 32 GB and Apple Silicon
-
Gemma 4 26B-A4B MoE Q4_K_M — supported on AMD RDNA4 32 GB and Apple Silicon
-
Use
zinc model list --jsonfor machine-readable model metadata -
Current throughput and latency numbers live on the public benchmarks page: zolotukhin.ai/zinc/benchmarks
Quantization formats: Q4_K, Q5_K, Q6_K, Q8_0, Q5_0, MXFP4, F16, F32
| Tool | Install |
|---|---|
| Zig 0.15.2+ | ziglang.org/download |
| Vulkan loader + tools | apt install libvulkan-dev vulkan-tools (Linux) or brew install vulkan-loader vulkan-headers (macOS) |
glslc on Linux |
apt install glslc |
| Bun for tests and the docs site | curl -fsSL https://bun.sh/install | bash |
Important: On Linux with RDNA4, newer glslc releases can cause a large regression. Use the system package version.
git clone https://github.com/zolotukhin/zinc.git
cd zinc
# Build the CLI and server
# macOS: shaders are skipped
# Linux: shaders are compiled automatically
zig build -Doptimize=ReleaseFastThe binary is placed in zig-out/bin/zinc. Compiled SPIR-V shaders go to zig-out/share/zinc/shaders/.
Use ReleaseFast for any performance measurement or server deployment. Plain zig build is not a fair throughput baseline.
Before your first prompt, run --check. The target state is a clean READY [OK] run with no warnings.
# General machine + Vulkan + shader preflight
./zig-out/bin/zinc --check
# Recommended on RDNA4 before measuring performance
export RADV_PERFTEST=coop_matrix
./zig-out/bin/zinc --check
# Check one exact GGUF file
./zig-out/bin/zinc --check -m /path/to/model.gguf
# Check one managed catalog model by id
./zig-out/bin/zinc --check --model-id qwen36-35b-a3b-q4k-xl--check verifies:
- host environment and RDNA4-specific shell hints
- compiled shader assets
- Vulkan device discovery and the selected GPU
- GGUF metadata when you pass
-m /path/to/model.gguf - managed-model compatibility when you pass
--model-id <id> - estimated single-GPU VRAM fit for the current runtime
If --check reports warnings, treat them as setup work to finish before judging runtime behavior. For the full walkthrough, see Running ZINC and Hardware requirements.
The README keeps the supported-model section concise and leaves the full managed-model workflow to the docs.
Use these for model selection, cache management, and API details:
./zig-out/bin/zinc -m /path/to/model.gguf --prompt "The capital of France is"Start the server — no --prompt flag means server mode:
./zig-out/bin/zinc -m /path/to/model.gguf -p 8080Then open http://localhost:8080/ in your browser for the built-in chat interface.
ZINC exposes an OpenAI-compatible API at /v1.
For the actual request examples and SDK usage, use the website docs instead of the README:
- Running ZINC for CLI, server mode, and first-run examples
- Serving HTTP API for
curl, OpenAI SDK examples, endpoint behavior, and response shapes
The built-in chat UI is served at /, the API is under /v1, and the health endpoint is /health.
For building, testing, debugging, benchmarking, graph export, and contributing — see the Development Guide (web version).
Quick start:
zig build -Doptimize=ReleaseFast # build
zig build test # run all tests
./zig-out/bin/zinc --check # verify GPU/runtime setupSee also: CONTRIBUTING.md · Code of Conduct
The tables below are pulled directly from the latest published artifact at zolotukhin.ai/zinc/benchmarks. Latest refresh: 2026-06-01 UTC. Numbers are median tok/s across the suite's runs on a fresh boot, ZINC and llama.cpp on the same hardware, weights, and prompt.
| Model | ZINC prefill | llama.cpp prefill | ZINC % | ZINC decode | llama.cpp decode | ZINC % |
|---|---|---|---|---|---|---|
| Qwen 3.6 35B A3B UD Q4_K_XL | 154.27 | 398.82 | 39% | 127.90 | 108.52 | 118% |
| Qwen 3.5 9B Q4_K_M | 100.79 | 548.94 | 18% | 95.39 | 85.51 | 112% |
| Gemma 4 26B-A4B MoE Q4_K_M | 89.10 | 497.08 | 18% | 89.73 | 102.00 | 88% |
| Qwen 3.6 27B Dense Q4_K_M | 49.00 | 185.14 | 26% | 28.47 | 30.65 | 93% |
| Gemma 4 31B Q4_K_M | 41.64 | 201.97 | 21% | 24.65 | 28.55 | 86% |
| Model | ZINC prefill | llama.cpp prefill | ZINC % | ZINC decode | llama.cpp decode | ZINC % |
|---|---|---|---|---|---|---|
| Gemma 4 31B Q4_K_M | 132.60 | 84.47 | 157% | 21.86 | 23.30 | 94% |
| Qwen 3.6 35B A3B UD Q4_K_XL | 97.40 | 151.50 | 64% | 81.07 | 73.09 | 111% |
| Qwen 3.6 27B Dense Q4_K_M | 9.60 | 27.87 | 34% | 8.93 | 23.32 | 38% |
| Qwen 3.5 9B Q4_K_M | 23.20 | 90.65 | 26% | 23.21 | 66.52 | 35% |
| Gemma 4 26B-A4B MoE Q4_K_M | 34.00 | 365.45 | 9% | 30.01 | 88.44 | 34% |
- Ahead of llama.cpp: RDNA4 decode for Qwen 3.6 35B-A3B (1.18x) and Qwen 3.5 9B (1.12x); Metal decode for Qwen 3.6 35B-A3B (1.11x); Metal prefill for Gemma 4 31B (1.57x).
- Within striking distance (>=90% of llama.cpp): RDNA4 decode for Qwen 3.6 27B dense (93%); Metal decode for Gemma 4 31B (94%).
- Active gap: RDNA4 prefill is still the primary gap. Qwen 3.6 35B-A3B reaches 154.27 tok/s, but llama.cpp is 398.82 tok/s on the same box, so structural batched-prefill work remains the main unlock.
- In flight: Metal is mixed by model. Gemma 4 31B prefill and Qwen 3.6 35B decode are ahead of llama.cpp, while Qwen 3.5/3.6 dense and Gemma 4 26B still need backend-specific tuning.
For local benchmark commands, harnesses, and methodology, see:
| Component | Status |
|---|---|
| Vulkan infrastructure | Done |
| GGUF parser + model loader | Done |
| GPU detection (RDNA3/4) | Done |
| Native BPE tokenizer (from GGUF) | Done |
| GLSL compute shaders (16) | Done |
| Compute graph + architecture builders | Done |
| Forward pass (decode loop) | Working — 127.90 tok/s on RDNA4 and 33.86 tok/s on Apple M4 Max for Qwen 3.6 35B-A3B |
| Forward pass (prefill loop) | Working — 154.27 tok/s on RDNA4 short-context for Qwen 3.6 35B-A3B; Metal prefill is fast on Qwen 3 8B and Gemma 4 31B but uneven across the catalog |
| GPU SSM shaders + cmd batching | Done — RDNA decode is 127.90 tok/s on Qwen 3.6 35B |
| HTTP server + OpenAI API | Done — Qwen 35B-A3B raw API ~100 tok/s on RDNA4 and Metal server path in progress |
| Continuous batching | Phase 4 |
| TurboQuant KV compression | Phase 5 |
Validated on AMD Radeon AI PRO R9700 (RDNA4): Vulkan 1.3 init, GGUF parsing, 21 GB model loaded to VRAM, 723-node MoE graph built, coherent inference output verified against CPU reference.
The next push is closing the prefill gap to llama.cpp on hybrid MoE-plus-SSM models:
- Wire
mul_mm_q4kinto SSM proj prefill — the tiled Q4_K GEMM is in the tree but only routes the language-model head where N=1 wastes the BN tile. The SSM proj fires 4 DMMVs per layer per token; batching them across the prompt is the deferred cycle-40 refactor. - Port the
gated_delta_net.cublock-resident state pattern — today every prompt token re-reads and re-writes the full 2 MB SSM state per layer. Loading state once per workgroup and walking all tokens inside the kernel collapses 18 GB of state DRAM traffic per prefill to 4 MB. - Open
canUseBatchedPrefillRdnafor MoE+SSM hybrids — the entire batched prefill body (flash_attn_batched,rope_batched,dmmv_q4k_batch_kpar) is gated off whenn_experts > 0orssm_d_inner > 0. Once items 1 and 2 land, dropping the gate activates Br-row attention batching on the same workload. - Land the cycle-50 micro-restructure pattern on MoE inner loops — wider threads-per-row plus halved per-thread register slabs lifted ssm_delta_net by 2.7%. The same shape change is untried on
dmmv_q4k_moe_kparanddmmv_q4k_moe_fused_down_acc. - Ship batched Metal prefill across the catalog — Qwen 3 8B and Gemma 4 31B now show the fast path, but Qwen 3.6 27B/35B and Gemma 4 26B still need the same treatment.
The full plan and 50-cycle field report is in the cycle-50 blog post.
MIT