ZINC — Zig INferenCe Engine

Local LLM inference on consumer GPUs and Apple Silicon — no ROCm, no MLX, one binary.

35B parameter model running locally — Zig + Vulkan/Metal, no ROCm, no MLX

Supported Platforms

Platform	GPU	Backend	Status
Linux	AMD RDNA4 (RX 9070, AI PRO R9700)	Vulkan	Primary — hand-tuned shaders
Linux	AMD RDNA3 (RX 7900 XTX, etc.)	Vulkan	Supported
Linux	Intel Arc Xe2 / Battlemage	Vulkan	Experimental bring-up
macOS	Apple Silicon (M1, M2, M3, M4, M5)	Metal	Supported — native MSL shaders

ZINC focuses on current local-inference models people are actively running: Qwen 3.5/3.6 and Gemma 4 today, with a managed catalog that stays narrow on purpose. Older Llama/Mistral/Gemma generations may work eventually, but broad legacy-model coverage is not the main optimization target.

Status vs llama.cpp

Latest checked-in benchmark artifact, same machine, same weights, same prompt:

Platform	Compared models	Decode vs llama.cpp	Prefill vs llama.cpp	Read this as
AMD RDNA4 / Vulkan	5	99% avg, 2 model wins	24% avg	Decode is close; prefill is the active gap
Apple Silicon / Metal	5	62% avg, 1 model win	58% avg, 2 model wins	Mixed by model; tuning is still uneven
Intel Arc / Vulkan	1	84%	152%	Experimental, not full-catalog coverage yet

Full per-model numbers are in Benchmarks and on the public dashboard: zolotukhin.ai/zinc/benchmarks.

Start Here

Works the same on Linux (AMD GPU) and macOS (Apple Silicon):

git clone https://github.com/zolotukhin/zinc.git
cd zinc
zig build -Doptimize=ReleaseFast

# On RDNA4 Linux, enable cooperative matrix
export RADV_PERFTEST=coop_matrix  # skip on macOS

# Verify GPU, shaders, and runtime
./zig-out/bin/zinc --check

# See which models fit this machine
./zig-out/bin/zinc model list

# Download a model
./zig-out/bin/zinc model pull qwen35-9b-q4k-m

# Run a prompt (--chat applies the model's chat template for instruct models)
./zig-out/bin/zinc --model-id qwen35-9b-q4k-m --prompt "Hello" --chat

# Or open the chat UI in your browser
./zig-out/bin/zinc chat

The server exposes the built-in chat UI at / and an OpenAI-compatible API at /v1.

What Works Today

ZINC is usable today as a local, single-user inference engine for the validated models listed below.

Area	What you can do today
Run models	Use the CLI for single-stream inference on supported GGUF models
Chat	Start the built-in browser UI with `zinc chat`, including streaming and thinking-mode display
API	Serve OpenAI-compatible `/v1` endpoints with streaming responses
Models	Manage catalog models with `list`, `pull`, `use`, `active`, and `rm`
AMD GPUs	Run the Vulkan backend with RDNA-tuned wave64, cooperative-matrix, and fused-op shaders
Apple Silicon	Run the native Metal backend with MSL shaders, zero-copy mmap, and simdgroup ops
Setup	Let ZINC select the available backend at build time: Vulkan on Linux, Metal on macOS

Still Rough

Continuous batching and multi-tenant serving are still roadmap work
The supported-model list is intentionally narrow
Apple Silicon performance tuning is ongoing (RDNA4 path is more mature)

The Problem

Consumer GPUs have the hardware for fast LLM inference — bandwidth, compute, VRAM — but the software doesn't use it:

AMD RDNA3/RDNA4: ROCm doesn't support them. vLLM requires ROCm. llama.cpp's Vulkan path has no RDNA-specific tuning. These $500–1500 cards sit idle.
Apple Silicon: MLX and llama.cpp Metal work, but leave performance on the table. No engine is built from scratch around Metal's strengths (unified memory, simdgroup ops, zero-copy mmap).

The Solution

ZINC builds an inference engine tuned for the hardware you actually have.

Hand-tuned shaders for each platform. On AMD: wave64, cooperative matrix, architecture-aware tiling via Vulkan compute. On Apple Silicon: native MSL kernels with simdgroup reductions, zero-copy model loading, and Metal pipeline tuning. Not a generic backend that happens to run — built to extract real performance from each GPU.

One binary, no driver stack. No ROCm, no CUDA, no Python. Build with Zig, point at a GGUF, run inference. The right backend (Vulkan or Metal) is selected automatically at build time.

Drop-in compatible. OpenAI-compatible API, built-in chat UI, managed model catalog. Point your existing client at it and it works.

Supported Models

The list below matches the current managed model catalog, not a broader wishlist.

Qwen 3.5 9B Q4_K_M — supported on AMD RDNA4 16/32 GB, Apple Silicon, and Intel Arc
Qwen3.6 35B-A3B UD Q4_K_XL — supported on AMD RDNA4 32 GB and Apple Silicon
Qwen3.6 27B Dense Q4_K_M — experimental on AMD RDNA4 32 GB and Apple Silicon
Gemma 4 31B Q4_K_M — supported on AMD RDNA4 32 GB and Apple Silicon
Gemma 4 26B-A4B MoE Q4_K_M — supported on AMD RDNA4 32 GB and Apple Silicon
Use zinc model list --json for machine-readable model metadata
Current throughput and latency numbers live on the public benchmarks page: zolotukhin.ai/zinc/benchmarks

Quantization formats: Q4_K, Q5_K, Q6_K, Q8_0, Q5_0, MXFP4, F16, F32

Quick Start

Prerequisites

Tool	Install
Zig 0.15.2+	ziglang.org/download
Vulkan loader + tools	`apt install libvulkan-dev vulkan-tools` (Linux) or `brew install vulkan-loader vulkan-headers` (macOS)
`glslc` on Linux	`apt install glslc`
Bun for tests and the docs site	`curl -fsSL https://bun.sh/install \| bash`

Important: On Linux with RDNA4, newer glslc releases can cause a large regression. Use the system package version.

Build ZINC

git clone https://github.com/zolotukhin/zinc.git
cd zinc

# Build the CLI and server
# macOS: shaders are skipped
# Linux: shaders are compiled automatically
zig build -Doptimize=ReleaseFast

The binary is placed in zig-out/bin/zinc. Compiled SPIR-V shaders go to zig-out/share/zinc/shaders/. Use ReleaseFast for any performance measurement or server deployment. Plain zig build is not a fair throughput baseline.

Run a Preflight Check First

Before your first prompt, run --check. The target state is a clean READY [OK] run with no warnings.

# General machine + Vulkan + shader preflight
./zig-out/bin/zinc --check

# Recommended on RDNA4 before measuring performance
export RADV_PERFTEST=coop_matrix
./zig-out/bin/zinc --check

# Check one exact GGUF file
./zig-out/bin/zinc --check -m /path/to/model.gguf

# Check one managed catalog model by id
./zig-out/bin/zinc --check --model-id qwen36-35b-a3b-q4k-xl

--check verifies:

host environment and RDNA4-specific shell hints
compiled shader assets
Vulkan device discovery and the selected GPU
GGUF metadata when you pass -m /path/to/model.gguf
managed-model compatibility when you pass --model-id <id>
estimated single-GPU VRAM fit for the current runtime

If --check reports warnings, treat them as setup work to finish before judging runtime behavior. For the full walkthrough, see Running ZINC and Hardware requirements.

Choosing Models

The README keeps the supported-model section concise and leaves the full managed-model workflow to the docs.

Use these for model selection, cache management, and API details:

Run a Prompt

./zig-out/bin/zinc -m /path/to/model.gguf --prompt "The capital of France is"

Run the Server

Start the server — no --prompt flag means server mode:

./zig-out/bin/zinc -m /path/to/model.gguf -p 8080

Then open http://localhost:8080/ in your browser for the built-in chat interface.

Use the API

ZINC exposes an OpenAI-compatible API at /v1.

For the actual request examples and SDK usage, use the website docs instead of the README:

Running ZINC for CLI, server mode, and first-run examples
Serving HTTP API for curl, OpenAI SDK examples, endpoint behavior, and response shapes

The built-in chat UI is served at /, the API is under /v1, and the health endpoint is /health.

Development

For building, testing, debugging, benchmarking, graph export, and contributing — see the Development Guide (web version).

Quick start:

zig build -Doptimize=ReleaseFast   # build
zig build test                      # run all tests
./zig-out/bin/zinc --check          # verify GPU/runtime setup

See also: CONTRIBUTING.md · Code of Conduct

Architecture

Benchmarks

The tables below are pulled directly from the latest published artifact at zolotukhin.ai/zinc/benchmarks. Latest refresh: 2026-06-01 UTC. Numbers are median tok/s across the suite's runs on a fresh boot, ZINC and llama.cpp on the same hardware, weights, and prompt.

AMD RDNA4 — Radeon AI PRO R9700 (Vulkan)

Model	ZINC prefill	llama.cpp prefill	ZINC %	ZINC decode	llama.cpp decode	ZINC %
Qwen 3.6 35B A3B UD Q4_K_XL	154.27	398.82	39%	127.90	108.52	118%
Qwen 3.5 9B Q4_K_M	100.79	548.94	18%	95.39	85.51	112%
Gemma 4 26B-A4B MoE Q4_K_M	89.10	497.08	18%	89.73	102.00	88%
Qwen 3.6 27B Dense Q4_K_M	49.00	185.14	26%	28.47	30.65	93%
Gemma 4 31B Q4_K_M	41.64	201.97	21%	24.65	28.55	86%

Apple Silicon M4 Max (Metal)

Model	ZINC prefill	llama.cpp prefill	ZINC %	ZINC decode	llama.cpp decode	ZINC %
Gemma 4 31B Q4_K_M	132.60	84.47	157%	21.86	23.30	94%
Qwen 3.6 35B A3B UD Q4_K_XL	97.40	151.50	64%	81.07	73.09	111%
Qwen 3.6 27B Dense Q4_K_M	9.60	27.87	34%	8.93	23.32	38%
Qwen 3.5 9B Q4_K_M	23.20	90.65	26%	23.21	66.52	35%
Gemma 4 26B-A4B MoE Q4_K_M	34.00	365.45	9%	30.01	88.44	34%

Where we stand vs llama.cpp

Ahead of llama.cpp: RDNA4 decode for Qwen 3.6 35B-A3B (1.18x) and Qwen 3.5 9B (1.12x); Metal decode for Qwen 3.6 35B-A3B (1.11x); Metal prefill for Gemma 4 31B (1.57x).
Within striking distance (>=90% of llama.cpp): RDNA4 decode for Qwen 3.6 27B dense (93%); Metal decode for Gemma 4 31B (94%).
Active gap: RDNA4 prefill is still the primary gap. Qwen 3.6 35B-A3B reaches 154.27 tok/s, but llama.cpp is 398.82 tok/s on the same box, so structural batched-prefill work remains the main unlock.
In flight: Metal is mixed by model. Gemma 4 31B prefill and Qwen 3.6 35B decode are ahead of llama.cpp, while Qwen 3.5/3.6 dense and Gemma 4 26B still need backend-specific tuning.

For local benchmark commands, harnesses, and methodology, see:

Current Status

Component	Status
Vulkan infrastructure	Done
GGUF parser + model loader	Done
GPU detection (RDNA3/4)	Done
Native BPE tokenizer (from GGUF)	Done
GLSL compute shaders (16)	Done
Compute graph + architecture builders	Done
Forward pass (decode loop)	Working — 127.90 tok/s on RDNA4 and 33.86 tok/s on Apple M4 Max for Qwen 3.6 35B-A3B
Forward pass (prefill loop)	Working — 154.27 tok/s on RDNA4 short-context for Qwen 3.6 35B-A3B; Metal prefill is fast on Qwen 3 8B and Gemma 4 31B but uneven across the catalog
GPU SSM shaders + cmd batching	Done — RDNA decode is 127.90 tok/s on Qwen 3.6 35B
HTTP server + OpenAI API	Done — Qwen 35B-A3B raw API ~100 tok/s on RDNA4 and Metal server path in progress
Continuous batching	Phase 4
TurboQuant KV compression	Phase 5

Validated on AMD Radeon AI PRO R9700 (RDNA4): Vulkan 1.3 init, GGUF parsing, 21 GB model loaded to VRAM, 723-node MoE graph built, coherent inference output verified against CPU reference.

Next Steps

The next push is closing the prefill gap to llama.cpp on hybrid MoE-plus-SSM models:

Wire mul_mm_q4k into SSM proj prefill — the tiled Q4_K GEMM is in the tree but only routes the language-model head where N=1 wastes the BN tile. The SSM proj fires 4 DMMVs per layer per token; batching them across the prompt is the deferred cycle-40 refactor.
Port the gated_delta_net.cu block-resident state pattern — today every prompt token re-reads and re-writes the full 2 MB SSM state per layer. Loading state once per workgroup and walking all tokens inside the kernel collapses 18 GB of state DRAM traffic per prefill to 4 MB.
Open canUseBatchedPrefillRdna for MoE+SSM hybrids — the entire batched prefill body (flash_attn_batched, rope_batched, dmmv_q4k_batch_kpar) is gated off when n_experts > 0 or ssm_d_inner > 0. Once items 1 and 2 land, dropping the gate activates Br-row attention batching on the same workload.
Land the cycle-50 micro-restructure pattern on MoE inner loops — wider threads-per-row plus halved per-thread register slabs lifted ssm_delta_net by 2.7%. The same shape change is untried on dmmv_q4k_moe_kpar and dmmv_q4k_moe_fused_down_acc.
Ship batched Metal prefill across the catalog — Qwen 3 8B and Gemma 4 31B now show the fast path, but Qwen 3.6 27B/35B and Gemma 4 26B still need the same treatment.

The full plan and 50-cycle field report is in the cycle-50 blog post.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 4,283 Commits
.agents/skills		.agents/skills
.claude		.claude
.github		.github
.specify		.specify
assets		assets
benchmarks		benchmarks
docs		docs
loops		loops
research		research
scripts		scripts
site		site
specs		specs
src		src
tests		tests
tools		tools
writing		writing
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
build.zig		build.zig
build.zig.zon		build.zig.zon
flake.lock		flake.lock
flake.nix		flake.nix
gaps.html		gaps.html
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ZINC — Zig INferenCe Engine

Supported Platforms

Status vs llama.cpp

Start Here

What Works Today

Still Rough

The Problem

The Solution

Supported Models

Quick Start

Prerequisites

Build ZINC

Run a Preflight Check First

Choosing Models

Run a Prompt

Run the Server

Use the API

Development

Architecture

Benchmarks

AMD RDNA4 — Radeon AI PRO R9700 (Vulkan)

Apple Silicon M4 Max (Metal)

Where we stand vs llama.cpp

Current Status

Next Steps

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ZINC — Zig INferenCe Engine

Supported Platforms

Status vs llama.cpp

Start Here

What Works Today

Still Rough

The Problem

The Solution

Supported Models

Quick Start

Prerequisites

Build ZINC

Run a Preflight Check First

Choosing Models

Run a Prompt

Run the Server

Use the API

Development

Architecture

Benchmarks

AMD RDNA4 — Radeon AI PRO R9700 (Vulkan)

Apple Silicon M4 Max (Metal)

Where we stand vs llama.cpp

Current Status

Next Steps

License

About

Topics

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages