zLLM

A lightweight, high-performance LLM inference engine for macOS on Apple Silicon, built with Zig 0.16.0 and Metal.

Features

  • Native Metal Acceleration: Custom Metal Shading Language (MSL) kernels for optimized tensor operations (RMSNorm, Embedding, MatMul).
  • Zig 0.16 "Juicy Main": Leveraging the latest Zig standard library features including std.process.Init and the unified std.Io interface.
  • Zero-Copy Architecture: Uses mmap to map GGUF model files directly into Metal buffers, minimizing CPU-GPU data transfer overhead.
  • GGUF Support: Pure Zig parser for the GGUF file format, supporting metadata and tensor information extraction (see the header-parsing sketch after this list).
  • Quantization: Built-in support for GGUF quantization formats such as Q4_K and Q8_0, implemented directly in GPU shaders.
  • Zero Dependencies: Built entirely with Zig and macOS system frameworks (Foundation, Metal, QuartzCore). No heavy external libraries required.
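To make the GGUF parsing feature concrete, here is a minimal sketch of reading the fixed GGUF header fields per the GGUF spec. This is not the project's actual src/gguf.zig; the GgufHeader and parseHeader names are hypothetical.

const std = @import("std");

// GGUF stores these fields little-endian, immediately after
// the 4-byte ASCII magic "GGUF".
const GgufHeader = struct {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
};

fn parseHeader(bytes: []const u8) !GgufHeader {
    if (bytes.len < 24) return error.Truncated;
    if (!std.mem.eql(u8, bytes[0..4], "GGUF")) return error.BadMagic;
    return .{
        .version = std.mem.readInt(u32, bytes[4..8], .little),
        .tensor_count = std.mem.readInt(u64, bytes[8..16], .little),
        .metadata_kv_count = std.mem.readInt(u64, bytes[16..24], .little),
    };
}

The metadata key/value pairs and tensor descriptors follow this header; a full parser walks those next.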

Prerequisites

  • macOS (with Apple Silicon M1/M2/M3/M4 recommended)
  • Zig 0.16.0

Quick Start

1. Clone the repository

git clone https://github.com/jiacai2050/zllm.git
cd zllm

2. Download the Recommended Model

For the best experience (clean English output), download the Qwen2.5-0.5B-Instruct FP16 model:
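One way to fetch it, assuming the official Qwen/Qwen2.5-0.5B-Instruct-GGUF repository on Hugging Face and this exact filename:

huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-fp16.gguf --local-dir .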

3. Build and Run

Direct Prompt Mode

zig build -Doptimize=ReleaseFast
./zig-out/bin/zllm /path/to/qwen2.5-0.5b-instruct-fp16.gguf -p "What is the capital of France?"

Interactive Chat Mode

./zig-out/bin/zllm /path/to/qwen2.5-0.5b-instruct-fp16.gguf

Project Structure

  • src/main.zig: Entry point using the Zig 0.16 "Juicy Main" style.
  • src/gguf.zig: Pure Zig GGUF format parser.
  • src/metal.zig: Zig wrapper for the Metal C++ bridge.
  • src/engine.zig: Inference engine logic and memory management.
  • src/metal/:
    • bridge.mm: Objective-C++ bridge to Metal API.
    • kernels.metal: High-performance GPU compute kernels.

Implementation Details

zLLM follows a static memory allocation strategy: every buffer the engine needs (weights, KV cache, activations) is sized and allocated up front during model loading. The inference loop therefore performs zero runtime allocations, giving predictable, low-latency performance (sketched below).
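A minimal sketch of that idea follows. The Config and Workspace names and shapes are hypothetical, not the engine's actual types; the real engine derives its dimensions from GGUF metadata.

const std = @import("std");

// Hypothetical model dimensions, normally read from GGUF metadata.
const Config = struct {
    n_layers: usize,
    n_kv_heads: usize,
    head_dim: usize,
    max_seq_len: usize,
    hidden_dim: usize,
};

// Everything the decode loop touches, sized once at load time.
const Workspace = struct {
    kv_cache: []f16, // keys and values for every layer, full context window
    activations: []f32, // scratch for one token's hidden state

    fn init(alloc: std.mem.Allocator, cfg: Config) !Workspace {
        // 2x: one region for keys, one for values, per layer.
        const kv_elems = cfg.n_layers * 2 * cfg.max_seq_len * cfg.n_kv_heads * cfg.head_dim;
        return .{
            .kv_cache = try alloc.alloc(f16, kv_elems),
            .activations = try alloc.alloc(f32, cfg.hidden_dim),
        };
    }

    fn deinit(self: *Workspace, alloc: std.mem.Allocator) void {
        alloc.free(self.kv_cache);
        alloc.free(self.activations);
    }
};

The inference loop then only indexes into these pre-sized slices and never calls the allocator, so latency stays flat from the first token to the last.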

License

MIT
