
wax

wax is a small Rust-native LLM inference engine built on Candle.

It is intentionally narrow: load a local model, run a decoder-only Llama-like causal LM, stream tokens, measure performance, and keep the implementation easy to read.

Features

  • Local inference from the command line.
  • Safetensors model folders with config.json and tokenizer.json.
  • Direct .gguf model files through Candle's quantized Llama backend.
  • Token streaming to stdout.
  • Greedy, temperature, top-k, top-p, and repetition-penalty sampling (see the sketch after this list).
  • EOS and max-token stopping.
  • Basic timing and throughput stats.
  • JSON benchmark output.
  • CPU, Metal, CUDA, and Accelerate feature flags.
  • MLX model folder detection with a clear conversion error.
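
The sampling flags map onto a conventional decoding pipeline. As a rough sketch of the technique (not wax's actual code), temperature scaling followed by top-p (nucleus) filtering can be written like this, assuming temperature > 0 (greedy decoding is a separate path) and a uniform draw in [0, 1) supplied by the caller:

// Illustrative temperature + top-p sampling over raw logits. This is a
// sketch, not wax's sampler: it omits top-k, repetition penalty, and RNG
// ownership.
fn sample_top_p(logits: &[f32], temperature: f32, top_p: f32, uniform: f32) -> usize {
    // Softmax over temperature-scaled logits, in a numerically stable form.
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().enumerate().map(|(i, e)| (i, e / sum)).collect();

    // Keep the smallest high-probability prefix whose mass reaches top_p.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let mut kept = 0;
    let mut cumulative = 0.0;
    for &(_, p) in &probs {
        cumulative += p;
        kept += 1;
        if cumulative >= top_p {
            break;
        }
    }
    probs.truncate(kept);

    // Renormalize the nucleus and draw one token from it.
    let mass: f32 = probs.iter().map(|&(_, p)| p).sum();
    let mut threshold = uniform * mass;
    for &(i, p) in &probs {
        threshold -= p;
        if threshold <= 0.0 {
            return i;
        }
    }
    probs.last().map(|&(i, _)| i).unwrap_or(0)
}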

Status

This project is early and intentionally limited.

Area                                   Status
Safetensors Llama-like causal LM       Supported
GGUF Llama-family models               Supported
MLX model folders                      Detected, not directly executable
OpenAI-compatible HTTP server          Not implemented
GGUF conversion                        Not implemented
Quantization beyond the GGUF backend   Not implemented
Batching / PagedAttention              Not implemented
Multimodal models                      Not implemented

MLX note: Candle does not directly execute MLX weight folders. Convert MLX models to Hugging Face safetensors or GGUF before using them with wax.

Requirements

  • Rust 1.94 or newer.
  • A local model in one of the supported formats.
  • macOS with Apple Silicon for the metal feature, or a CUDA environment for the cuda feature.

Build

CPU build:

cargo build -p wax-llm --release

Metal build on macOS:

cargo build -p wax-llm --release --features metal

Install the wax binary from this checkout:

cargo install --path crates/wax-llm --features metal

The package name is wax-llm; the installed binary is wax.

Quickstart

Download a small safetensors model:

mkdir -p models/TinyLlama-1.1B-Chat-v1.0

hf download TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  config.json \
  tokenizer.json \
  tokenizer_config.json \
  generation_config.json \
  model.safetensors \
  --local-dir models/TinyLlama-1.1B-Chat-v1.0

Run generation with Metal:

cargo run -p wax-llm --features metal -- run \
  --model ./models/TinyLlama-1.1B-Chat-v1.0 \
  --prompt "Explain Rust ownership simply" \
  --max-new-tokens 128 \
  --temperature 0.7 \
  --top-p 0.9 \
  --stream

After cargo install, the same command is:

wax run \
  --model ./models/TinyLlama-1.1B-Chat-v1.0 \
  --prompt "Explain Rust ownership simply" \
  --max-new-tokens 128 \
  --temperature 0.7 \
  --top-p 0.9 \
  --stream

GGUF

Download a small GGUF model:

mkdir -p models/gguf-smollm2-360m

hf download HuggingFaceTB/SmolLM2-360M-Instruct-GGUF \
  smollm2-360m-instruct-q8_0.gguf \
  --local-dir models/gguf-smollm2-360m

Run it directly:

cargo run -p wax-llm --features metal -- run \
  --model ./models/gguf-smollm2-360m/smollm2-360m-instruct-q8_0.gguf \
  --prompt "Say hello" \
  --max-new-tokens 64 \
  --temperature 0 \
  --stream

For GGUF models, wax uses a tokenizer.json placed next to the model file if one is present; otherwise it attempts to build a tokenizer from the GGUF metadata.
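
The lookup order can be pictured as a small helper. This is a hedged sketch, not wax-core's real API: load_gguf_tokenizer is an invented name, and the tokenizers and anyhow crates are assumed dependencies.

use std::path::Path;
use tokenizers::Tokenizer;

// Sketch of the lookup order described above. A real fallback would
// reconstruct a tokenizer from GGUF metadata keys such as
// tokenizer.ggml.model and tokenizer.ggml.tokens, which this sketch
// does not attempt.
fn load_gguf_tokenizer(gguf_path: &Path) -> anyhow::Result<Tokenizer> {
    // 1. Prefer a tokenizer.json sitting next to the .gguf file.
    let sidecar = gguf_path
        .parent()
        .unwrap_or_else(|| Path::new("."))
        .join("tokenizer.json");
    if sidecar.exists() {
        return Tokenizer::from_file(&sidecar).map_err(anyhow::Error::msg);
    }
    // 2. Otherwise the GGUF metadata is the only remaining source.
    anyhow::bail!(
        "no tokenizer.json next to {}; rebuild one from GGUF metadata",
        gguf_path.display()
    )
}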

CLI

Run text generation:

wax run \
  --model ./models/my-model \
  --prompt "Hello" \
  --max-new-tokens 64 \
  --temperature 0.7 \
  --top-k 40 \
  --top-p 0.9 \
  --repetition-penalty 1.1 \
  --seed 42 \
  --device auto \
  --dtype auto \
  --stream

Benchmark a prompt:

wax bench \
  --model ./models/my-model \
  --prompt-file prompts/short.txt \
  --runs 5 \
  --max-new-tokens 128 \
  --json

Device options:

auto | cpu | cuda | metal

DType options:

auto | f32 | f16 | bf16

For GGUF models, stats report dtype: "gguf", since the quantized weight format is baked into the GGUF file itself rather than selected by the --dtype flag.

Model Layouts

Safetensors folder:

model/
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── generation_config.json
├── model.safetensors
└── model.safetensors.index.json

Only one of model.safetensors (a single-file checkpoint) or model.safetensors.index.json (a sharded checkpoint) is required.
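
For a sharded checkpoint, model.safetensors.index.json maps each tensor name to the shard file that holds it, in the standard Hugging Face layout (abridged example; tensor names and sizes vary by model):

{
  "metadata": { "total_size": 2200119296 },
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors"
  }
}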

GGUF:

model.gguf

or:

model/
├── model.gguf
└── tokenizer.json

If a folder contains multiple .gguf files, rename the intended file to model.gguf or pass the exact .gguf path.

Architecture

wax
├── wax-core   # loading, tokenization, generation, sampling, stats
├── wax-llm    # CLI package, installs the `wax` binary
└── wax-bench  # shared benchmark types/helpers

The core crate is intentionally independent of HTTP/server dependencies.
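
Because the core carries no HTTP or server stack, embedding it in another binary (as wax-llm and wax-bench do) is an ordinary Cargo dependency:

[dependencies]
wax-core = "0.1"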

Development

Run the default test suite:

cargo test --workspace --no-default-features

Run the Metal feature build:

cargo test --workspace --features metal

Run formatting and lint checks:

cargo fmt --check
cargo clippy --workspace --all-targets --no-default-features -- -D warnings

Current tests cover loader format detection, safetensors index handling, MLX detection, CLI argument behavior, sampling, stats serialization, device/dtype selection, and token streaming.

Releasing

Crates are published to crates.io by the "Publish crates" GitHub Actions workflow whenever a GitHub Release is published from main.

Release requirements:

  • Set the repository secret CARGO_REGISTRY_TOKEN to a crates.io API token.
  • Bump the workspace version in Cargo.toml.
  • Create a GitHub Release whose tag matches the version, for example v0.1.0.
  • Create the release from main (one way, using the GitHub CLI, is shown below).
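
For example, with the GitHub CLI (the tag, title, and notes are placeholders for the actual version):

gh release create v0.1.0 --target main --title "v0.1.0" --notes "wax 0.1.0"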

The workflow publishes crates in dependency order:

  1. wax-core
  2. wax-bench
  3. wax-llm


Contributing

Small, focused changes are preferred. Please keep the core inference path simple and measurable.

Before opening a PR, run:

cargo fmt --check
cargo test --workspace --no-default-features
cargo clippy --workspace --all-targets --no-default-features -- -D warnings

If your change touches GPU-specific behavior, also run the relevant feature build, for example:

cargo test --workspace --features metal

License

wax is dual-licensed under the MIT and Apache-2.0 licenses; see the license files in the repository for the full text.
