wax
wax is a small Rust-native LLM inference engine built on
Candle.
It is intentionally narrow: load a local model, run a decoder-only Llama-like causal LM, stream tokens, measure performance, and keep the implementation easy to read.
Features
- Local inference from the command line.
- Safetensors model folders with config.json and tokenizer.json.
- Direct .gguf model files through Candle's quantized Llama backend.
- Token streaming to stdout.
- Greedy, temperature, top-k, top-p, and repetition-penalty sampling.
- EOS and max-token stopping.
- Basic timing and throughput stats.
- JSON benchmark output.
- CPU, Metal, CUDA, and Accelerate feature flags.
- MLX model folder detection with a clear conversion error.
Status
This project is early and intentionally limited.
| Area | Status |
|---|---|
| Safetensors Llama-like causal LM | Supported |
| GGUF Llama-family models | Supported |
| MLX model folders | Detected, not directly executable |
| OpenAI-compatible HTTP server | Not implemented |
| GGUF conversion | Not implemented |
| Quantization beyond GGUF backend | Not implemented |
| Batching / PagedAttention | Not implemented |
| Multimodal models | Not implemented |
MLX note: Candle does not directly execute MLX weight folders. Convert MLX
models to Hugging Face safetensors or GGUF before using them with wax.
Requirements
- Rust 1.94 or newer.
- A local model in one of the supported formats.
- macOS with Apple Silicon for the metal feature, or a CUDA environment for the cuda feature.
Build
CPU build:
cargo build -p wax-llm --release
Metal build on macOS:
cargo build -p wax-llm --release --features metal
Install the wax binary from this checkout:
cargo install --path crates/wax-llm --features metal
The package name is wax-llm; the installed binary is wax.
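To confirm the install, print the CLI help (assuming the binary follows the usual --help convention):
wax --help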
Quickstart
Download a small safetensors model:
mkdir -p models/TinyLlama-1.1B-Chat-v1.0
hf download TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
config.json \
tokenizer.json \
tokenizer_config.json \
generation_config.json \
model.safetensors \
--local-dir models/TinyLlama-1.1B-Chat-v1.0
Run generation with Metal:
cargo run -p wax-llm --features metal -- run \
--model ./models/TinyLlama-1.1B-Chat-v1.0 \
--prompt "Explain Rust ownership simply" \
--max-new-tokens 128 \
--temperature 0.7 \
--top-p 0.9 \
--stream
After cargo install, the same command is:
wax run \
--model ./models/TinyLlama-1.1B-Chat-v1.0 \
--prompt "Explain Rust ownership simply" \
--max-new-tokens 128 \
--temperature 0.7 \
--top-p 0.9 \
--stream
GGUF
Download a small GGUF model:
mkdir -p models/gguf-smollm2-360m
hf download HuggingFaceTB/SmolLM2-360M-Instruct-GGUF \
smollm2-360m-instruct-q8_0.gguf \
--local-dir models/gguf-smollm2-360m
Run it directly:
cargo run -p wax-llm --features metal -- run \
--model ./models/gguf-smollm2-360m/smollm2-360m-instruct-q8_0.gguf \
--prompt "Say hello" \
--max-new-tokens 64 \
--temperature 0 \
--stream
For GGUF, wax uses tokenizer.json next to the model if present. If not, it
tries to build a tokenizer from GGUF metadata.
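If a GGUF repository ships no tokenizer.json, one can usually be fetched from the matching non-GGUF model repository and placed next to the .gguf file; for example, assuming the base SmolLM2-360M-Instruct repo provides one:
hf download HuggingFaceTB/SmolLM2-360M-Instruct \
  tokenizer.json \
  --local-dir models/gguf-smollm2-360m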
CLI
Run text generation:
wax run \
--model ./models/my-model \
--prompt "Hello" \
--max-new-tokens 64 \
--temperature 0.7 \
--top-k 40 \
--top-p 0.9 \
--repetition-penalty 1.1 \
--seed 42 \
--device auto \
--dtype auto \
--stream
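Setting --temperature 0, as the GGUF example above does, selects deterministic greedy-style decoding (an assumption based on the sampling options listed under Features):
wax run \
  --model ./models/my-model \
  --prompt "Hello" \
  --max-new-tokens 64 \
  --temperature 0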
Benchmark a prompt:
wax bench \
--model ./models/my-model \
--prompt-file prompts/short.txt \
--runs 5 \
--max-new-tokens 128 \
--json
Device options:
auto | cpu | cuda | metal
DType options:
auto | f32 | f16 | bf16
For GGUF models, stats report dtype: "gguf" because the model's quantized
weight format is determined by the GGUF file.
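For example, to pin the device and dtype explicitly instead of relying on auto, using only the flags listed above:
wax run \
  --model ./models/my-model \
  --prompt "Hello" \
  --device metal \
  --dtype f16 \
  --max-new-tokens 64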
Model Layouts
Safetensors folder:
model/
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── generation_config.json
├── model.safetensors
└── model.safetensors.index.json
Only one of model.safetensors or model.safetensors.index.json is required.
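A sharded model keeps its weights in several files referenced by the index; a typical Hugging Face sharded layout (shard names are illustrative) looks like:
model/
├── config.json
├── tokenizer.json
├── model.safetensors.index.json
├── model-00001-of-00002.safetensors
└── model-00002-of-00002.safetensors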
GGUF:
model.gguf
or:
model/
├── model.gguf
└── tokenizer.json
If a folder contains multiple .gguf files, rename the intended file to
model.gguf or pass the exact .gguf path.
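For example, given a folder holding several quantizations, either approach works (paths are illustrative):
# Option 1: rename the file you want to the default name
mv model/smollm2-360m-instruct-q8_0.gguf model/model.gguf
# Option 2: pass the exact .gguf path
wax run --model ./model/smollm2-360m-instruct-q8_0.gguf --prompt "Hello"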
Architecture
wax
├── wax-core # loading, tokenization, generation, sampling, stats
├── wax-llm # CLI package, installs the `wax` binary
└── wax-bench # shared benchmark types/helpers
The core crate is intentionally independent of HTTP/server dependencies.
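Because of that separation, a downstream project can depend on the core crate alone; assuming wax-core 0.1 is published to crates.io (see Releasing below), a consumer adds it with:
cargo add wax-core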
Development
Run the default test suite:
cargo test --workspace --no-default-features
Run the Metal feature build:
cargo test --workspace --features metal
Run formatting and lint checks:
cargo fmt --check
cargo clippy --workspace --all-targets --no-default-features -- -D warnings
Current tests cover loader format detection, safetensors index handling, MLX detection, CLI argument behavior, sampling, stats serialization, device/dtype selection, and token streaming.
Releasing
Crates are published to crates.io by the Publish crates GitHub Actions
workflow when a GitHub Release is published from main.
Release requirements:
- Set the repository secret CARGO_REGISTRY_TOKEN to a crates.io API token.
- Bump the workspace version in Cargo.toml.
- Create a GitHub Release whose tag matches the version, for example v0.1.0.
- Create the release from main.
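One way to satisfy the last two requirements is the GitHub CLI; creating the release from the web UI works equally well:
gh release create v0.1.0 --target main --title "v0.1.0"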
The workflow publishes crates in dependency order:
1. wax-core
2. wax-bench
3. wax-llm
The package name is wax-llm, and the installed binary is wax.
Contributing
Small, focused changes are preferred. Please keep the core inference path simple and measurable.
Before opening a PR, run:
cargo fmt --check
cargo test --workspace --no-default-features
cargo clippy --workspace --all-targets --no-default-features -- -D warnings
If your change touches GPU-specific behavior, also run the relevant feature build, for example:
cargo test --workspace --features metal
License
Check out the full license here.