Generic streaming inference runtime for LLMs that exceed GPU memory. It runs large language models while leaving their weights on disk. Agents doing research work can tolerate latency, so trade speed for VRAM.
Developed on Apple Silicon (MLX) today; CUDA and ROCm are on the roadmap for community contributors (see ADR 0006). I plan to open the source.
Enhancements like speculative decoding, per-token logprob capture, and replay bundles are pluggable opt-ins: they are not required for core operation and are still in development. The core POC is closed (ADR 0010); spec-decode is the first pluggable enhancement (ADR 0011, PRD at docs/product/speculative-decoding-prd.md).
This is a monorepo. Three tracks ship side by side:
- Rust runtime (repo root): the production lane. 13-crate workspace under crates/, driven by the aerollm CLI binary (cargo install --path crates/aerollm-cli). No Python interpreter required.
- Python package (python/aerollm/): an optional ergonomics layer for Jupyter / MLX users. Depends on the aerollm-api PyO3 wheel (pip install aerollm-api). Python is a compatibility surface, not a runtime; the Rust CLI path never loads Python.
- Optional enhancements (aerollm-speculative and Phase R items): pluggable subsystems opted into via --speculative / Python kwargs / Rust generic-bound methods. The core runtime works end-to-end without them.
See each subtree's own README for track-specific docs:
crates/aerollm-api/README.md for the
PyO3 bindings, python/aerollm/README.md
for the Python package.
Status: Production-ready on a narrow slice for Apple Silicon: Qwen2/Qwen2.5 and Llama 3.x bf16-eager (passing 19/19 golden-output correctness gates vs mlx_lm). In-process MLX backend with mmap shard reads, bounded LRU shard cache, double-buffered layer prefetch from NVMe, ring eviction with force-free triad (ADR 0014), speculative decoding (greedy + Leviathan-2022), temperature / top-k / top-p / min-p sampling with seeded RNGs, replay bundles, and a message-bus telemetry surface. Rust CLI and Python wheel both ship. Empirically demonstrated at frontier scale: Llama-3.1-405B-4bit (215 GB on disk) on a 36 GB M5 Max with peak RSS ~12 GB, 0.004 tok/s, EXIT 0. See docs/COMPATIBILITY.md §Tunables that actually matter for the recipe.

What aeroLLM is NOT today. Not interactive at 70B+ (Llama-3.1-405B-4bit measured at 0.004 tok/s; built for offline / overnight batch, not chat). Not GA: four ADR 0007 acceptance gates remain hardware-pending. No CUDA or ROCm backend (scaffold only; community-welcome). No native MoE (GLM-4.5, DeepSeek-V3 are Phase 4 roadmap). 4-bit oracle drops to 10/19 bit-exact at N=128 due to three pre-existing mlx_lm bf16 kernel-fusion fixtures (creative_poem, instruct_recipe, analysis_covid); the 19/19 banner is bf16-eager-only. Phase A batched B=4 amortization is correctness-clean but fails its wall-time gates; aggregate throughput gain (~1.69× at 405B, 3.15× at 7B) overlaps with the shard-cache budget bump. See ADR 0016 for the full honest scope reset.

Community welcome. CUDA and ROCm aren't shipped yet; they're planned as parallel native crates mirroring aerollm-backend-mlx-native. ADR 0006 has the concrete contribution ladder (scaffold → Qwen2 port → more architectures).

Detailed subsystem-level tracking in MILESTONES.md.
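To put the 0.004 tok/s figure in perspective, here is a short sketch of the conversion from throughput to wall-clock time. The helper names are illustrative, and the 8-hour window in the usage note is an assumed example of an overnight run, not a measured benchmark.

```rust
/// Seconds spent producing a single token at a given throughput.
fn seconds_per_token(tokens_per_second: f64) -> f64 {
    1.0 / tokens_per_second
}

/// Tokens produced over a wall-clock window, assuming constant throughput.
fn tokens_in_window(hours: f64, tokens_per_second: f64) -> f64 {
    hours * 3600.0 * tokens_per_second
}
```

At 0.004 tok/s each token takes about 250 seconds, so an 8-hour run yields roughly 115 tokens. That is why the 405B configuration is framed as offline batch rather than chat.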
"Can it run X?" → docs/COMPATIBILITY.md is the canonical, honest answer (architecture × quant × backend × feature, native vs measured-via-other-engine vs roadmap, with what's actually been verified). When this README and the matrix disagree, the matrix wins.
"Will it run on my Mac?" → docs/what-runs-on-your-mac.md. Commodity-hardware reality table — what actually hits chat-grade speed on 8 / 16 / 24 / 36 / 48–64 / 96–128 GB unified memory, measured where measurable and honestly extrapolated elsewhere. Includes the explicit "what aeroLLM does not claim" list — the headline being that we do not position aeroLLM as "405 B on 8 GB VRAM at chat speed." That framing is technically true on the memory axis, grossly misleading on the speed axis, and not something this project endorses. The actual commodity-hardware headline: Qwen3-30B-A3B-4bit MoE chats at ~31 tok/s on a 24 GB Mac mini, measured.
Modern LLMs are outgrowing consumer hardware: a 70B-parameter dense model needs ~140 GB in bf16, a 355B-parameter MoE like GLM-4.5 needs ~710 GB. AeroLLM streams weights from NVMe into memory layer-by-layer, so the resident working set is bounded by the largest single layer (+ KV cache), not by the model.
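The sizes quoted above follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming decimal gigabytes and ignoring KV cache, activations, and quantization metadata. The function names are illustrative and not part of AeroLLM, and the even per-layer split is a simplifying assumption that ignores embeddings and the output head.

```rust
/// Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes).
/// Billions of parameters times bytes per parameter gives gigabytes directly.
fn weight_footprint_gb(params_billions: f64, bits_per_param: f64) -> f64 {
    params_billions * bits_per_param / 8.0
}

/// Rough average size of one layer, assuming weights are spread evenly
/// across `num_layers` (an assumption made only for this estimate).
fn average_layer_gb(total_gb: f64, num_layers: u32) -> f64 {
    total_gb / f64::from(num_layers)
}
```

For example, `weight_footprint_gb(70.0, 16.0)` gives 140 GB and `weight_footprint_gb(355.0, 16.0)` gives 710 GB. Spread evenly over an 80-layer model, 140 GB is about 1.75 GB per layer, which is the order of magnitude the resident working set is bounded by when only one layer needs to be in memory at a time.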
The project is a multi-phase build from a tiny dense model on day one to a 1 TB-class MoE on a Mac Studio by Phase 4.
AeroLLM ships zero outbound network calls. All inference is local. Model downloads (via the user's own
huggingface-cli) are the only network activity in the project's surface. The runtime, sampler, spec-decode driver, and replay subsystem are fully offline.
AeroLLM streams weights layer-by-layer from disk. That requires a native architecture port plus a format it can mmap cheaply — so what you pick depends on which lane you're using.
Looking for the full compatibility matrix? See docs/COMPATIBILITY.md for architecture × quantization × backend × feature support, plus known bugs and hardware-pending validations. The summary below covers the green production path; the doc covers the rest honestly.
The in-process mlx-native backend (--backend mlx-native,
macOS / Apple Silicon) is the production lane:
| Family | Sizes | Format | Status |
|---|---|---|---|
| Qwen2 / Qwen2.5 | 0.5B, 7B verified (larger sizes share the arch, unverified) | bf16 safetensors | Supported ✅ (19/19 gate) |
| Llama 3.x (3.1, 3.2, 3.3) | 1B verified (3B/8B/70B share the arch, unverified) | bf16 safetensors | Supported ✅ (19/19 gate) |
| Qwen2 / Qwen2.5 | any | 4-bit (mlx-community Q4) safetensors | ✅ Forward verified (token-identical to mlx_lm Q4); expect normal Q4 quality cost vs bf16; see COMPATIBILITY.md |
Note on sizes: the correctness gates run on the small checkpoints (0.5B/7B Qwen, 1B Llama). Larger sizes use the identical architecture port and are expected to work, but aren't individually gate-verified — COMPATIBILITY.md is honest about exactly what's been run.
MoE? Not natively — aeroLLM is dense-only today. MoE is measured via other engines in the benchmark matrix, with native MoE as the 2.0 headline. Full detail: the MoE status section.
# Qwen2.5
aerollm generate --backend mlx-native \
--model /path/to/Qwen2.5-7B-Instruct \
--prompt "Summarize the last email I got."
# Llama 3.x
aerollm generate --backend mlx-native \
--model /path/to/Llama-3.1-8B-Instruct \
--prompt "Explain RoPE positional encoding."

| Family | Phase | Notes |
|---|---|---|
| Mistral 7B / Mixtral-dense | 3 | Sliding-window attention variant |
| GLM-4.5 (MoE, 355B) | 4 | Full MoE router + per-expert streaming — P4 flagship |
| Llama 4 (MoE) | 4 | Reuses the GLM-4.5 MoE plumbing |
The aerollm Python package (python/aerollm/) wraps the Rust
runtime through the aerollm-api PyO3 wheel, so model support is
identical: Qwen2 and Llama 3.x today, MoE at Phase 4. Useful
from Jupyter or for one-off scripts; the Rust CLI remains the
production path.
The subprocess backend (aerollm-backend-mlx) shells out to mlx-lm and loads the whole model into memory on every call. It's still there for CI, A/B comparisons, and as a --legacy-subprocess fallback (see ADR 0003), and it will run any architecture mlx-lm supports, but it does not stream from disk. If streaming is what you're here for, use mlx-native.
CUDA is not shipped, but it is also not stuck waiting on a particular release. ADR 0006 decides that CUDA lives as aerollm-backend-cuda, a new workspace crate parallel to aerollm-backend-mlx-native, and a contribution ladder breaks the work into chunks:
- Scaffold the crate + Qwen2 port — unblocks Linux users
- Llama 2/3 port (small delta on Qwen2's skeleton)
- GGUF loader, more architectures
- ROCm via HIP (same pattern as CUDA)
The StreamingBackend trait in aerollm-backend is already
device-agnostic (audited 2026-04-23). A contributor can add CUDA
support without touching any existing crate's public API.
In the meantime, if you need streaming on a Linux GPU host today, AirLLM works — AeroLLM is a Rust/production successor but AirLLM's Python lane is still the only shipping option there.
AirLLM proved the layer-streaming thesis; AeroLLM is the production rewrite. The Python package in this monorepo descends from the AirLLM fork, but the point of the Rust workspace is to strip out the Python hot path AirLLM was built on:
- Single binary. The aerollm CLI is a Rust process. Python bindings exist but are optional and off the hot path: aerollm generate never loads an interpreter.
- Deterministic lifecycle. Init → Loading → Running → ShuttingDown is enforced by a state machine and observable on the in-process bus. No "did the model load yet?" ambiguity.
- Real prefetch. A double-buffered worker (aerollm-prefetch) reads layer N+1 from NVMe while the GPU is executing layer N. AirLLM loads each layer synchronously.
- Sharded KV cache. aerollm-kv uses parking_lot::RwLock + Condvar, sharded to the number of attention heads, so single-writer contention on long contexts is gone.
- Two-tier arena. Persistent weights and per-request ephemeral tensors live in separate bumpalo arenas, so request teardown is a single reset() call, not a heap walk.
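The "real prefetch" idea in the list above can be pictured with a loader thread and a rendezvous channel. This is a sketch under stated assumptions, not the aerollm-prefetch implementation: `load_layer` and `execute_layer` are hypothetical stand-ins for the NVMe read and the GPU step, and it uses only the standard library.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for reading one layer's weights from disk.
fn load_layer(index: usize) -> Vec<f32> {
    vec![index as f32; 4]
}

/// Hypothetical stand-in for executing one layer on the accelerator.
fn execute_layer(state: f32, weights: &[f32]) -> f32 {
    state + weights.iter().sum::<f32>()
}

/// Runs `num_layers` layers, overlapping the load of layer N+1 with the
/// execution of layer N.
fn run_with_prefetch(num_layers: usize) -> f32 {
    // Capacity 0 makes this a rendezvous channel: the loader can hold at
    // most one finished layer while the consumer executes the previous
    // one, so no more than two layers are resident at once.
    let (tx, rx) = sync_channel::<Vec<f32>>(0);

    let loader = thread::spawn(move || {
        for index in 0..num_layers {
            let weights = load_layer(index);
            // Blocks until the consumer asks for the next layer.
            if tx.send(weights).is_err() {
                break; // the consumer has gone away
            }
        }
    });

    let mut state = 0.0_f32;
    for weights in rx {
        state = execute_layer(state, &weights);
        // `weights` is dropped at the end of each iteration, releasing
        // that layer's memory before the next one is received.
    }
    loader.join().expect("loader thread panicked");
    state
}
```

A synchronous loader would instead call `load_layer` and `execute_layer` back to back, so the I/O time and compute time add up instead of overlapping.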
AirLLM is fine on a research laptop. AeroLLM is where you go when you need the same trick to survive a long-running server.
flowchart LR
cli[aerollm-cli] --> rt[aerollm-runtime]
rt --> be[aerollm-backend<br/>trait]
be --> native[aerollm-backend-mlx-native<br/>in-process MLX · macOS]
be --> shim[aerollm-backend-mlx<br/>subprocess · CI / legacy]
be --> noop[NoOpBackend<br/>reference / Linux]
rt --> spec[aerollm-speculative<br/>greedy + Leviathan-2022]
rt --> bus[aerollm-bus<br/>lifecycle / telemetry]
rt --> kv[aerollm-kv<br/>sharded cache]
rt --> arena[aerollm-arena<br/>two-tier bumpalo]
native --> prefetch[aerollm-prefetch<br/>double-buffered NVMe]
native --> tok_q[aerollm-tokenizer-qwen2]
native --> tok_l[aerollm-tokenizer-llama]
All crates live under crates/. The CLI is a thin dispatcher;
aerollm-runtime drives the full lifecycle. aerollm-backend-mlx-native
is the production lane — in-process Metal via mlx-rs with double-buffered
layer streaming from NVMe.
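The lifecycle that aerollm-runtime drives (Init → Loading → Running → ShuttingDown) can be sketched as a small state machine that rejects out-of-order transitions. The type and method names here are illustrative assumptions; the real LifecycleController API is not shown in this README, and this sketch allows only the forward chain.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lifecycle {
    Init,
    Loading,
    Running,
    ShuttingDown,
}

impl Lifecycle {
    /// Returns the new state if `self -> next` is one of the forward
    /// steps in the chain, and an error describing the attempt otherwise.
    fn transition(self, next: Lifecycle) -> Result<Lifecycle, String> {
        match (self, next) {
            (Self::Init, Self::Loading)
            | (Self::Loading, Self::Running)
            | (Self::Running, Self::ShuttingDown) => Ok(next),
            _ => Err(format!("illegal transition: {:?} -> {:?}", self, next)),
        }
    }
}
```

Because every state change goes through one function, a caller can never observe a runtime that is Running without having passed through Loading.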
# 1. Clone + build
git clone https://github.com/cdarnell/aerollm
cd aerollm
cargo build --release
# 2. Download a model (HuggingFace Hub, one-time)
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct \
--local-dir ~/models/Qwen2.5-0.5B-Instruct
# 3. Install the MLX runtime dependency (macOS, one-time)
pip install mlx-lm
# 4. Generate
./target/release/aerollm generate \
--backend mlx-native \
--model ~/models/Qwen2.5-0.5B-Instruct \
--prompt "Write one sentence about Apple Silicon."

Expected: a streamed sentence about Apple Silicon on stdout.
The no-op backend works anywhere — handy for CI, Linux workstations, or just sanity-checking the wiring:
mkdir -p /tmp/aerollm-noop
./target/release/aerollm generate \
--backend noop \
--model /tmp/aerollm-noop \
--prompt "hello"
# Apple Silicon uses a unified-memory architecture.

aerollm/ # monorepo root
├── crates/ # Rust workspace (production lane)
│ ├── aerollm-tensor/ # dtype, shape, tensor buffer
│ ├── aerollm-instrumentation/ # atomic counters + tracing
│ ├── aerollm-bus/ # in-process event bus (lifecycle + telemetry)
│ ├── aerollm-arena/ # two-tier bumpalo arena (persistent + ephemeral)
│ ├── aerollm-kv/ # sharded KV cache (parking_lot RwLock + Condvar)
│ ├── aerollm-prefetch/ # double-buffered weight prefetch worker
│ ├── aerollm-backend/ # Backend + StreamingBackend traits + NoOpBackend
│ ├── aerollm-backend-mlx/ # MLX subprocess shim (CI / --legacy-subprocess)
│ ├── aerollm-backend-mlx-native/ # in-process MLX via mlx-rs 0.25 (macOS, production)
│ ├── aerollm-backend-cuda/ # CUDA scaffold (community contribution target)
│ ├── aerollm-tokenizer-qwen2/ # Qwen2 BPE + ChatML template
│ ├── aerollm-tokenizer-llama/ # Llama 3.x BPE + chat template
│ ├── aerollm-speculative/ # spec-decode driver (greedy + Leviathan-2022)
│ ├── aerollm-runtime/ # AeroRuntime<B> orchestrator + LifecycleController
│ ├── aerollm-correctness/ # golden-fixture harness + perplexity regression suite
│ ├── aerollm-bench/ # criterion benches (KV / arena / generate / spec-decode)
│ ├── aerollm-cli/ # `aerollm` binary
│ └── aerollm-api/ # PyO3 bindings → wheel
├── python/
│ └── aerollm/ # Python package (ergonomics lane, optional)
│ ├── aerollm/mlx/ # adapter — delegates to aerollm_api
│ ├── pyproject.toml
│ └── tests/
├── docs/
│ ├── decisions/ # ADRs 0001–0007
│ └── benchmarks/ # AirLLM comparison plan + bench recipes
├── fuzz/ # cargo-fuzz targets (ChatML, safetensors, replay bundle)
├── tests/
│ ├── fixtures/ # Model weights (not checked in)
│ └── perplexity/ # 100-prompt perplexity regression fixture
├── api-baseline/ # cargo public-api snapshots (CI gate)
├── .github/workflows/ # CI: tests, clippy, fuzz, release, public-api
├── MILESTONES.md
├── MAINTAINERS.md
├── SECURITY.md
├── NOTICE
└── LICENSE # Apache-2.0
- ADR 0001: Crate prefix is aerollm-*; product name remains AeroLLM.
- ADR 0002: In-process message bus deferred to Phase 1 for lifecycle + telemetry fan-out.
- ADR 0003: Adopt mlx-rs in Phase 2 as a new aerollm-backend-mlx-native crate; keep the subprocess shim alongside it for CI and a --legacy-subprocess fallback.
- ADR 0004: 4-bit quantization strategy for the in-process MLX backend (mlx-community checkpoint format, group-wise affine quantization).
- ADR 0005: Speculative decoding is a first-class subsystem (aerollm-speculative crate, StreamingBackend spec-decode hooks, dual-backend driver).
- ADR 0006: Multi-backend strategy + CUDA support. Community contribution ladder; trait audit; crate skeleton sketch.
- ADR 0007: 1.0 GA scope: Apple Silicon, Qwen2/Qwen2.5 + Llama 3.x dense, 15 acceptance criteria, MoE benchmarking plan.
- MLX integration: Phase 0 subprocess-shim decision (superseded for Phase 2+ by ADR 0003).
cargo test --workspace
cargo clippy --workspace --all-targets -- -D warnings
cargo fmt --all -- --check

Rust toolchain pinned to 1.95.0 via rust-toolchain.toml.
Backends implement the
StreamingBackend supertrait
on top of Backend. The runtime drives generation a layer at a time
via AeroRuntime::generate_streaming, with weights fed from the
double-buffered PrefetchWorker:
tokenize(prompt) → prompt_tokens
begin_request(prompt_tokens) → state
loop until EOS or max_new_tokens:
    apply_pre_block(state)
    for layer in 0..num_layers:
        apply_block(state, weights[layer]) ← weights from PrefetchWorker
    logits = apply_post_block(state)
    next = sample(logits)
    publish Telemetry::TokenEmitted
detokenize(emitted) → text
NoOpBackend implements both surfaces, so cargo test -p aerollm-runtime
exercises the round-trip end-to-end without any model present.
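The loop above can be written out against a toy trait to show its shape. This is a sketch, not the real StreamingBackend: the method names follow the pseudocode, but the signatures, `ToyBackend`, `CountingBackend`, and greedy `argmax` sampling are all assumptions made for illustration, and telemetry publishing and tokenization are left out.

```rust
/// Toy stand-in for a streaming backend. Method names follow the
/// pseudocode; the signatures are assumptions made for this sketch.
trait ToyBackend {
    type State;
    fn num_layers(&self) -> usize;
    fn begin_request(&self, prompt_tokens: &[u32]) -> Self::State;
    fn apply_pre_block(&self, state: &mut Self::State);
    fn apply_block(&self, state: &mut Self::State, layer: usize);
    fn apply_post_block(&self, state: &mut Self::State) -> Vec<f32>;
}

/// Greedy sampling: index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &value) in logits.iter().enumerate() {
        if value > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// One decode step per iteration: pre-block, every layer in order,
/// post-block, then sample. Stops at EOS or after `max_new_tokens`.
fn generate_streaming<B: ToyBackend>(
    backend: &B,
    prompt_tokens: &[u32],
    eos: u32,
    max_new_tokens: usize,
) -> Vec<u32> {
    let mut state = backend.begin_request(prompt_tokens);
    let mut emitted = Vec::new();
    while emitted.len() < max_new_tokens {
        backend.apply_pre_block(&mut state);
        for layer in 0..backend.num_layers() {
            backend.apply_block(&mut state, layer);
        }
        let logits = backend.apply_post_block(&mut state);
        let next = argmax(&logits);
        if next == eos {
            break;
        }
        emitted.push(next);
    }
    emitted
}

/// Hypothetical model-free backend: decode step k puts the largest
/// logit on token `k % vocab`. `vocab` must be non-zero.
struct CountingBackend {
    vocab: usize,
    layers: usize,
}

struct CountingState {
    step: usize,
    layers_run: usize,
}

impl ToyBackend for CountingBackend {
    type State = CountingState;

    fn num_layers(&self) -> usize {
        self.layers
    }
    fn begin_request(&self, _prompt_tokens: &[u32]) -> CountingState {
        CountingState { step: 0, layers_run: 0 }
    }
    fn apply_pre_block(&self, state: &mut CountingState) {
        state.layers_run = 0;
    }
    fn apply_block(&self, state: &mut CountingState, _layer: usize) {
        state.layers_run += 1;
    }
    fn apply_post_block(&self, state: &mut CountingState) -> Vec<f32> {
        // Every layer must have run before logits are produced.
        assert_eq!(state.layers_run, self.layers);
        state.step += 1;
        let mut logits = vec![0.0; self.vocab];
        logits[state.step % self.vocab] = 1.0;
        logits
    }
}
```

With `CountingBackend { vocab: 4, layers: 3 }` and `eos = 0`, the loop emits tokens 1, 2, 3 and stops when token 0 is sampled on the fourth step. A model-free backend like this is the same idea as testing the round-trip without any weights present.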
Criterion suites live in crates/aerollm-bench:
cargo bench -p aerollm-bench --bench kv_cache # KV r/w, 1/4/16 shards, concurrent writers
cargo bench -p aerollm-bench --bench arena # PersistentArena + EphemeralArena scope cost
cargo bench -p aerollm-bench --bench generate # AeroRuntime::generate through NoOpBackend

The generate bench isolates runtime overhead (lifecycle gate, request-id counter, ephemeral arena scope, bus publish) from model-execution cost, which makes it useful as a regression floor.
If you want to call aeroLLM in-process from another Python application
ahead of the PyPI publish, the copy-pasteable recipe lives in
docs/integration-guide.md. It covers
building the wheel with maturin develop, the thread-affine Runtime
invariant (the non-obvious bit — MLX's Metal context is unsendable),
ChatML wrapping, ring eviction, env-var conventions, and a 5-second
smoke test.
Before deploying, size the box with
docs/capacity-sizing.md — formulas for
weights / KV / prefill peak / ring_depth against the two measured
anchors (70B-4bit and 405B-4bit on 24 GB), a model-by-model reference
table, an extrapolation rule when you don't have a measurement, and a
decision rule for when to enable ring_depth.
AeroLLM is Apache-2.0 and open to PRs. The highest-leverage contribution paths, in rough priority:
- CUDA backend port (aerollm-backend-cuda). Start from ADR 0006; it has a crate skeleton and a phased plan. aerollm-backend-mlx-native is the reference implementation; bit-identical output parity (modulo f32 tolerance) against that backend is the acceptance gate.
- Mistral architecture port on the existing MLX backend. The main delta from Qwen2/Llama3 is sliding-window attention. Follow the crates/aerollm-backend-mlx-native/src/llama3.rs pattern; dispatch via model_type in MlxNativeBackend::generate_macos.
- Spec-decode benchmarks for more draft/target pairs. spec_decode_acceptance currently measures Qwen2.5-0.5B draft + 7B target; the community needs data on which pairs clear the 0.6 acceptance rate threshold.
- Tiny-model fixtures for CI. A ~20 MB test checkpoint that lets PRs run the full streaming path without a 4 GB download.
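For contributors collecting spec-decode data, the sketch below shows one way to turn raw counts into the acceptance rate compared against the 0.6 threshold. It assumes the rate is defined as accepted draft tokens divided by proposed draft tokens, and that greedy verification accepts the longest prefix where the draft agrees with the target's own greedy choice. Both definitions and all function names are assumptions for illustration, not the spec_decode_acceptance bench's actual code.

```rust
/// Greedy verification: number of leading draft tokens that match the
/// tokens the target model would have picked itself.
fn accepted_prefix_len(draft: &[u32], target: &[u32]) -> usize {
    draft
        .iter()
        .zip(target)
        .take_while(|(d, t)| d == t)
        .count()
}

/// Accepted / proposed, or `None` when the counts are not meaningful.
fn acceptance_rate(proposed: u64, accepted: u64) -> Option<f64> {
    if proposed == 0 || accepted > proposed {
        return None;
    }
    Some(accepted as f64 / proposed as f64)
}

/// True when a draft/target pair meets or exceeds the given threshold.
fn clears_threshold(proposed: u64, accepted: u64, threshold: f64) -> bool {
    acceptance_rate(proposed, accepted).map_or(false, |rate| rate >= threshold)
}
```

For example, a draft of `[1, 2, 3, 4]` checked against target choices `[1, 2, 9, 4]` has an accepted prefix of 2, and a run with 7 accepted out of 10 proposed tokens clears a 0.6 threshold.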
Before starting, run cargo test --workspace and cargo clippy --workspace --all-targets -- -D warnings locally so you know the
baseline is green. CI gates on both.
Ideas that aren't ready yet:
- Custom CUDA kernels beyond what cudarc / tch provide: wait until the CUDA scaffold baseline is stable
- ROCm: follows the CUDA pattern; queue it behind the CUDA port
- Windows: deferred; WSL2 + CUDA should work if someone wants to smoke-test
Open an issue or start a draft PR to discuss scope before a large change.