AeroLLM

Generic streaming inference runtime for LLMs that exceed GPU memory. It runs large language models while leaving their weights on disk. Agents doing research don't mind the latency, so save the VRAM.

Developed on Apple Silicon (MLX) today; CUDA and ROCm are on the roadmap as community contributions (see ADR 0006). I plan to open the source.

Enhancements like speculative decoding, per-token logprob capture, and replay bundles are pluggable opt-ins: they are not required for core operation and are still in development. The core POC is closed (ADR 0010); spec-decode is the first pluggable enhancement (ADR 0011, PRD at docs/product/speculative-decoding-prd.md).

This is a monorepo. Three tracks ship side by side:

  • Rust runtime (repo root) — the production lane. 13-crate workspace under crates/, driven by the aerollm CLI binary (cargo install --path crates/aerollm-cli). No Python interpreter required.
  • Python package (python/aerollm/) — an optional ergonomics layer for Jupyter / MLX users. Depends on the aerollm-api PyO3 wheel (pip install aerollm-api). Python is a compatibility surface, not a runtime — the Rust CLI path never loads Python.
  • Optional enhancements (aerollm-speculative and Phase R items) — pluggable subsystems opted into via --speculative / Python kwargs / Rust generic-bound methods. The core runtime works end-to-end without them.

See each subtree's own README for track-specific docs: crates/aerollm-api/README.md for the PyO3 bindings, python/aerollm/README.md for the Python package.

Status: Production-ready on a narrow slice for Apple Silicon: Qwen2/Qwen2.5 and Llama 3.x bf16-eager (passing 19/19 golden-output correctness gates vs mlx_lm). The in-process MLX backend provides mmap shard reads, a bounded LRU shard cache, double-buffered layer prefetch from NVMe, ring eviction with force-free triad (ADR 0014), speculative decoding (greedy + Leviathan-2022), temperature / top-k / top-p / min-p sampling with seeded RNGs, replay bundles, and a message-bus telemetry surface. Rust CLI and Python wheel both ship. Empirically demonstrated at frontier scale: Llama-3.1-405B-4bit (215 GB on disk) on a 36 GB M5 Max with peak RSS ~12 GB, 0.004 tok/s, EXIT 0; see docs/COMPATIBILITY.md §Tunables that actually matter for the recipe.
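For readers unfamiliar with the sampling knobs listed in the status paragraph, here is a minimal pure-Python sketch of temperature / top-k / top-p / min-p sampling with a seeded RNG. It is not AeroLLM's sampler: the function name `sample`, its parameters, and the order in which the filters are applied are assumptions made for illustration.

```python
import math
import random

def sample(logits, rng, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0):
    """Pick one token id from raw logits (hypothetical helper, not the AeroLLM API)."""
    if temperature <= 0.0:
        # Treat zero temperature as greedy decoding (argmax).
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    # (probability, token_id) pairs, most likely first.
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep the k most likely tokens
    if min_p > 0.0:
        floor = min_p * ranked[0][0]  # threshold relative to the best token
        ranked = [(p, i) for p, i in ranked if p >= floor]
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for p, i in ranked:  # smallest prefix whose mass reaches top_p
            kept.append((p, i))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    ids = [i for _, i in ranked]
    weights = [p for p, _ in ranked]
    return rng.choices(ids, weights=weights, k=1)[0]

logits = [0.1, 2.0, 0.3, 1.5]
rng_a, rng_b = random.Random(42), random.Random(42)
run_a = [sample(logits, rng_a, temperature=0.8, top_k=3, top_p=0.95) for _ in range(5)]
run_b = [sample(logits, rng_b, temperature=0.8, top_k=3, top_p=0.95) for _ in range(5)]
print(run_a == run_b)  # True: the same seed reproduces the same draws
```

Seeding matters for the replay use case: two runs with the same seed and the same logits make the same choices, so a recorded generation can be reproduced.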

What aeroLLM is NOT today. Not interactive at 70B+ (Llama-3.1-405B-4bit measured at 0.004 tok/s — built for offline / overnight batch, not chat). Not GA — four ADR 0007 acceptance gates remain hardware-pending. No CUDA or ROCm backend (scaffold only; community-welcome). No native MoE (GLM-4.5, DeepSeek-V3 are Phase 4 roadmap). 4-bit oracle drops to 10/19 bit-exact at N=128 due to three pre-existing mlx_lm bf16 kernel-fusion fixtures (creative_poem, instruct_recipe, analysis_covid) — the 19/19 banner is bf16-eager-only. Phase A batched B=4 amortization is correctness-clean but fails its wall-time gates; aggregate throughput gain (~1.69× at 405B, 3.15× at 7B) overlaps with the shard-cache budget bump. See ADR 0016 for the full honest scope reset.
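To put the measured 0.004 tok/s in wall-clock terms, the arithmetic below converts a throughput into seconds per token and hours per generation. The token counts are illustrative choices, not measurements from the project.

```python
def seconds_per_token(tok_per_s: float) -> float:
    """Wall-clock seconds spent on each generated token."""
    return 1.0 / tok_per_s

def hours_for(tokens: int, tok_per_s: float) -> float:
    """Wall-clock hours needed to generate `tokens` tokens."""
    return tokens / tok_per_s / 3600.0

rate = 0.004  # tok/s, the 405B-4bit figure quoted above
print(round(seconds_per_token(rate)))  # 250 seconds per token
print(round(hours_for(128, rate), 1))  # 8.9 hours for 128 tokens
print(round(hours_for(1000, rate), 1))  # 69.4 hours for 1000 tokens
```

At roughly four minutes per token, a short answer is an overnight job, which is why the 405B configuration is positioned as batch work rather than chat.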

Community welcome. CUDA and ROCm aren't shipped yet — they're planned as parallel native crates mirroring aerollm-backend-mlx-native. ADR 0006 has the concrete contribution ladder (scaffold → Qwen2 port → more architectures).

Detailed subsystem-level tracking in MILESTONES.md.

"Can it run X?" → docs/COMPATIBILITY.md is the canonical, honest answer (architecture × quant × backend × feature, native vs measured-via-other-engine vs roadmap, with what's actually been verified). When this README and the matrix disagree, the matrix wins.

"Will it run on my Mac?" → docs/what-runs-on-your-mac.md. Commodity-hardware reality table — what actually hits chat-grade speed on 8 / 16 / 24 / 36 / 48–64 / 96–128 GB unified memory, measured where measurable and honestly extrapolated elsewhere. Includes the explicit "what aeroLLM does not claim" list — the headline being that we do not position aeroLLM as "405 B on 8 GB VRAM at chat speed." That framing is technically true on the memory axis, grossly misleading on the speed axis, and not something this project endorses. The actual commodity-hardware headline: Qwen3-30B-A3B-4bit MoE chats at ~31 tok/s on a 24 GB Mac mini, measured.

Why

Modern LLMs are outgrowing consumer hardware: a 70B-parameter dense model needs ~140 GB in bf16, a 355B-parameter MoE like GLM-4.5 needs ~710 GB. AeroLLM streams weights from NVMe into memory layer-by-layer, so the resident working set is bounded by the largest single layer (+ KV cache), not by the model.
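The figures above follow from parameters × bytes per parameter. A small sketch of that arithmetic, plus a rough per-layer working-set estimate, is below. The even split across layers is a simplifying assumption (it ignores embeddings and the output head), and the 4-bit entry ignores quantization scale/bias overhead.

```python
# Decimal gigabytes (1 GB = 1e9 bytes), matching the figures quoted above.
BYTES_PER_PARAM = {"bf16": 2.0, "4bit": 0.5}

def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight footprint in GB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

def per_layer_gb(params_billion: float, dtype: str, num_layers: int) -> float:
    """Rough size of one layer if weights were spread evenly (an assumption)."""
    return weight_gb(params_billion, dtype) / num_layers

print(weight_gb(70, "bf16"))         # 140.0
print(weight_gb(355, "bf16"))        # 710.0
print(per_layer_gb(70, "bf16", 80))  # 1.75 (80 layers, as in 70B Llama models)
```

Under these assumptions a 70B bf16 model needs about 140 GB in total but under 2 GB for any single layer, which is the gap layer-by-layer streaming exploits.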

The project is a multi-phase build from a tiny dense model on day one to a 1 TB-class MoE on a Mac Studio by Phase 4.

AeroLLM makes zero outbound network calls. All inference is local. Model downloads (via the user's own huggingface-cli) are the only network activity in the project's surface. The runtime, sampler, spec-decode driver, and replay subsystem are fully offline.

Supported models

AeroLLM streams weights layer-by-layer from disk. That requires a native architecture port plus a format it can mmap cheaply — so what you pick depends on which lane you're using.
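To illustrate why safetensors is cheap to mmap: the file starts with an 8-byte little-endian header length, then a JSON header giving each tensor's dtype, shape, and byte offsets, then the raw tensor data. A reader can map the file and slice out one tensor without touching the rest. The sketch below writes a tiny file in that layout and reads one tensor back; the helper names and tensor names are invented for the example, and this is not AeroLLM's loader.

```python
import json
import mmap
import os
import struct
import tempfile

def write_tiny_safetensors(path, tensors):
    """tensors maps name -> (dtype string, shape list, raw bytes)."""
    header, blob, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        blob += raw
        offset += len(raw)
    encoded = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(encoded)))  # 8-byte header length
        f.write(encoded)                          # JSON header
        f.write(blob)                             # tensor bytes

def read_tensor_bytes(path, name):
    """Map the file and copy out only the bytes of one tensor."""
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        (header_len,) = struct.unpack("<Q", mm[:8])
        header = json.loads(mm[8:8 + header_len])
        begin, end = header[name]["data_offsets"]
        base = 8 + header_len  # offsets are relative to the data section
        return bytes(mm[base + begin:base + end])

path = os.path.join(tempfile.mkdtemp(), "tiny.safetensors")
write_tiny_safetensors(path, {
    "layers.0.w": ("F32", [4], struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)),
    "layers.1.w": ("F32", [4], struct.pack("<4f", 5.0, 6.0, 7.0, 8.0)),
})
print(struct.unpack("<4f", read_tensor_bytes(path, "layers.1.w")))
# (5.0, 6.0, 7.0, 8.0)
```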

Looking for the full compatibility matrix? See docs/COMPATIBILITY.md for architecture × quantization × backend × feature support, plus known bugs and hardware-pending validations. The summary below covers the green production path; the doc covers the rest honestly.

Supported today: Qwen2 and Llama 3.x on the native Rust backend

The in-process mlx-native backend (--backend mlx-native, macOS / Apple Silicon) is the production lane:

| Family | Sizes | Format | Status |
| --- | --- | --- | --- |
| Qwen2 / Qwen2.5 | 0.5B, 7B verified (larger sizes share the arch, unverified) | bf16 safetensors | Supported ✅ (19/19 gate) |
| Llama 3.x (3.1, 3.2, 3.3) | 1B verified (3B/8B/70B share the arch, unverified) | bf16 safetensors | Supported ✅ (19/19 gate) |
| Qwen2 / Qwen2.5 | any | 4-bit (mlx-community Q4) safetensors | ✅ Forward verified (token-identical to mlx_lm Q4); expect normal Q4 quality cost vs bf16; see COMPATIBILITY.md |

Note on sizes: the correctness gates run on the small checkpoints (0.5B/7B Qwen, 1B Llama). Larger sizes use the identical architecture port and are expected to work, but aren't individually gate-verified — COMPATIBILITY.md is honest about exactly what's been run.

MoE? Not natively — aeroLLM is dense-only today. MoE is measured via other engines in the benchmark matrix, with native MoE as the 2.0 headline. Full detail: the MoE status section.

# Qwen2.5
aerollm generate --backend mlx-native \
    --model /path/to/Qwen2.5-7B-Instruct \
    --prompt "Summarize the last email I got."

# Llama 3.x
aerollm generate --backend mlx-native \
    --model /path/to/Llama-3.1-8B-Instruct \
    --prompt "Explain RoPE positional encoding."

Roadmap

| Family | Phase | Notes |
| --- | --- | --- |
| Mistral 7B / Mixtral-dense | 3 | Sliding-window attention variant |
| GLM-4.5 (MoE, 355B) | 4 | Full MoE router + per-expert streaming (P4 flagship) |
| Llama 4 (MoE) | 4 | Reuses the GLM-4.5 MoE plumbing |

Python-lane shortcut

The aerollm Python package (python/aerollm/) wraps the Rust runtime through the aerollm-api PyO3 wheel, so model support is identical: Qwen2 and Llama 3.x today, MoE at Phase 4. Useful from Jupyter or for one-off scripts; the Rust CLI remains the production path.

What about the subprocess shim (--backend mlx)?

That backend shells out to mlx-lm and loads the whole model into memory on every call. It's still there for CI, A/B comparisons, and as a --legacy-subprocess fallback (see ADR 0003), and it will run any architecture mlx-lm supports — but it does not stream from disk. If streaming is what you're here for, use mlx-native.

Linux / CUDA / ROCm?

Not shipped — but also not stuck waiting on a particular release. ADR 0006 decides CUDA lives as aerollm-backend-cuda, a new workspace crate parallel to aerollm-backend-mlx-native, and a contribution ladder breaks it into chunks:

  1. Scaffold the crate + Qwen2 port — unblocks Linux users
  2. Llama 2/3 port (small delta on Qwen2's skeleton)
  3. GGUF loader, more architectures
  4. ROCm via HIP (same pattern as CUDA)

The StreamingBackend trait in aerollm-backend is already device-agnostic (audited 2026-04-23). A contributor can add CUDA support without touching any existing crate's public API.

In the meantime, if you need streaming on a Linux GPU host today, AirLLM works — AeroLLM is a Rust/production successor but AirLLM's Python lane is still the only shipping option there.

AeroLLM vs. AirLLM

AirLLM proved the layer-streaming thesis; AeroLLM is the production rewrite. The Python package in this monorepo descends from the AirLLM fork, but the point of the Rust workspace is to strip out the Python hot path AirLLM was built on:

  • Single binary. The aerollm CLI is a Rust process. Python bindings exist but are optional and off the hot path — aerollm generate never loads an interpreter.
  • Deterministic lifecycle. Init → Loading → Running → ShuttingDown is enforced by a state machine and observable on the in-process bus. No "did the model load yet?" ambiguity.
  • Real prefetch. A double-buffered worker (aerollm-prefetch) reads layer N+1 from NVMe while the GPU is executing layer N. AirLLM loads each layer synchronously.
  • Sharded KV cache. aerollm-kv uses parking_lot::RwLock + Condvar sharded to the number of attention heads — single-writer contention on long contexts is gone.
  • Two-tier arena. Persistent weights and per-request ephemeral tensors live in separate bumpalo arenas, so request teardown is a single reset() call, not a heap walk.

AirLLM is fine on a research laptop. AeroLLM is where you go when you need the same trick to survive a long-running server.
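The "real prefetch" point can be sketched with a worker thread and two buffer slots: while the consumer executes layer N, the worker is free to load layer N+1, and no more than two layers are resident at once. This is a conceptual Python sketch, not the aerollm-prefetch implementation; `stream_layers`, `fake_load`, and the sleep calls are stand-ins, and error handling is omitted.

```python
import queue
import threading
import time

def stream_layers(load_layer, num_layers, buffers=2):
    """Yield (index, weights) in order while a worker loads ahead."""
    slots = threading.Semaphore(buffers)  # one permit per buffer
    ready = queue.Queue()

    def worker():
        for index in range(num_layers):
            slots.acquire()  # wait until a buffer is free
            ready.put((index, load_layer(index)))
        ready.put(None)  # sentinel: no more layers

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = ready.get()
        if item is None:
            return
        yield item
        slots.release()  # consumer is done with this layer's buffer

def fake_load(index):
    time.sleep(0.01)  # stand-in for an NVMe read
    return f"weights-{index}"

for index, weights in stream_layers(fake_load, num_layers=4):
    time.sleep(0.01)  # stand-in for GPU compute on this layer
    print(index, weights)
```

With `buffers=1` the same code degrades to the synchronous pattern: the worker cannot start loading the next layer until the current one has been consumed.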

Architecture

flowchart LR
    cli[aerollm-cli] --> rt[aerollm-runtime]
    rt --> be[aerollm-backend<br/>trait]
    be --> native[aerollm-backend-mlx-native<br/>in-process MLX · macOS]
    be --> shim[aerollm-backend-mlx<br/>subprocess · CI / legacy]
    be --> noop[NoOpBackend<br/>reference / Linux]
    rt --> spec[aerollm-speculative<br/>greedy + Leviathan-2022]
    rt --> bus[aerollm-bus<br/>lifecycle / telemetry]
    rt --> kv[aerollm-kv<br/>sharded cache]
    rt --> arena[aerollm-arena<br/>two-tier bumpalo]
    native --> prefetch[aerollm-prefetch<br/>double-buffered NVMe]
    native --> tok_q[aerollm-tokenizer-qwen2]
    native --> tok_l[aerollm-tokenizer-llama]

All crates live under crates/. The CLI is a thin dispatcher; aerollm-runtime drives the full lifecycle. aerollm-backend-mlx-native is the production lane — in-process Metal via mlx-rs with double-buffered layer streaming from NVMe.
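The Init → Loading → Running → ShuttingDown lifecycle that the runtime enforces can be pictured as a small state machine that rejects out-of-order transitions and notifies subscribers. The sketch below is a Python illustration only: the class borrows the LifecycleController name from the crate layout, but its methods, the strictly linear transition table, and the callback list standing in for the bus are all assumptions.

```python
from enum import Enum

class State(Enum):
    INIT = "Init"
    LOADING = "Loading"
    RUNNING = "Running"
    SHUTTING_DOWN = "ShuttingDown"

# Assumed transition table: each state has exactly one legal successor.
NEXT = {
    State.INIT: State.LOADING,
    State.LOADING: State.RUNNING,
    State.RUNNING: State.SHUTTING_DOWN,
}

class LifecycleController:
    def __init__(self):
        self.state = State.INIT
        self._listeners = []  # stand-in for the in-process bus

    def subscribe(self, callback):
        self._listeners.append(callback)

    def transition(self, new_state):
        if NEXT.get(self.state) is not new_state:
            raise RuntimeError(
                f"illegal transition {self.state.value} -> {new_state.value}")
        old_state, self.state = self.state, new_state
        for callback in self._listeners:
            callback(old_state, new_state)

ctl = LifecycleController()
events = []
ctl.subscribe(lambda old, new: events.append(f"{old.value}->{new.value}"))
ctl.transition(State.LOADING)
ctl.transition(State.RUNNING)
print(events)  # ['Init->Loading', 'Loading->Running']
try:
    ctl.transition(State.LOADING)
except RuntimeError as err:
    print(err)  # illegal transition Running -> Loading
```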

Quickstart

# 1. Clone + build
git clone https://github.com/cdarnell/aerollm
cd aerollm
cargo build --release

# 2. Download a model (HuggingFace Hub, one-time)
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct \
    --local-dir ~/models/Qwen2.5-0.5B-Instruct

# 3. Install the MLX runtime dependency (macOS, one-time)
pip install mlx-lm

# 4. Generate
./target/release/aerollm generate \
    --backend mlx-native \
    --model ~/models/Qwen2.5-0.5B-Instruct \
    --prompt "Write one sentence about Apple Silicon."

Expected: a streamed sentence about Apple Silicon on stdout.

Smoke test without MLX

The no-op backend works anywhere — handy for CI, Linux workstations, or just sanity-checking the wiring:

mkdir -p /tmp/aerollm-noop
./target/release/aerollm generate \
    --backend noop \
    --model /tmp/aerollm-noop \
    --prompt "hello"
# Apple Silicon uses a unified-memory architecture.

Layout

aerollm/                       # monorepo root
├── crates/                    # Rust workspace (production lane)
│   ├── aerollm-tensor/               # dtype, shape, tensor buffer
│   ├── aerollm-instrumentation/      # atomic counters + tracing
│   ├── aerollm-bus/                  # in-process event bus (lifecycle + telemetry)
│   ├── aerollm-arena/                # two-tier bumpalo arena (persistent + ephemeral)
│   ├── aerollm-kv/                   # sharded KV cache (parking_lot RwLock + Condvar)
│   ├── aerollm-prefetch/             # double-buffered weight prefetch worker
│   ├── aerollm-backend/              # Backend + StreamingBackend traits + NoOpBackend
│   ├── aerollm-backend-mlx/          # MLX subprocess shim (CI / --legacy-subprocess)
│   ├── aerollm-backend-mlx-native/   # in-process MLX via mlx-rs 0.25 (macOS, production)
│   ├── aerollm-backend-cuda/         # CUDA scaffold (community contribution target)
│   ├── aerollm-tokenizer-qwen2/      # Qwen2 BPE + ChatML template
│   ├── aerollm-tokenizer-llama/      # Llama 3.x BPE + chat template
│   ├── aerollm-speculative/          # spec-decode driver (greedy + Leviathan-2022)
│   ├── aerollm-runtime/              # AeroRuntime<B> orchestrator + LifecycleController
│   ├── aerollm-correctness/          # golden-fixture harness + perplexity regression suite
│   ├── aerollm-bench/                # criterion benches (KV / arena / generate / spec-decode)
│   ├── aerollm-cli/                  # `aerollm` binary
│   └── aerollm-api/                  # PyO3 bindings → wheel
├── python/
│   └── aerollm/               # Python package (ergonomics lane, optional)
│       ├── aerollm/mlx/       # adapter — delegates to aerollm_api
│       ├── pyproject.toml
│       └── tests/
├── docs/
│   ├── decisions/             # ADRs (0001 onward)
│   └── benchmarks/            # AirLLM comparison plan + bench recipes
├── fuzz/                      # cargo-fuzz targets (ChatML, safetensors, replay bundle)
├── tests/
│   ├── fixtures/              # Model weights (not checked in)
│   └── perplexity/            # 100-prompt perplexity regression fixture
├── api-baseline/              # cargo public-api snapshots (CI gate)
├── .github/workflows/         # CI: tests, clippy, fuzz, release, public-api
├── MILESTONES.md
├── MAINTAINERS.md
├── SECURITY.md
├── NOTICE
└── LICENSE                    # Apache-2.0

Architectural decisions

  • ADR 0001 — Crate prefix is aerollm-*; product name remains AeroLLM.
  • ADR 0002 — In-process message bus deferred to Phase 1 for lifecycle + telemetry fan-out.
  • ADR 0003 — Adopt mlx-rs in Phase 2 as a new aerollm-backend-mlx-native crate; keep the subprocess shim alongside it for CI and a --legacy-subprocess fallback.
  • ADR 0004 — 4-bit quantization strategy for the in-process MLX backend (mlx-community checkpoint format, group-wise affine quantization).
  • ADR 0005 — Speculative decoding is a first-class subsystem (aerollm-speculative crate, StreamingBackend spec-decode hooks, dual-backend driver).
  • ADR 0006 — Multi-backend strategy + CUDA support. Community contribution ladder; trait audit; crate skeleton sketch.
  • ADR 0007 — 1.0 GA scope: Apple Silicon, Qwen2/Qwen2.5 + Llama 3.x dense, 15 acceptance criteria, MoE benchmarking plan.
  • MLX integration — Phase 0 subprocess-shim decision (superseded for Phase 2+ by ADR 0003).

Development

cargo test --workspace
cargo clippy --workspace --all-targets -- -D warnings
cargo fmt --all -- --check

Rust toolchain pinned to 1.95.0 via rust-toolchain.toml.

Streaming protocol

Backends implement the StreamingBackend supertrait on top of Backend. The runtime drives generation a layer at a time via AeroRuntime::generate_streaming, with weights fed from the double-buffered PrefetchWorker:

tokenize(prompt) → prompt_tokens
begin_request(prompt_tokens) → state
loop until EOS or max_new_tokens:
    apply_pre_block(state)
    for layer in 0..num_layers:
        apply_block(state, weights[layer])    ← weights from PrefetchWorker
    logits = apply_post_block(state)
    next   = sample(logits)
    publish Telemetry::TokenEmitted
detokenize(emitted) → text

NoOpBackend implements both surfaces, so cargo test -p aerollm-runtime exercises the round-trip end-to-end without any model present.
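The protocol above can be exercised end-to-end with a toy backend. The sketch below keeps the hook names from the pseudocode (begin_request, apply_pre_block, apply_block, apply_post_block) but everything else is invented: ToyBackend, its four-symbol vocabulary, and the integer "weights" exist only to make the loop runnable, and greedy argmax stands in for the sampler.

```python
class ToyBackend:
    """Invented stand-in: hidden state is an int, each layer adds its weight."""
    num_layers = 3
    vocab = ["<eos>", "a", "b", "c"]

    def tokenize(self, prompt):
        return [self.vocab.index(ch) for ch in prompt]

    def detokenize(self, tokens):
        return "".join(self.vocab[t] for t in tokens)

    def begin_request(self, prompt_tokens):
        return {"tokens": list(prompt_tokens), "hidden": 0}

    def apply_pre_block(self, state):
        state["hidden"] = state["tokens"][-1]  # "embed" the latest token

    def apply_block(self, state, weights):
        state["hidden"] += weights  # one layer's contribution

    def apply_post_block(self, state):
        target = state["hidden"] % len(self.vocab)
        return [1.0 if i == target else 0.0 for i in range(len(self.vocab))]

def generate_streaming(backend, prompt, layer_weights, max_new_tokens=8, eos=0):
    state = backend.begin_request(backend.tokenize(prompt))
    emitted = []
    for _ in range(max_new_tokens):
        backend.apply_pre_block(state)
        for layer in range(backend.num_layers):
            # In the real runtime these weights arrive from the PrefetchWorker.
            backend.apply_block(state, layer_weights[layer])
        logits = backend.apply_post_block(state)
        next_token = max(range(len(logits)), key=logits.__getitem__)
        if next_token == eos:
            break
        emitted.append(next_token)
        state["tokens"].append(next_token)  # feed the token back in
    return backend.detokenize(emitted)

print(generate_streaming(ToyBackend(), "a", layer_weights=[1, 0, 0]))  # bc
```

The point of the shape is that the outer loop never holds more than one layer's weights at a time, so the backend can be swapped without changing how weights are fed.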

Benchmarks

Criterion suites live in crates/aerollm-bench:

cargo bench -p aerollm-bench --bench kv_cache   # KV r/w, 1/4/16 shards, concurrent writers
cargo bench -p aerollm-bench --bench arena      # PersistentArena + EphemeralArena scope cost
cargo bench -p aerollm-bench --bench generate   # AeroRuntime::generate through NoOpBackend

The generate bench isolates runtime overhead (lifecycle gate, request-id counter, ephemeral arena scope, bus publish) from model-execution cost — useful as a regression floor.
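The kv_cache bench varies the shard count because sharding is what lets concurrent writers avoid a single lock. A conceptual Python sketch of the idea follows. It swaps in a plain threading.Lock per shard (the standard library has no reader-writer lock) and shards by head index modulo the shard count; both choices, and the ShardedKV class itself, are illustrative assumptions rather than the aerollm-kv design.

```python
import threading

class ShardedKV:
    """Toy KV store: each shard has its own dict and its own lock."""

    def __init__(self, num_shards):
        self._shards = [({}, threading.Lock()) for _ in range(num_shards)]

    def _shard(self, head):
        return self._shards[head % len(self._shards)]

    def append(self, head, pos, kv):
        store, lock = self._shard(head)
        with lock:  # only writers hitting the same shard contend
            store[(head, pos)] = kv

    def get(self, head, pos):
        store, lock = self._shard(head)
        with lock:
            return store.get((head, pos))

    def __len__(self):
        return sum(len(store) for store, _ in self._shards)

kv = ShardedKV(num_shards=4)

def writer(head):
    for pos in range(100):
        kv.append(head, pos, (head, pos))

threads = [threading.Thread(target=writer, args=(h,)) for h in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(kv.get(5, 99))  # (5, 99)
print(len(kv))        # 800
```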

Embedding aeroLLM in another product

If you want to call aeroLLM in-process from another Python application ahead of the PyPI publish, the copy-pasteable recipe lives in docs/integration-guide.md. It covers building the wheel with maturin develop, the thread-affine Runtime invariant (the non-obvious bit — MLX's Metal context is unsendable), ChatML wrapping, ring eviction, env-var conventions, and a 5-second smoke test.
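One common way to honor a thread-affine invariant is to create the runtime on a dedicated thread and route every call to it through a queue. The sketch below shows that pattern with a stand-in callable; it does not use the real aerollm_api, whose Runtime constructor and methods are not shown here. RuntimeThread and make_fake_runtime are hypothetical names.

```python
import queue
import threading
from concurrent.futures import Future

class RuntimeThread:
    """Owns a runtime on one thread; callers on any thread submit jobs."""

    def __init__(self, make_runtime):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(
            target=self._run, args=(make_runtime,), daemon=True)
        self._thread.start()

    def _run(self, make_runtime):
        runtime = make_runtime()  # constructed on the owning thread
        while True:
            job = self._jobs.get()
            if job is None:
                break
            future, prompt = job
            try:
                future.set_result(runtime(prompt))
            except Exception as exc:
                future.set_exception(exc)

    def generate(self, prompt):
        future = Future()
        self._jobs.put((future, prompt))
        return future.result()  # block until the owning thread answers

    def close(self):
        self._jobs.put(None)
        self._thread.join()

def make_fake_runtime():
    # Stand-in for a real runtime: reports which thread executed the call.
    return lambda prompt: (threading.get_ident(), prompt.upper())

owner = RuntimeThread(make_fake_runtime)
first = owner.generate("hello")
second = owner.generate("world")
print(first[1], second[1])                # HELLO WORLD
print(first[0] == second[0])              # True: same owning thread
print(first[0] != threading.get_ident())  # True: not the caller's thread
owner.close()
```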

Before deploying, size the box with docs/capacity-sizing.md — formulas for weights / KV / prefill peak / ring_depth against the two measured anchors (70B-4bit and 405B-4bit on 24 GB), a model-by-model reference table, an extrapolation rule when you don't have a measurement, and a decision rule for when to enable ring_depth.
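As a rough companion to that guide, the standard KV-cache estimate is 2 (keys and values) × layers × KV heads × head dimension × bytes per element, per token. The sketch below applies it to the published Llama-3.1-8B configuration (32 layers, 8 KV heads, head dimension 128) in bf16. It is the textbook formula, not necessarily the exact one in docs/capacity-sizing.md, and it ignores any runtime overhead.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache added per token (keys + values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def kv_gib(context_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GiB for a full context window."""
    per_token = kv_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bytes_per_elem)
    return context_len * per_token / 2**30

print(kv_bytes_per_token(32, 8, 128))  # 131072 bytes (128 KiB) per token
print(kv_gib(8192, 32, 8, 128))        # 1.0 GiB at an 8192-token context
```

Unlike streamed weights, the KV cache stays resident for the whole request, so it sets a floor on memory that grows linearly with context length.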

Contributing

AeroLLM is Apache-2.0 and open to PRs. The highest-leverage contribution paths, in rough priority:

  1. CUDA backend port (aerollm-backend-cuda). Start from ADR 0006 — it has a crate skeleton and a phased plan. aerollm-backend-mlx-native is the reference implementation; bit-identical output parity (modulo f32 tolerance) against that backend is the acceptance gate.
  2. Mistral architecture port on the existing MLX backend. The main delta from Qwen2/Llama3 is sliding-window attention. Follow the crates/aerollm-backend-mlx-native/src/llama3.rs pattern; dispatch via model_type in MlxNativeBackend::generate_macos.
  3. Spec-decode benchmarks for more draft/target pairs. spec_decode_acceptance currently measures Qwen2.5-0.5B draft + 7B target; the community needs data on which pairs clear the 0.6 acceptance-rate threshold.
  4. Tiny-model fixtures for CI. A ~20 MB test checkpoint that lets PRs run the full streaming path without a 4 GB download.
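For contributors picking up item 3, the sketch below shows how an acceptance rate can be computed for greedy speculative decoding: in each round the draft proposes a run of tokens, and the target accepts the longest prefix that matches its own greedy choices. The definition used here (accepted draft tokens divided by proposed draft tokens) and the function names are assumptions; the spec_decode_acceptance bench may define the metric differently.

```python
def accepted_prefix_len(draft_tokens, target_tokens):
    """Number of leading draft tokens that match the target's greedy picks."""
    count = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break
        count += 1
    return count

def acceptance_rate(rounds):
    """rounds is a list of (draft_tokens, target_tokens) pairs."""
    proposed = sum(len(draft) for draft, _ in rounds)
    accepted = sum(accepted_prefix_len(d, t) for d, t in rounds)
    return accepted / proposed if proposed else 0.0

rounds = [
    ([1, 2, 3, 4], [1, 2, 3, 4]),  # all four accepted
    ([5, 6, 7, 8], [5, 6, 9, 8]),  # first mismatch at position 2: two accepted
    ([1, 1, 1, 1], [2, 1, 1, 1]),  # mismatch at position 0: none accepted
]
rate = acceptance_rate(rounds)
print(rate)         # 0.5
print(rate >= 0.6)  # False: this made-up pair would miss the threshold
```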

Before starting, run cargo test --workspace and cargo clippy --workspace --all-targets -- -D warnings locally so you know the baseline is green. CI gates on both.

Ideas that aren't ready yet:

  • Custom CUDA kernels beyond what cudarc/tch provide — wait until the CUDA scaffold baseline is stable
  • ROCm — follows the CUDA pattern; queue it behind the CUDA port
  • Windows — deferred; WSL2 + CUDA should work if someone wants to smoke-test

Open an issue or start a draft PR to discuss scope before a large change.

License

Apache-2.0. See LICENSE and NOTICE.
