An NVMe → host-memory layer streamer for LLMs larger than VRAM.
Halo is a focused C library. It memory-maps a GGUF file, hands zero-copy tensor views to a downstream inference runtime (llama.cpp, Ollama, …), and issues OS-level prefetch / residency hints so the next transformer block's weights are already warm when compute reaches it.
Halo is not an inference engine. It does not implement a tokenizer, a forward pass, a sampler, or GPU compute kernels. Those are what llama.cpp and friends do well; Halo plugs into them.
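The mmap-plus-hint mechanism described above can be sketched with plain POSIX calls. This is a minimal illustration, not Halo's actual implementation: the helper names `map_file_readonly` and `hint_will_need` are hypothetical and do not appear in `halo.h`, and the choice of `madvise(MADV_WILLNEED)` as the residency hint is an assumption about one reasonable way to do it on Linux-like systems.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not part of halo.h): map a whole file read-only.
 * Returns NULL on failure; on success *len_out holds the file size. */
const uint8_t *map_file_readonly(const char *path, size_t *len_out) {
    assert(path != NULL && len_out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return NULL; }
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED) { perror("mmap"); return NULL; }
    *len_out = (size_t)st.st_size;
    return base;
}

/* Hypothetical helper: ask the kernel to start reading [offset, offset+size)
 * into the page cache. madvise needs a page-aligned start address, so the
 * range is widened down to the enclosing page boundary.
 * Returns 0 on success, -1 on a bad range or a failed madvise call. */
int hint_will_need(const uint8_t *base, size_t map_len, size_t offset, size_t size) {
    if (offset >= map_len || size == 0) return -1;
    if (size > map_len - offset) size = map_len - offset; /* clamp to mapping */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = offset - (offset % page);
    return madvise((void *)(base + start), (offset - start) + size, MADV_WILLNEED);
}
```

A tensor view in this scheme is just `base + offset` plus a length, which is why no copy is needed. The hint is advisory: the kernel may ignore it under memory pressure, so a later access can still page-fault.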
Modern NVMe drives read 5-7 GB/s. A 70B Q4 model is ~40 GB. It won't fit in 12 GB of VRAM, but you can stream it one transformer block at a time. A block is roughly 500 MB for a 70B model, which at those speeds takes on the order of 100 ms or less to read from a cold cache, and under 10 ms once OS read-ahead is working for you. The existing tools already do good CPU-RAM ↔ VRAM offloading; Halo's value is cleanly managing the NVMe → RAM half so that the downstream runtime never waits on a layer.
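The arithmetic behind those figures is easy to check. The helper below is a hypothetical illustration (it is not part of Halo) and assumes decimal units, 1 GB = 10^9 bytes, with sustained sequential throughput and no other overhead.

```c
#include <assert.h>

/* Milliseconds needed to read `bytes` at a sustained `gb_per_s`
 * (decimal gigabytes per second). Ignores seek and syscall overhead. */
double read_time_ms(double bytes, double gb_per_s) {
    assert(gb_per_s > 0.0);
    return bytes / (gb_per_s * 1e9) * 1e3;
}
```

With a 500 MB block, `read_time_ms(500e6, 5.0)` is about 100 ms and `read_time_ms(500e6, 7.0)` is about 71 ms. Reading the whole ~40 GB file once at 5 GB/s takes about 8 seconds, which is why overlapping the next block's read with the current block's compute matters.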
| Crate | Purpose |
|---|---|
| halo-core | GGUF parser (zero-copy mmap), error types, quantization block format |
| halo-io | MmapReader, DirectStorageReader stub (future NVMe → VRAM bypass) |
| halo-gpu | Minimal wgpu context + VramBuffer (substrate for future tier-3 GPU interop) |
| halo-runtime | LayerStreamer: budget-aware LRU cache for layer-sized tensor sets |
| halo-ffi | C ABI (libhalo_ffi): halo_open, halo_ensure_tensor, halo_prefetch_layer, halo_release_tensor, halo_get_stats |
| halo-bench | Standalone benchmark; drives the C ABI, reports GB/s + p50/p99 per-layer latency |
C header lives at crates/halo-ffi/include/halo.h.
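To make "budget-aware LRU cache" concrete, here is a minimal sketch of the idea in C. It is not the LayerStreamer code: the type and function names (`layer_lru`, `lru_touch`, and so on), the fixed-size array, and the linear scan are all assumptions chosen for brevity. It tracks only bookkeeping (which layers count as resident and how many bytes they occupy), not the tensor data itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RESIDENT 64 /* illustrative cap on tracked layers */

typedef struct {
    uint32_t layer;    /* layer index */
    size_t   bytes;    /* bytes occupied by this layer's tensors */
    uint64_t last_use; /* logical clock value at the most recent touch */
} lru_entry;

typedef struct {
    lru_entry entries[MAX_RESIDENT];
    size_t    count;
    size_t    used_bytes;
    size_t    budget_bytes;
    uint64_t  clock;
} layer_lru;

void lru_init(layer_lru *c, size_t budget_bytes) {
    c->count = 0;
    c->used_bytes = 0;
    c->budget_bytes = budget_bytes;
    c->clock = 0;
}

/* Returns 1 if the layer is currently tracked as resident, else 0. */
int lru_contains(const layer_lru *c, uint32_t layer) {
    for (size_t i = 0; i < c->count; i++)
        if (c->entries[i].layer == layer) return 1;
    return 0;
}

/* Drop the entry with the oldest last_use stamp. */
static void lru_evict_one(layer_lru *c) {
    assert(c->count > 0);
    size_t victim = 0;
    for (size_t i = 1; i < c->count; i++)
        if (c->entries[i].last_use < c->entries[victim].last_use) victim = i;
    c->used_bytes -= c->entries[victim].bytes;
    c->entries[victim] = c->entries[--c->count]; /* fill the hole with the last entry */
}

/* Touch a layer. Returns 1 on a hit, 0 when the layer was inserted
 * (evicting older layers as needed), -1 if it exceeds the whole budget. */
int lru_touch(layer_lru *c, uint32_t layer, size_t bytes) {
    c->clock++;
    for (size_t i = 0; i < c->count; i++) {
        if (c->entries[i].layer == layer) {
            c->entries[i].last_use = c->clock;
            return 1;
        }
    }
    if (bytes > c->budget_bytes) return -1;
    while (c->count == MAX_RESIDENT || c->used_bytes + bytes > c->budget_bytes)
        lru_evict_one(c);
    c->entries[c->count++] = (lru_entry){ layer, bytes, c->clock };
    c->used_bytes += bytes;
    return 0;
}
```

The budget is expressed in bytes rather than entry count because layers need not be the same size; eviction continues until the incoming layer fits. A production cache would likely replace the linear scans with a hash map plus an ordered list.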
```sh
# Build the static + shared library
cargo build --release -p halo-ffi

# Run the benchmark on any GGUF model
cargo run --release -p halo-bench -- --model /path/to/model.gguf --iterations 5 --prefetch
```

Expected output:
```text
── halo-ffi 0.1.0 ──
model      : /mnt/nvme/qwen3-7b-q4_0.gguf
open       : 3.41 ms
layers     : 28
prefetch   : true
warmup     : 0 passes
iterations : 5
pass 1/5: 912.3 ms 4.63 GB/s 196 tensors
pass 2/5: 84.7 ms 49.8 GB/s 196 tensors ← OS page cache warm
...
per-layer µs : p50=340.8 p99=2104.5
```
The cold-cache pass measures NVMe; warm passes measure how well the OS page
cache holds up under halo_prefetch_layer hints.
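For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute them. It is an assumption, not a description of halo-bench internals: the function names are hypothetical, the percentile uses the nearest-rank definition, and GB means 10^9 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile for an integer percent in 1..100.
 * Sorts `samples` in place, then returns the value at rank ceil(p*n/100). */
double percentile_nearest_rank(double *samples, size_t n, unsigned percent) {
    assert(samples != NULL && n > 0 && percent >= 1 && percent <= 100);
    qsort(samples, n, sizeof samples[0], cmp_double);
    size_t rank = ((size_t)percent * n + 99) / 100; /* integer ceiling */
    return samples[rank - 1];
}

/* Throughput in decimal GB/s for `bytes` moved in `elapsed_ms`. */
double throughput_gb_per_s(double bytes, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return (bytes / 1e9) / (elapsed_ms / 1e3);
}
```

p50 describes the typical layer, while p99 exposes the occasional layer that missed the page cache and had to wait on the drive. A wide gap between the two, as in the sample output, suggests that a minority of layer fetches still stall.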
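```c
#include <stdint.h>
#include <stdio.h>
#include "halo.h"

halo_config cfg = { .n_prefetch = 2 };
halo_ctx* ctx;
halo_open("/path/to/model.gguf", &cfg, &ctx);

// n_layers, device_ptr, and stream come from your runtime.
for (uint32_t L = 0; L < n_layers; L++) {
    if (L + 1 < n_layers)
        halo_prefetch_layer(ctx, L + 1);  // warm the next block's pages

    // Build this layer's tensor name, e.g. "blk.7.attn_q.weight" for L = 7.
    char name[64];
    snprintf(name, sizeof name, "blk.%u.attn_q.weight", (unsigned)L);

    halo_tensor_view v;
    halo_ensure_tensor(ctx, name, &v);
    // v.data is a pointer directly into the mmap'd file.
    // Copy to your GPU however your runtime already does:
    cudaMemcpyAsync(device_ptr, v.data, v.size, cudaMemcpyHostToDevice, stream);
}
halo_close(ctx);
```

| Tier | What | Status |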
|---|---|---|
| 1 | mmap + OS prefetch hints | ✅ this release |
| 2 | Pinned-host staging ring for DMA-friendly uploads | planned |
| 3 | Windows DirectStorage / Linux cuFile (GDS) NVMe → VRAM bypass | planned |
| — | llama.cpp fork exposing --halo-stream flag | planned |
Halo originally tried to be a full inference engine (tokenizer, transformer,
WGSL compute kernels, the works) built on the same zero-copy GGUF substrate.
That code worked (see archive/ for the kernels and the speculative decoder
skeleton), but duplicated llama.cpp for no user-visible win. The project was
refocused on the one piece that has no good existing solution: the NVMe-side
streamer. The archived compute code remains usable as a reference or as seed
material for future experiments.
Apache-2.0.