Halo

An NVMe → host-memory layer streamer for LLMs larger than VRAM.

Halo is a focused Rust library with a C ABI. It memory-maps a GGUF file, hands zero-copy tensor views to a downstream inference runtime (llama.cpp, Ollama, …), and issues OS-level prefetch and residency hints so that the next transformer block's weights are already warm when compute reaches it.

Halo is not an inference engine. It does not implement a tokenizer, a forward pass, a sampler, or GPU compute kernels. Those are what llama.cpp and friends do well; Halo plugs into them.
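
The mmap-plus-hints approach described above can be illustrated with plain POSIX calls. This is a minimal sketch of the general OS mechanism, not Halo's actual implementation: the `file_map` type and the `file_map_*` function names are hypothetical and do not appear in `halo.h`, and the choice of `madvise(MADV_WILLNEED)` as the prefetch hint is an assumption.

```c
#define _DEFAULT_SOURCE          /* exposes madvise() on glibc in strict modes */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* A read-only mapping of a whole file. Hypothetical type, not from halo.h. */
typedef struct {
    const uint8_t *base;
    size_t len;
} file_map;

/* Map `path` read-only. Returns 0 on success, -1 on failure. */
static int file_map_open(const char *path, file_map *out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return -1; }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                   /* the mapping stays valid after close */
    if (p == MAP_FAILED) return -1;

    out->base = p;
    out->len = (size_t)st.st_size;
    return 0;
}

/* Ask the kernel to start reading [off, off + len) into the page cache.
   madvise needs a page-aligned start address, so round down first. */
static int file_map_prefetch(const file_map *m, size_t off, size_t len) {
    if (off > m->len || len > m->len - off) return -1;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = off & ~(page - 1);
    return madvise((void *)(m->base + start), (off - start) + len, MADV_WILLNEED);
}

/* Unmap the file and reset the handle. */
static void file_map_close(file_map *m) {
    munmap((void *)m->base, m->len);
    m->base = NULL;
    m->len = 0;
}
```

A tensor view in this scheme is just `m.base + offset` plus a length, so no bytes are copied until the consumer reads them. The prefetch call is only a hint: the kernel may ignore it under memory pressure.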

Why

Modern NVMe drives read at 5-7 GB/s. A 70B Q4 model is ~40 GB. It does not fit in 12 GB of VRAM, but you can stream one transformer block at a time: roughly 500 MB for a 70B model, which takes about 70-100 ms from a cold cache at those speeds and < 10 ms once the OS read-ahead is working for you. The existing tools already handle CPU-RAM ↔ VRAM offloading well; Halo's value is in cleanly managing the NVMe → RAM half so that the downstream runtime never waits on a layer.
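
The per-block figures follow from simple division. The sketch below works through that arithmetic, assuming 80 transformer blocks (typical of Llama-style 70B models; the block count is not stated above) and treating the 50 GB/s page-cache rate as an illustrative value. The function names are made up for this example.

```c
#include <stdio.h>

/* Milliseconds needed to read `bytes` at a sustained `gb_per_s` (decimal GB/s). */
static double read_ms(double bytes, double gb_per_s) {
    return bytes / (gb_per_s * 1e9) * 1e3;
}

/* Average bytes per transformer block, ignoring embeddings and the output head. */
static double bytes_per_layer(double model_bytes, unsigned n_layers) {
    return model_bytes / n_layers;
}

/* Print the estimate for an assumed 40 GB model with 80 blocks. */
static void print_estimate(void) {
    double layer = bytes_per_layer(40e9, 80);
    printf("layer size : %.0f MB\n", layer / 1e6);
    printf("at 5 GB/s  : %.0f ms\n", read_ms(layer, 5.0));
    printf("at 7 GB/s  : %.0f ms\n", read_ms(layer, 7.0));
    printf("at 50 GB/s : %.0f ms (assumed page-cache rate)\n", read_ms(layer, 50.0));
}
```

With these inputs a block is 500 MB, which takes 100 ms at 5 GB/s and about 71 ms at 7 GB/s. Getting under 10 ms requires roughly 50 GB/s, which only the page cache can deliver.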

What ships

Crate	Purpose
`halo-core`	GGUF parser (zero-copy mmap), error types, quantization block format
`halo-io`	`MmapReader`, `DirectStorageReader` stub (future NVMe → VRAM bypass)
`halo-gpu`	Minimal wgpu context + `VramBuffer` (substrate for future tier-3 GPU interop)
`halo-runtime`	`LayerStreamer`: budget-aware LRU cache for layer-sized tensor sets
`halo-ffi`	C ABI (`libhalo_ffi`): `halo_open`, `halo_ensure_tensor`, `halo_prefetch_layer`, `halo_release_tensor`, `halo_get_stats`
`halo-bench`	Standalone benchmark; drives the C ABI, reports GB/s + p50/p99 per-layer latency

The C header lives at `crates/halo-ffi/include/halo.h`.
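
To make "budget-aware LRU cache" concrete, here is a small sketch of the idea in C: each layer-sized tensor set has a byte size, and admitting a new one evicts the least recently used entries until the total fits the budget. This is not the `LayerStreamer` code (which lives in the `halo-runtime` crate); every name below, the fixed slot count, and the logical-clock bookkeeping are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_RESIDENT 64          /* illustrative cap on tracked layers */

typedef struct {
    uint32_t layer;              /* layer index */
    size_t   bytes;              /* size of the layer's tensor set */
    uint64_t last_used;          /* logical clock value of the latest touch */
    int      in_use;             /* non-zero when the slot is occupied */
} lru_slot;

typedef struct {
    lru_slot slots[MAX_RESIDENT];
    size_t   budget;             /* maximum resident bytes */
    size_t   resident;           /* bytes currently resident */
    uint64_t clock;              /* incremented on every touch */
} lru_cache;

static void lru_init(lru_cache *c, size_t budget) {
    *c = (lru_cache){ .budget = budget };
}

/* Evict the least recently used slot. Returns 0 if the cache was empty. */
static int lru_evict_one(lru_cache *c) {
    lru_slot *victim = NULL;
    for (int i = 0; i < MAX_RESIDENT; i++) {
        lru_slot *s = &c->slots[i];
        if (s->in_use && (!victim || s->last_used < victim->last_used)) victim = s;
    }
    if (!victim) return 0;
    c->resident -= victim->bytes;
    victim->in_use = 0;          /* a real streamer would also drop the pages */
    return 1;
}

static int lru_contains(const lru_cache *c, uint32_t layer) {
    for (int i = 0; i < MAX_RESIDENT; i++)
        if (c->slots[i].in_use && c->slots[i].layer == layer) return 1;
    return 0;
}

/* Mark `layer` as used, admitting it if absent. Returns 1 on a hit,
   0 on a miss that was admitted, -1 if the layer can never fit. */
static int lru_touch(lru_cache *c, uint32_t layer, size_t bytes) {
    for (int i = 0; i < MAX_RESIDENT; i++) {
        if (c->slots[i].in_use && c->slots[i].layer == layer) {
            c->slots[i].last_used = ++c->clock;
            return 1;
        }
    }
    if (bytes > c->budget) return -1;
    while (c->resident + bytes > c->budget)      /* make room by byte budget */
        if (!lru_evict_one(c)) return -1;
    for (;;) {
        for (int i = 0; i < MAX_RESIDENT; i++) {
            if (!c->slots[i].in_use) {
                c->slots[i] = (lru_slot){ layer, bytes, ++c->clock, 1 };
                c->resident += bytes;
                return 0;
            }
        }
        if (!lru_evict_one(c)) return -1;        /* all slots full: free one */
    }
}
```

The linear scans keep the sketch short; with at most a few dozen resident layers they cost little next to the I/O they manage.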

Quick start

# Build the static + shared library
cargo build --release -p halo-ffi

# Run the benchmark on any GGUF model
cargo run --release -p halo-bench -- --model /path/to/model.gguf --iterations 5 --prefetch

Example output (exact numbers depend on the drive and the model):

── halo-ffi 0.1.0 ──
model      : /mnt/nvme/qwen3-7b-q4_0.gguf
open       : 3.41 ms
layers     : 28
prefetch   : true
warmup     : 0 passes
iterations : 5

pass 1/5: 912.3 ms   4.63 GB/s   196 tensors
pass 2/5:  84.7 ms   49.8 GB/s   196 tensors   ← OS page cache warm
...

per-layer µs   : p50=340.8  p99=2104.5

The cold-cache pass (pass 1) measures NVMe read throughput; the warm passes measure how well the OS page cache holds up under `halo_prefetch_layer` hints.
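
For readers who want to reproduce the reported figures from their own timings, the sketch below shows one common way to derive GB/s and p50/p99 from raw samples. It uses the nearest-rank percentile definition, which is an assumption: the method `halo-bench` actually uses is not documented here, and the function names are illustrative.

```c
#include <stddef.h>
#include <stdlib.h>

/* Ascending comparison for qsort. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sort latency samples in ascending order, in place. */
static void sort_samples(double *samples, size_t n) {
    qsort(samples, n, sizeof *samples, cmp_double);
}

/* Nearest-rank percentile of an ascending-sorted array, 0 < pct <= 100.
   Returns the smallest sample such that at least pct% of samples are <= it. */
static double percentile_sorted(const double *sorted, size_t n, double pct) {
    if (n == 0) return 0.0;
    double r = pct / 100.0 * (double)n;
    size_t rank = (size_t)r;
    if ((double)rank < r) rank++;        /* ceiling without needing libm */
    if (rank < 1) rank = 1;
    if (rank > n) rank = n;
    return sorted[rank - 1];
}

/* Throughput in decimal GB/s for `bytes` read in `ms` milliseconds. */
static double gb_per_s(double bytes, double ms) {
    return (bytes / 1e9) / (ms / 1e3);
}
```

As a consistency check on the sample run above, 4.63 GB/s over 912.3 ms implies about 4.2 GB read per pass, and the warm pass (49.8 GB/s over 84.7 ms) works out to the same volume.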

Integration sketch (C)

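#include <stdint.h>
#include <stdio.h>
#include "halo.h"

// n_layers, device_ptr, and stream come from your runtime.
halo_config cfg = { .n_prefetch = 2 };
halo_ctx* ctx;
halo_open("/path/to/model.gguf", &cfg, &ctx);   // check the result in real code

for (uint32_t L = 0; L < n_layers; L++) {
    if (L + 1 < n_layers)
        halo_prefetch_layer(ctx, L + 1);   // warm the next block's pages

    // Build the tensor name for the current block instead of hardcoding one.
    char name[64];
    snprintf(name, sizeof name, "blk.%u.attn_q.weight", (unsigned)L);

    halo_tensor_view v;
    halo_ensure_tensor(ctx, name, &v);
    // v.data is a pointer directly into the mmap'd file.
    // Copy to your GPU however your runtime already does:
    cudaMemcpyAsync(device_ptr, v.data, v.size, cudaMemcpyHostToDevice, stream);
    // Once the copy has completed, hand the view back with halo_release_tensor
    // (see halo.h for its signature).
}

halo_close(ctx);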

Roadmap

Tier	What	Status
1	mmap + OS prefetch hints	✅ this release
2	Pinned-host staging ring for DMA-friendly uploads	planned
3	Windows DirectStorage / Linux cuFile (GDS) NVMe → VRAM bypass	planned
—	llama.cpp fork exposing `--halo-stream` flag	planned

History

Halo originally tried to be a full inference engine (tokenizer, transformer, WGSL compute kernels, the works) built on the same zero-copy GGUF substrate. That code worked (see archive/ for the kernels and the speculative decoder skeleton), but duplicated llama.cpp for no user-visible win. The project was refocused on the one piece that has no good existing solution: the NVMe-side streamer. The archived compute code remains usable as a reference or as seed material for future experiments.

License

Apache-2.0.
