Skip to content

infinition/halo

Repository files navigation

Halo

License Rust C Buy Me a Coffee

An NVMe → host-memory layer streamer for LLMs larger than VRAM.

Halo is a focused C library. It memory-maps a GGUF file, hands zero-copy tensor views to a downstream inference runtime (llama.cpp, Ollama, …), and issues OS-level prefetch / residency hints so the next transformer block's weights are already warm when compute reaches it.

Halo is not an inference engine. It does not implement a tokenizer, a forward pass, a sampler, or GPU compute kernels. Those are what llama.cpp and friends do well; Halo plugs into them.

Why

Modern NVMe drives read 5-7 GB/s. A 70B Q4 model is ~40 GB. You can't fit it in 12 GB of VRAM, but you can stream one transformer block at a time, roughly 500 MB for a 70B, so < 100 ms from cold cache, < 10 ms once the OS read-ahead is working for you. The existing tools do good CPU-RAM ↔ VRAM offloading; Halo's value is cleanly managing the NVMe → RAM half so that the downstream runtime never waits on a layer.

What ships

Crate Purpose
halo-core GGUF parser (zero-copy mmap), error types, quantization block format
halo-io MmapReader, DirectStorageReader stub (future NVMe → VRAM bypass)
halo-gpu Minimal wgpu context + VramBuffer (substrate for future tier-3 GPU interop)
halo-runtime LayerStreamer: budget-aware LRU cache for layer-sized tensor sets
halo-ffi C ABI (libhalo_ffi): halo_open, halo_ensure_tensor, halo_prefetch_layer, halo_release_tensor, halo_get_stats
halo-bench Standalone benchmark; drives the C ABI, reports GB/s + p50/p99 per-layer latency

C header lives at crates/halo-ffi/include/halo.h.

Quick start

# Build the static + shared library
cargo build --release -p halo-ffi

# Run the benchmark on any GGUF model
cargo run --release -p halo-bench -- --model /path/to/model.gguf --iterations 5 --prefetch

Expected output:

── halo-ffi 0.1.0 ──
model      : /mnt/nvme/qwen3-7b-q4_0.gguf
open       : 3.41 ms
layers     : 28
prefetch   : true
warmup     : 0 passes
iterations : 5

pass 1/5: 912.3 ms   4.63 GB/s   196 tensors
pass 2/5:  84.7 ms   49.8 GB/s   196 tensors   ← OS page cache warm
...

per-layer µs   : p50=340.8  p99=2104.5

The cold-cache pass measures NVMe; warm passes measure how well the OS page cache holds up under halo_prefetch_layer hints.

Integration sketch (C)

#include "halo.h"

halo_config cfg = { .n_prefetch = 2 };
halo_ctx* ctx;
halo_open("/path/to/model.gguf", &cfg, &ctx);

for (uint32_t L = 0; L < n_layers; L++) {
    halo_prefetch_layer(ctx, L + 1);   // warm the next block's pages

    halo_tensor_view v;
    halo_ensure_tensor(ctx, "blk.7.attn_q.weight", &v);
    // v.data is a pointer directly into the mmap'd file.
    // Copy to your GPU however your runtime already does:
    cudaMemcpyAsync(device_ptr, v.data, v.size, cudaMemcpyHostToDevice, stream);
}

halo_close(ctx);

Roadmap

Tier What Status
1 mmap + OS prefetch hints ✅ this release
2 Pinned-host staging ring for DMA-friendly uploads planned
3 Windows DirectStorage / Linux cuFile (GDS) NVMe → VRAM bypass planned
llama.cpp fork exposing --halo-stream flag planned

History

Halo originally tried to be a full inference engine (tokenizer, transformer, WGSL compute kernels, the works) built on the same zero-copy GGUF substrate. That code worked (see archive/ for the kernels and the speculative decoder skeleton), but duplicated llama.cpp for no user-visible win. The project was refocused on the one piece that has no good existing solution: the NVMe-side streamer. The archived compute code remains usable as a reference or as seed material for future experiments.

Star History

Star History Chart

License

Apache-2.0.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Sponsor this project

Packages

 
 
 

Contributors