An LLM inference engine written in pure Rust, designed to run large models on hardware that would normally refuse them.
The core bet is that at any given token generation step, roughly 90% of neurons produce near-zero values. Standard inference ignores this and computes everything. Vortex does not.
A 70B-parameter model needs around 35 GB of memory at Q4 quantization (70 billion weights at 4 bits each). In practice that means a $2,000+ GPU. Most of that compute is wasted on activations that contribute almost nothing to the output.
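
The 35 GB figure is plain arithmetic, and a small helper makes it easy to check other sizes. This is a sketch: the function name `weight_footprint_gb` is hypothetical, and it counts raw weights only, ignoring the KV cache, activations, and per-block quantization metadata.

```rust
/// Approximate weight memory in gigabytes (10^9 bytes) for a model with
/// `params` parameters stored at `bits_per_weight` bits each.
/// Hypothetical helper: counts raw weights only, no KV cache or metadata.
fn weight_footprint_gb(params: f64, bits_per_weight: f64) -> f64 {
    // bits -> bytes -> gigabytes
    params * bits_per_weight / 8.0 / 1e9
}
```

Calling `weight_footprint_gb(70e9, 4.0)` gives 35.0, and the same model at 16 bits per weight comes to 140.0.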
Model weights are never fully loaded into RAM. Instead, they are read from an NVMe SSD layer by layer via mmap, used for computation, then discarded, the same way you stream a video rather than downloading it first.
- Intelligent prefetch tuned to transformer access patterns
- io_uring async I/O on Linux
- Zero-copy weight access
- Practical outcome: 70B model on an 8 GB machine
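
The read, compute, discard loop can be sketched with the standard library alone. This is an illustration under stated assumptions, not Vortex's reader: it uses buffered sequential reads instead of mmap or io_uring to stay dependency-free, it assumes every layer has the same byte size, and the name `stream_layers` is hypothetical.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Reads a file in fixed-size "layers" and calls `visit` on each one.
/// One buffer is reused, so peak memory is a single layer, not the whole file.
/// Returns the number of complete layers visited.
/// Hypothetical sketch: plain reads stand in for mmap / io_uring.
fn stream_layers<F>(path: &Path, layer_bytes: usize, mut visit: F) -> io::Result<usize>
where
    F: FnMut(usize, &[u8]),
{
    if layer_bytes == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "layer size must be non-zero",
        ));
    }
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; layer_bytes];
    let mut index = 0;
    loop {
        match file.read_exact(&mut buf) {
            Ok(()) => {
                // "Compute" on this layer; the buffer is overwritten next round.
                visit(index, &buf);
                index += 1;
            }
            // End of file (or a trailing partial layer) ends the stream.
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
    }
    Ok(index)
}
```

Because the buffer is overwritten on each iteration, anything the caller wants to keep from a layer has to be copied out inside `visit`.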
Before the expensive matrix multiplication, a lightweight scout network predicts which neurons will activate. Everything else is skipped.
- Pre-computed activation bloom filters per layer
- Tiny MLP scout for runtime sparsity prediction
- Structured pruning at inference time
- Practical outcome: a 70B model's compute cost moves closer to that of a 7B model
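
The skipping step can be illustrated as a matrix-vector product restricted to the rows the predictor selects. This is a minimal sketch: `select_active` and `sparse_matvec` are hypothetical names, and the scout scores are taken as given rather than produced by a real MLP or bloom filter.

```rust
/// Turns per-neuron scout scores into a list of neuron indices to compute.
/// Hypothetical: the scores stand in for whatever the scout network outputs.
fn select_active(scores: &[f32], threshold: f32) -> Vec<usize> {
    scores
        .iter()
        .enumerate()
        .filter(|(_, &s)| s >= threshold)
        .map(|(i, _)| i)
        .collect()
}

/// Computes y = W x only for the rows listed in `active`.
/// Skipped rows stay at 0.0, approximating neurons predicted to be near zero.
fn sparse_matvec(weights: &[Vec<f32>], x: &[f32], active: &[usize]) -> Vec<f32> {
    let mut y = vec![0.0f32; weights.len()];
    for &row in active {
        y[row] = weights[row]
            .iter()
            .zip(x)
            .map(|(w, xi)| w * xi)
            .sum::<f32>();
    }
    y
}
```

The work done is proportional to `active.len()` rather than the full row count, which is where the savings would come from if the 90% sparsity estimate holds.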
Weight matrices are stored as superpositions of basis functions using DCT and wavelet decompositions. Like a hologram: cut it in half and you still see the full picture, just at lower resolution.
- Progressive decoding: more coefficients loaded at runtime means higher quality
- A single file adapts to whatever RAM is available
- Practical outcome: 500 MB file instead of 4 GB, quality scales with hardware
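
Progressive decoding can be shown on a single row of weights with an orthonormal DCT: reconstructing from more leading coefficients gives a closer approximation. This is a sketch under stated assumptions, a naive O(n²) one-dimensional transform with hypothetical function names, and it does not represent the Vortex file format or its wavelet path.

```rust
use std::f64::consts::PI;

/// Scale factor that makes the DCT-II basis orthonormal.
fn dct_scale(k: usize, n: usize) -> f64 {
    if k == 0 { (1.0 / n as f64).sqrt() } else { (2.0 / n as f64).sqrt() }
}

/// Orthonormal DCT-II of `x` (naive O(n^2) version, for illustration only).
fn dct(x: &[f64]) -> Vec<f64> {
    let n = x.len();
    (0..n)
        .map(|k| {
            let sum: f64 = x
                .iter()
                .enumerate()
                .map(|(i, &v)| v * (PI * (i as f64 + 0.5) * (k as f64) / (n as f64)).cos())
                .sum();
            dct_scale(k, n) * sum
        })
        .collect()
}

/// Inverse transform that uses only the first `keep` coefficients.
/// With `keep == coeffs.len()` it reconstructs the input up to rounding.
fn idct_truncated(coeffs: &[f64], keep: usize) -> Vec<f64> {
    let n = coeffs.len();
    (0..n)
        .map(|i| {
            coeffs
                .iter()
                .enumerate()
                .take(keep)
                .map(|(k, &c)| {
                    dct_scale(k, n) * c * (PI * (i as f64 + 0.5) * (k as f64) / (n as f64)).cos()
                })
                .sum::<f64>()
        })
        .collect()
}

/// Sum of squared differences between two equal-length slices.
fn squared_error(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(p, q)| (p - q) * (p - q)).sum()
}
```

Because the basis is orthonormal, the squared error never increases as `keep` grows, so a loader could stop reading coefficients whenever it runs out of memory budget and still hold a usable approximation.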
Most fine-tuned models share more than 90% of their weights with their base. Vortex stores a base model plus binary diffs between variants, the same way Git stores code changes rather than full file copies.
- Delta encoding designed for tensor data
- Diff, patch, and merge operations
- Practical outcome: 100 model variants in the space of five
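
The base-plus-diff idea can be sketched on flat tensors. This is an illustration only: it records every element whose bit pattern differs from the base, which is one simple delta scheme and not necessarily the encoding Vortex uses, and the names `make_diff` and `apply_patch` are hypothetical.

```rust
/// A sparse diff: (index, new value) for each element that differs from the base.
type TensorDiff = Vec<(usize, f32)>;

/// Records the elements of `target` whose bits differ from `base`.
/// Comparing bit patterns keeps the round trip exact, including for NaN.
fn make_diff(base: &[f32], target: &[f32]) -> TensorDiff {
    assert_eq!(base.len(), target.len(), "tensors must have the same length");
    base.iter()
        .zip(target)
        .enumerate()
        .filter(|(_, (b, t))| b.to_bits() != t.to_bits())
        .map(|(i, (_, &t))| (i, t))
        .collect()
}

/// Rebuilds the target tensor from the base and a diff.
fn apply_patch(base: &[f32], diff: &TensorDiff) -> Vec<f32> {
    let mut out = base.to_vec();
    for &(i, v) in diff {
        out[i] = v;
    }
    out
}
```

Storage cost scales with the number of changed elements, so a variant that shares most weights with its base stays small on disk.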
Precision is allocated dynamically based on what is being generated. Reasoning layers get higher precision during a math problem; style layers get priority during creative text. Quantization follows importance rather than a flat setting applied uniformly.
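
One simple way to picture importance-driven precision is to rank layers by an importance score and give the top few a wider bit width. This is a sketch under stated assumptions: the scores are supplied by the caller, the two-tier 8-bit and 4-bit split is an arbitrary example, and `allocate_bits` is a hypothetical name.

```rust
/// Assigns `high_bits` to the `high_count` most important layers
/// and `low_bits` to the rest.
/// Hypothetical policy: importance scores are an input, not computed here.
fn allocate_bits(importance: &[f32], high_count: usize, high_bits: u8, low_bits: u8) -> Vec<u8> {
    let mut order: Vec<usize> = (0..importance.len()).collect();
    // Sort layer indices by descending importance; total_cmp is a total order on f32.
    order.sort_by(|&a, &b| importance[b].total_cmp(&importance[a]));
    let mut bits = vec![low_bits; importance.len()];
    for &layer in order.iter().take(high_count) {
        bits[layer] = high_bits;
    }
    bits
}
```

Re-running the allocation with different scores, for example one set for math prompts and another for creative text, changes which layers receive the higher precision.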
vortex/
  crates/
    vortex-core/       # Tensor types, quantization, SIMD primitives
    vortex-stream/     # Streaming I/O engine (mmap, io_uring, prefetch)
    vortex-sparse/     # Sparsity prediction, bloom filters, scout networks
    vortex-holograph/  # DCT/wavelet compression, progressive decoding
    vortex-diff/       # Delta encoding, model diffing, patching
    vortex-runtime/    # Transformer execution engine, KV cache, sampling
    vortex-cli/        # CLI interface and TUI dashboard
# Build
cargo build --release
# Run inference, streaming mode, no full model load
vortex infer --model ./mistral-7b-q4.vortex --prompt "Hello world"
# Inspect a model file
vortex inspect --model ./mistral-7b-q4.vortex
# Convert from GGUF to Vortex format with holographic compression
vortex convert --input model.gguf --output model.vortex --holographic
# Create a diff between two model variants
vortex diff --base llama3-base.vortex --target llama3-instruct.vortex --output instruct.vdiff
# Apply a diff
vortex patch --base llama3-base.vortex --diff instruct.vdiff --output llama3-instruct.vortex

| Mode | RAM minimum | Storage | Relative speed |
|---|---|---|---|
| Full (classic) | 4 GB per 7B | — | Baseline |
| Stream | 512 MB | NVMe SSD | ~3-5x slower |
| Stream + Sparse | 512 MB | NVMe SSD | ~1.5-2x slower |
| Holographic | 1 GB | Any SSD | ~2-3x slower |
| All combined | 512 MB | NVMe SSD | ~2x slower |
Early research prototype, not production ready.
- Project architecture and workspace layout
- Core tensor types and quantization
- Streaming weight reader (mmap)
- Bloom filter sparsity predictor
- DCT holographic encoder/decoder
- Delta diff engine
- Full transformer forward pass
- KV cache management
- Tokenizer integration
- GGUF and SafeTensors import
- Benchmarks against llama.cpp
- GPU acceleration via wgpu
- io_uring backend
Apache-2.0