Vortex

An LLM inference engine written in pure Rust, designed to run large models on hardware that would normally refuse them.

The core bet is that at any given token generation step, roughly 90% of neurons produce near-zero values. Standard inference ignores this and computes everything. Vortex does not.

The problem

A 70B-parameter model needs around 35 GB of memory at Q4 quantization (70 billion weights at roughly 4 bits each). Holding that much in VRAM takes a $2,000+ GPU. Yet most of the compute spent on those weights goes to activations that contribute almost nothing to the output.
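The 35 GB figure follows from simple arithmetic. A minimal sketch, assuming a flat bits-per-weight cost and ignoring the KV cache, activations, and the per-block scales that real 4-bit formats carry (so actual files tend to be somewhat larger); `weight_bytes` is a hypothetical helper, not part of Vortex:

```rust
/// Approximate weight footprint in bytes for `params` parameters
/// stored at `bits_per_weight` bits each (weights only).
fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    params * bits_per_weight / 8
}
```

Calling `weight_bytes(70_000_000_000, 4)` gives 35,000,000,000 bytes, i.e. about 35 GB, while the same model at 16 bits per weight would need about 140 GB.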

How Vortex approaches it

Stream inference (`vortex-stream`)

Model weights are never fully loaded into RAM. Instead, they are read from an NVMe SSD layer by layer via mmap, used for that layer's computation, then discarded, the same way you stream a video rather than downloading it first.

Intelligent prefetch tuned to transformer access patterns
io_uring async I/O on Linux
Zero-copy weight access
Practical outcome: 70B model on an 8 GB machine
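The layer-by-layer idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the `vortex-stream` implementation: it swaps mmap and io_uring for plain `seek` + `read_exact`, and it assumes a hypothetical file layout in which all layers have the same byte size and are stored back to back. `LayerStream` and `with_layer` are made-up names.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Streams fixed-size layers from disk through one reusable buffer,
/// so peak memory is a single layer rather than the whole model.
struct LayerStream {
    file: File,
    layer_bytes: usize,
    buf: Vec<u8>,
}

impl LayerStream {
    /// Assumed layout: layers of `layer_bytes` bytes each, stored contiguously.
    fn open(path: impl AsRef<Path>, layer_bytes: usize) -> io::Result<Self> {
        Ok(Self {
            file: File::open(path)?,
            layer_bytes,
            buf: vec![0u8; layer_bytes],
        })
    }

    /// Read layer `idx` into the shared buffer and pass it to `compute`.
    /// The bytes are overwritten by the next call, which plays the role
    /// of "discarding" the layer after use.
    fn with_layer<T>(
        &mut self,
        idx: usize,
        compute: impl FnOnce(&[u8]) -> T,
    ) -> io::Result<T> {
        let offset = (idx * self.layer_bytes) as u64;
        self.file.seek(SeekFrom::Start(offset))?;
        self.file.read_exact(&mut self.buf)?;
        Ok(compute(&self.buf))
    }
}
```

A forward pass would call `with_layer(0, ..)`, `with_layer(1, ..)`, and so on in order. A real engine would overlap the read of layer `i + 1` with the compute of layer `i`, which is where prefetching and asynchronous I/O come in.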

Sparse routing (`vortex-sparse`)

Before the expensive matrix multiplication, a lightweight scout network predicts which neurons will activate. Everything else is skipped.

Pre-computed activation bloom filters per layer
Tiny MLP scout for runtime sparsity prediction
Structured pruning at inference time
Practical outcome: 70B model computes closer to a 7B
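To make the skip concrete, here is a small sketch of a matrix-vector product that only evaluates the neurons a predictor marked as active. The predictor is a plain threshold over precomputed scores, standing in for the scout MLP and bloom filters described above; `predict_active` and `sparse_matvec` are hypothetical names, and rows of `w` are assumed to correspond to neurons.

```rust
/// Stand-in for the scout: keep the neurons whose cheap score exceeds `threshold`.
fn predict_active(scores: &[f32], threshold: f32) -> Vec<usize> {
    scores
        .iter()
        .enumerate()
        .filter_map(|(i, &s)| if s > threshold { Some(i) } else { None })
        .collect()
}

/// Computes y = W x only for the rows listed in `active`.
/// Every skipped neuron is treated as exactly zero.
fn sparse_matvec(w: &[Vec<f32>], x: &[f32], active: &[usize]) -> Vec<f32> {
    let mut y = vec![0.0f32; w.len()];
    for &row in active {
        y[row] = w[row].iter().zip(x).map(|(a, b)| a * b).sum::<f32>();
    }
    y
}
```

The work done is proportional to `active.len()` rather than to the number of rows, so if roughly 10% of neurons are predicted active, roughly 10% of the dot products are computed. The output is only as good as the predictor: a neuron wrongly marked inactive is silently zeroed.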

Holographic weights (`vortex-holograph`)

Weight matrices are stored as superpositions of basis functions using DCT and wavelet decompositions. Like a hologram: cut it in half and you still see the full picture, just at lower resolution.

Progressive decoding: more coefficients loaded at runtime means higher quality
A single file adapts to whatever RAM is available
Practical outcome: 500 MB file instead of 4 GB, quality scales with hardware
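Progressive decoding can be illustrated on a single weight row with a one-dimensional DCT. This is a sketch assuming an unnormalized DCT-II and a naive O(n²) evaluation; the actual `vortex-holograph` format, its wavelet path, and its coefficient ordering are not specified here, and `dct` and `reconstruct` are hypothetical names.

```rust
use std::f32::consts::PI;

/// Unnormalized DCT-II of one weight row.
fn dct(row: &[f32]) -> Vec<f32> {
    let n = row.len();
    (0..n)
        .map(|k| {
            row.iter()
                .enumerate()
                .map(|(i, &v)| v * (PI / n as f32 * (i as f32 + 0.5) * k as f32).cos())
                .sum::<f32>()
        })
        .collect()
}

/// Inverse transform that uses only the first `keep` coefficients.
/// Fewer coefficients give a smoother, lower-fidelity row.
fn reconstruct(coeffs: &[f32], keep: usize) -> Vec<f32> {
    let n = coeffs.len();
    let keep = keep.min(n);
    (0..n)
        .map(|i| {
            let mut v = if keep > 0 { coeffs[0] / n as f32 } else { 0.0 };
            for k in 1..keep {
                let basis = (PI / n as f32 * (i as f32 + 0.5) * k as f32).cos();
                v += 2.0 / n as f32 * coeffs[k] * basis;
            }
            v
        })
        .collect()
}
```

Because the cosine basis is orthogonal, the squared reconstruction error never increases as `keep` grows. A loader could therefore read a prefix of the coefficients that fits the available RAM and fetch more later to refine the same weights.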

Neural diff (`vortex-diff`)

Most fine-tuned models share more than 90% of their weights with their base. Vortex stores a base model plus binary diffs between variants, the same way Git stores code changes rather than full file copies.

Delta encoding designed for tensor data
Diff, patch, and merge operations
Practical outcome: 100 model variants in the space of five
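A minimal diff/patch pair shows the shape of the idea. The sketch assumes both tensors are already quantized to `i8`, have the same shape, and differ in only a small fraction of positions; it stores changed entries as `(index, new value)` pairs, which is one simple encoding and not necessarily the one `vortex-diff` uses. `Delta`, `diff`, and `patch` here are hypothetical Rust names, unrelated to the CLI subcommands.

```rust
/// Sparse delta: one (flat index, new value) pair per changed element.
type Delta = Vec<(u32, i8)>;

/// Record every position where `target` differs from `base`.
fn diff(base: &[i8], target: &[i8]) -> Delta {
    assert_eq!(base.len(), target.len(), "tensors must have the same shape");
    base.iter()
        .zip(target)
        .enumerate()
        .filter(|(_, (b, t))| b != t)
        .map(|(i, (_, &t))| (i as u32, t))
        .collect()
}

/// Rebuild the target tensor from the base and a delta.
fn patch(base: &[i8], delta: &Delta) -> Vec<i8> {
    let mut out = base.to_vec();
    for &(i, v) in delta {
        out[i as usize] = v;
    }
    out
}
```

Storage cost scales with the number of changed elements rather than with the tensor size, so the saving depends entirely on how many weights a variant actually shares with its base.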

Fractal quantization (built into all modules)

Precision is allocated dynamically based on what is being generated. Reasoning layers get higher precision during a math problem; style layers get priority during creative text. Quantization follows importance rather than a flat setting applied uniformly.
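One way to picture importance-driven precision is a greedy allocator: every layer starts at a low bit width, and the most important layers are upgraded until a bit budget runs out. This is a sketch under several assumptions that the text above does not confirm: layers are equally sized, only two bit widths exist, and per-layer importance scores for the current task are supplied from elsewhere. `allocate_bits` is a hypothetical name.

```rust
/// Assigns `high` bits to the most important layers and `low` bits to the rest,
/// keeping the sum of per-layer bit widths within `budget_bits`.
fn allocate_bits(importance: &[f32], low: u8, high: u8, budget_bits: u32) -> Vec<u8> {
    assert!(high >= low, "high precision must not be below low precision");
    let mut bits = vec![low; importance.len()];
    let mut used = low as u32 * importance.len() as u32;
    let extra = (high - low) as u32;

    // Visit layers from most to least important.
    let mut order: Vec<usize> = (0..importance.len()).collect();
    order.sort_by(|&a, &b| importance[b].total_cmp(&importance[a]));

    for i in order {
        if used + extra > budget_bits {
            break; // budget exhausted: remaining layers stay at `low`
        }
        bits[i] = high;
        used += extra;
    }
    bits
}
```

Running the allocator again with a different importance vector (say, one that favors different layers for creative text than for a math problem) yields a different precision map under the same budget.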

Architecture

vortex/
  crates/
    vortex-core/       # Tensor types, quantization, SIMD primitives
    vortex-stream/     # Streaming I/O engine (mmap, io_uring, prefetch)
    vortex-sparse/     # Sparsity prediction, bloom filters, scout networks
    vortex-holograph/  # DCT/wavelet compression, progressive decoding
    vortex-diff/       # Delta encoding, model diffing, patching
    vortex-runtime/    # Transformer execution engine, KV cache, sampling
    vortex-cli/        # CLI interface and TUI dashboard

Quick start

# Build
cargo build --release

# Run inference, streaming mode, no full model load
vortex infer --model ./mistral-7b-q4.vortex --prompt "Hello world"

# Inspect a model file
vortex inspect --model ./mistral-7b-q4.vortex

# Convert from GGUF to Vortex format with holographic compression
vortex convert --input model.gguf --output model.vortex --holographic

# Create a diff between two model variants
vortex diff --base llama3-base.vortex --target llama3-instruct.vortex --output instruct.vdiff

# Apply a diff
vortex patch --base llama3-base.vortex --diff instruct.vdiff --output llama3-instruct.vortex

Hardware requirements

| Mode | Minimum RAM | Storage | Relative speed |
| --- | --- | --- | --- |
| Full (classic) | 4 GB per 7B parameters | n/a | Baseline |
| Stream | 512 MB | NVMe SSD | ~3-5x slower |
| Stream + Sparse | 512 MB | NVMe SSD | ~1.5-2x slower |
| Holographic | 1 GB | Any SSD | ~2-3x slower |
| All combined | 512 MB | NVMe SSD | ~2x slower |

Status

Early research prototype, not production ready.

License

Apache-2.0
