infinition/vortex

Vortex

An LLM inference engine written in pure Rust, designed to run large models on hardware that would normally refuse them.

The core bet is that at any given token generation step, roughly 90% of neurons produce near-zero values. Standard inference ignores this and computes everything. Vortex does not.

The problem

A 70B-parameter model needs around 35 GB of memory for its weights alone in Q4 (4-bit) quantization. Holding that in VRAM takes a $2,000+ GPU. Yet most of the compute spent on those weights goes to activations that contribute almost nothing to the output.
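The 35 GB figure follows from the parameter count times the bits per weight. A small sketch of that arithmetic, with hypothetical helper names that are not part of Vortex, counting weights only and ignoring the KV cache, activations, and per-block quantization metadata:

```rust
/// Approximate bytes needed to hold only the weights of a model.
/// `params` is the parameter count, `bits_per_weight` the quantized width.
fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    params * bits_per_weight / 8
}

/// Converts bytes to decimal gigabytes (1 GB = 10^9 bytes).
fn to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

For example, `to_gb(weight_bytes(70_000_000_000, 4))` gives the 35 GB quoted above, and the same call with 7 billion parameters gives 3.5 GB.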

How Vortex approaches it

Stream inference (vortex-stream)

Model weights are never fully loaded into RAM. Instead, they are read from an NVMe SSD layer by layer via mmap, used for that layer's computation, then discarded, the same way you stream a video rather than downloading it first.

  • Intelligent prefetch tuned to transformer access patterns
  • io_uring async I/O on Linux
  • Zero-copy weight access
  • Practical outcome: 70B model on an 8 GB machine
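The layer-by-layer idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the vortex-stream implementation: it uses plain sequential reads instead of mmap or io_uring, assumes every layer has the same size, assumes weights are stored as little-endian f32, and all function names are hypothetical.

```rust
use std::io::{self, Read};

/// Reads `n_layers` fixed-size layers one at a time from `src`, handing each
/// to `compute` and then reusing the same buffer for the next layer, so peak
/// weight memory stays at one layer regardless of model size.
fn stream_layers<R: Read>(
    mut src: R,
    layer_bytes: usize,
    n_layers: usize,
    mut compute: impl FnMut(usize, &[u8]),
) -> io::Result<()> {
    let mut buf = vec![0u8; layer_bytes]; // the only weight buffer ever allocated
    for layer in 0..n_layers {
        src.read_exact(&mut buf)?; // overwrites the previous layer's weights
        compute(layer, &buf);
    }
    Ok(())
}

/// Decodes little-endian f32 weights from a raw layer buffer
/// (the on-disk layout is an assumption made for this sketch).
fn decode_f32(raw: &[u8]) -> Vec<f32> {
    raw.chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}
```

Any `Read` source works here, so `src` can be a `std::fs::File` opened on the model file or an in-memory `std::io::Cursor` for testing. A memory-mapped version would replace the copy into `buf` with a borrowed slice of the mapping.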

Sparse routing (vortex-sparse)

Before the expensive matrix multiplication, a lightweight scout network predicts which neurons will activate. Everything else is skipped.

  • Pre-computed activation bloom filters per layer
  • Tiny MLP scout for runtime sparsity prediction
  • Structured pruning at inference time
  • Practical outcome: a 70B model does roughly the compute of a 7B model
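A minimal sketch of the predict-then-skip idea for one ReLU layer. The scout here is a deliberately crude stand-in (a partial dot product over the first `probe` input features), not the tiny MLP or bloom filters described above, and every name in it is hypothetical.

```rust
/// Hypothetical scout: scores each neuron using only the first `probe` input
/// features and keeps the neurons whose partial pre-activation exceeds `threshold`.
fn scout_active(weights: &[Vec<f32>], x: &[f32], probe: usize, threshold: f32) -> Vec<usize> {
    weights
        .iter()
        .enumerate()
        .filter(|(_, row)| {
            let partial: f32 = row.iter().zip(x).take(probe).map(|(w, xi)| w * xi).sum();
            partial > threshold
        })
        .map(|(i, _)| i)
        .collect()
}

/// ReLU layer that computes full dot products only for `active` neurons.
/// All other outputs stay at zero and their weight rows are never read.
fn sparse_relu_layer(weights: &[Vec<f32>], x: &[f32], active: &[usize]) -> Vec<f32> {
    let mut out = vec![0.0; weights.len()];
    for &i in active {
        let z: f32 = weights[i].iter().zip(x).map(|(w, xi)| w * xi).sum();
        out[i] = z.max(0.0);
    }
    out
}
```

The output matches the dense layer whenever the scout keeps every neuron that would have produced a positive value. A neuron it wrongly drops is silently zeroed, which is the accuracy risk any real predictor has to keep small.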

Holographic weights (vortex-holograph)

Weight matrices are stored as superpositions of basis functions using DCT and wavelet decompositions. Like a hologram: cut it in half and you still see the full picture, just at lower resolution.

  • Progressive decoding: more coefficients loaded at runtime means higher quality
  • A single file adapts to whatever RAM is available
  • Practical outcome: 500 MB file instead of 4 GB, quality scales with hardware
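Progressive decoding can be illustrated on a single row of weights with a naive 1-D DCT. This is a sketch under assumptions, not the vortex-holograph format: it uses an O(n²) orthonormal DCT-II, keeps the lowest-frequency coefficients first, and omits wavelets entirely. Function names are hypothetical.

```rust
use std::f32::consts::PI;

/// Orthonormal DCT-II of a row of weights.
fn dct(x: &[f32]) -> Vec<f32> {
    let n = x.len();
    (0..n)
        .map(|k| {
            let scale = if k == 0 { (1.0 / n as f32).sqrt() } else { (2.0 / n as f32).sqrt() };
            let sum: f32 = x
                .iter()
                .enumerate()
                .map(|(i, v)| v * (PI / n as f32 * (i as f32 + 0.5) * k as f32).cos())
                .sum();
            scale * sum
        })
        .collect()
}

/// Inverse transform that uses only the first `keep` coefficients,
/// as if the rest had not been loaded yet.
fn idct_partial(coeffs: &[f32], keep: usize) -> Vec<f32> {
    let n = coeffs.len();
    (0..n)
        .map(|i| {
            coeffs
                .iter()
                .take(keep)
                .enumerate()
                .map(|(k, c)| {
                    let scale = if k == 0 { (1.0 / n as f32).sqrt() } else { (2.0 / n as f32).sqrt() };
                    scale * c * (PI / n as f32 * (i as f32 + 0.5) * k as f32).cos()
                })
                .sum::<f32>()
        })
        .collect()
}

/// Root-mean-square difference between the original and a reconstruction.
fn rms_error(a: &[f32], b: &[f32]) -> f32 {
    let sq: f32 = a.iter().zip(b).map(|(p, q)| (p - q) * (p - q)).sum();
    (sq / a.len() as f32).sqrt()
}
```

Because the transform is orthonormal, the RMS error never increases as `keep` grows, and with all coefficients the row is recovered up to floating-point rounding. How well real weight matrices compress this way depends on how much of their energy sits in the low-frequency coefficients.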

Neural diff (vortex-diff)

Most fine-tuned models share more than 90% of their weights with their base. Vortex stores a base model plus binary diffs between variants, the same way Git stores code changes rather than full file copies.

  • Delta encoding designed for tensor data
  • Diff, patch, and merge operations
  • Practical outcome: 100 model variants in the space of five
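The diff and patch operations can be sketched as a sparse list of changed elements. This is an illustrative encoding, assuming both tensors are flattened f32 slices of equal length. The actual `.vdiff` format is not described here, and the `Delta` type and function names are hypothetical.

```rust
/// One changed element: its position in the flattened tensor and its new value.
#[derive(Debug, PartialEq)]
struct Delta {
    index: u32,
    value: f32,
}

/// Records only the elements where `target` differs bit-for-bit from `base`.
fn diff(base: &[f32], target: &[f32]) -> Vec<Delta> {
    assert_eq!(base.len(), target.len(), "tensors must have the same shape");
    base.iter()
        .zip(target)
        .enumerate()
        .filter(|(_, (b, t))| b.to_bits() != t.to_bits())
        .map(|(i, (_, t))| Delta { index: i as u32, value: *t })
        .collect()
}

/// Rebuilds the target tensor from the base plus its delta list.
fn patch(base: &[f32], deltas: &[Delta]) -> Vec<f32> {
    let mut out = base.to_vec();
    for d in deltas {
        out[d.index as usize] = d.value;
    }
    out
}
```

Comparing bit patterns rather than float values keeps the round trip exact, including for NaN and signed zero. The storage saving depends entirely on how many elements are truly unchanged. When fine-tuning nudges most weights slightly, a sparse list like this saves little, and a scheme that compresses small numeric differences would be needed instead.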

Fractal quantization (built into all modules)

Precision is allocated dynamically based on what is being generated. Reasoning layers get higher precision during a math problem; style layers get priority during creative text. Quantization follows importance rather than a flat setting applied uniformly.
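One way to picture importance-driven precision is a policy that maps a per-layer importance score to a bit width, followed by ordinary uniform quantization at that width. The thresholds, the score range, and the function names below are assumptions made for illustration. How Vortex actually scores importance per task is not specified here.

```rust
/// Hypothetical policy: maps an importance score in [0, 1] to a bit width.
fn bits_for(importance: f32) -> u32 {
    if importance >= 0.66 {
        8
    } else if importance >= 0.33 {
        4
    } else {
        2
    }
}

/// Symmetric uniform quantization to `bits` bits (must be >= 2) followed by
/// dequantization, returning the values the layer would actually compute with.
fn fake_quantize(weights: &[f32], bits: u32) -> Vec<f32> {
    let levels = ((1u32 << (bits - 1)) - 1) as f32; // 127 for 8 bits, 7 for 4, 1 for 2
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    if max_abs == 0.0 {
        return weights.to_vec(); // all-zero layer: nothing to quantize
    }
    let scale = max_abs / levels;
    weights
        .iter()
        .map(|w| (w / scale).round().clamp(-levels, levels) * scale)
        .collect()
}
```

A layer scored 0.9 would be quantized with `fake_quantize(w, bits_for(0.9))`, that is at 8 bits, while a layer scored 0.1 drops to 2 bits and loses most of its detail.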

Architecture

vortex/
  crates/
    vortex-core/       # Tensor types, quantization, SIMD primitives
    vortex-stream/     # Streaming I/O engine (mmap, io_uring, prefetch)
    vortex-sparse/     # Sparsity prediction, bloom filters, scout networks
    vortex-holograph/  # DCT/wavelet compression, progressive decoding
    vortex-diff/       # Delta encoding, model diffing, patching
    vortex-runtime/    # Transformer execution engine, KV cache, sampling
    vortex-cli/        # CLI interface and TUI dashboard

Quick start

# Build
cargo build --release

# Run inference, streaming mode, no full model load
vortex infer --model ./mistral-7b-q4.vortex --prompt "Hello world"

# Inspect a model file
vortex inspect --model ./mistral-7b-q4.vortex

# Convert from GGUF to Vortex format with holographic compression
vortex convert --input model.gguf --output model.vortex --holographic

# Create a diff between two model variants
vortex diff --base llama3-base.vortex --target llama3-instruct.vortex --output instruct.vdiff

# Apply a diff
vortex patch --base llama3-base.vortex --diff instruct.vdiff --output llama3-instruct.vortex

Hardware requirements

Mode             RAM minimum    Storage     Relative speed
Full (classic)   4 GB per 7B    -           Baseline
Stream           512 MB         NVMe SSD    ~3-5x slower
Stream + Sparse  512 MB         NVMe SSD    ~1.5-2x slower
Holographic      1 GB           Any SSD     ~2-3x slower
All combined     512 MB         NVMe SSD    ~2x slower

Status

Early research prototype, not production ready.

  • Project architecture and workspace layout
  • Core tensor types and quantization
  • Streaming weight reader (mmap)
  • Bloom filter sparsity predictor
  • DCT holographic encoder/decoder
  • Delta diff engine
  • Full transformer forward pass
  • KV cache management
  • Tokenizer integration
  • GGUF and SafeTensors import
  • Benchmarks against llama.cpp
  • GPU acceleration via wgpu
  • io_uring backend

License

Apache-2.0
