
oMLX


LLM inference with persistent sessions, optimized for your Mac
Stateful KV caching, TurboQuant compression, continuous batching — for people who own their hardware.


What's Different · Install · Sessions · TurboQuant · Benchmarks · All Features · CLI Configuration


Recent: dtype fix turns a 2x slowdown into a 16% speedup

The benchmark suite caught a critical bug: TurboQuant was decompressing KV tensors to float32 instead of the model's native bfloat16. Every decode step ran at half speed with double memory. One-line fix — store the original dtype during compression, cast back on decompression.

Before fix (float32 KV):

8-turn session total: 50.6s (87% SLOWER than stateless)

After fix (bfloat16 KV):

8-turn session total: 32.3s (16% FASTER than stateless)
Turn 8: 87% cache hit, 2.74s vs 3.16s stateless
Memory: 45MB compressed (was 155MB uncompressed)
Park: 0.12s to SSD | Resume: instant

This is what benchmarks are for. Functional tests passed — every cache hit was correct, every response was valid. But the model was doing attention in float32 instead of bfloat16, and nobody would have noticed without wall-clock timing.
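A minimal illustration of the fix pattern, using plain mlx arrays (the tensor shape and variable names are illustrative, not the actual oMLX code):

import mlx.core as mx

k = mx.random.normal((1, 8, 512, 128)).astype(mx.bfloat16)  # KV tensor in the model's native dtype

# Bug: dequantization produced float32, so attention ran in float32
k_bad = k.astype(mx.float32)               # double memory, slower decode

# Fix: remember the original dtype at compression time, cast back on decompression
original_dtype = k.dtype
k_restored = k_bad.astype(original_dtype)  # back to bfloat16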


This is a fork of jundot/omlx — an excellent MLX inference server with continuous batching, tiered KV caching, multi-model serving, and a polished admin dashboard. All credit for the foundation goes to @jundot and the oMLX contributors. We built on top of their work.


What's Different

Every cloud LLM provider is stateless because they have to be. They serve hundreds of millions of users — they can't hold your KV cache between turns. Every request re-sends the entire conversation. The server recomputes everything from scratch.

You're not at scale. You're one household. One small team. One mesh of machines. The constraint doesn't apply.

This fork adds two things:

1. Persistent Sessions

After a chat completion, the KV cache stays in memory, tagged with a session ID. Next turn, only the new tokens are computed — everything prior is already there.

Turn 1:   0% cache hit   (cold start)
Turn 2:  56% cache hit   (prior context reused)
Turn 3:  68% cache hit   (compounding)
Turn 4:  74% cache hit   (3/4 of prefill is free)
Turn 20: nearly everything cached

Park a session to SSD when you walk away. Resume it hours later — KV state loads from a safetensors file in seconds. Pick up exactly where you left off.

2. TurboQuant KV Compression

Session KV cache is compressed 4x in memory using the algorithm from Zandieh et al. (ICLR 2026). Random rotation + Lloyd-Max scalar quantization at 3 bits per coordinate. MSE 0.034 — near-zero quality loss.

                                     Without TurboQuant   With TurboQuant
Session memory (short conv)          ~155 MB              ~40 MB
Concurrent sessions (18GB headroom)  ~1 large             ~4 large
Park file size                       ~155 MB              ~155 MB (raw on SSD)

Compression is transparent — enabled by default, applied on store, reversed on retrieval.

Everything Else Still Works

The existing /v1/chat/completions endpoint, admin dashboard, multi-model serving, tiered KV cache, VLM support, embeddings, reranking — all unchanged. Sessions are additive.


Install

From This Fork

git clone https://github.com/solatticus/omlx.git
cd omlx
pip install -e .

From the Original (no sessions)

If you don't need sessions, use the original: github.com/jundot/omlx. It has a macOS app, Homebrew tap, and auto-updates.

Requirements

  • macOS 15.0+ (Sequoia)
  • Python 3.10+
  • Apple Silicon (M1/M2/M3/M4)

Quickstart

omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache

The server discovers models from subdirectories automatically. Any OpenAI-compatible client can connect to http://localhost:8000/v1.
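For example, with the official openai Python package (the model name below is a placeholder for whatever lives in your model directory; pass your --api-key value instead of the dummy string if authentication is enabled):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3.5-32B-4bit",
    messages=[{"role": "user", "content": "Hello from my Mac"}],
)
print(resp.choices[0].message.content)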


Sessions

Sessions make the server remember. Create one, chat in it, and KV state persists between turns.

Endpoints

Method   Path                           Description
POST     /v1/sessions                   Create a session
POST     /v1/sessions/{id}/chat         Chat (KV retained)
GET      /v1/sessions                   List all sessions
GET      /v1/sessions/{id}              Session details
POST     /v1/sessions/{id}/park         Serialize KV to SSD
POST     /v1/sessions/{id}/resume       Load KV from SSD
DELETE   /v1/sessions/{id}              Destroy session

Create and Chat

# Create a session
SESSION=$(curl -s http://localhost:8000/v1/sessions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.5-32B-4bit"}' | jq -r .session_id)

# Turn 1 — cold start, full prefill
curl http://localhost:8000/v1/sessions/$SESSION/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Explain KV caching in transformers."}], "max_tokens": 256}'

# Turn 2 — prior context cached, only new tokens computed
curl http://localhost:8000/v1/sessions/$SESSION/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [
    {"role": "user", "content": "Explain KV caching in transformers."},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "How does paged attention improve on this?"}
  ], "max_tokens": 256}'

The client sends full message history each turn (same as stateless). The server detects the shared prefix and skips recomputing it.

Park and Resume

# Park — KV serialized to SSD, memory freed
curl -X POST http://localhost:8000/v1/sessions/$SESSION/park

# Hours later...
curl -X POST http://localhost:8000/v1/sessions/$SESSION/resume

# Chat continues with cached KV intact

Streaming

curl -N http://localhost:8000/v1/sessions/$SESSION/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [...], "stream": true}'

SSE chunks include session_id, content, finished. The final chunk has usage, turn, and cached_tokens.
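A minimal Python consumer, assuming standard data:-prefixed SSE framing with JSON payloads carrying the fields above (a sketch, not the exact wire format):

import json, requests

session_id = "..."  # from POST /v1/sessions
url = f"http://localhost:8000/v1/sessions/{session_id}/chat"
payload = {"messages": [{"role": "user", "content": "Keep going."}], "stream": True}

with requests.post(url, json=payload, stream=True) as r:
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = json.loads(line[len(b"data: "):])
        print(chunk.get("content", ""), end="", flush=True)
        if chunk.get("finished"):
            print("\n", chunk.get("usage"), chunk.get("turn"), chunk.get("cached_tokens"))
            break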

How It Works

  1. On generation complete: Scheduler extracts KV tensors from the batch, stores them in a SessionKVStore keyed by session ID (instead of freeing them)
  2. On next turn: Scheduler detects the session, finds the longest matching token prefix, reconstructs cache objects from stored tensors, injects them into the batch — skipping prefill for cached tokens
  3. TurboQuant: KV tensors are compressed 4x on store and decompressed on retrieval, transparent to the scheduler
  4. Park: Decompresses, serializes raw tensors to safetensors on SSD, frees memory
  5. Resume: Loads from SSD, recompresses, session continues

The existing /v1/chat/completions stateless endpoint is completely unchanged.
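To make step 2 concrete, here is a minimal sketch of longest-prefix matching over stored token IDs (hypothetical helper names; the real scheduler works over cache blocks, not plain lists):

def longest_cached_prefix(prompt_ids: list[int], session_ids: list[int]) -> int:
    """How many leading tokens of the new prompt are already in the session cache."""
    n = 0
    for a, b in zip(prompt_ids, session_ids):
        if a != b:
            break
        n += 1
    return n

# Only tokens past the matched prefix need prefill:
# hit = longest_cached_prefix(prompt_ids, session.token_ids)
# to_prefill = prompt_ids[hit:]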


TurboQuant

KV cache compression based on TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Zandieh, Daliri, Hadian, Mirrokni — Google Research / NYU, ICLR 2026).

Algorithm

  1. Rotate each KV vector by a random orthogonal matrix (QR decomposition, computed once per head dimension)
  2. After rotation, coordinates are approximately i.i.d. from a known distribution (Beta converging to Gaussian)
  3. Quantize each coordinate independently using precomputed Lloyd-Max centroids (3 bits = 8 levels)
  4. Store the quantized indices (uint8) + vector norms (float16)
  5. Dequantize: look up centroids, rotate back, rescale
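A self-contained sketch of that round-trip in numpy (simplified, with approximate Gaussian Lloyd-Max centroids; not the actual oMLX kernels):

import numpy as np

HEAD_DIM = 128
rng = np.random.default_rng(0)

# Step 1: random orthogonal rotation via QR, computed once per head dimension
Q, _ = np.linalg.qr(rng.standard_normal((HEAD_DIM, HEAD_DIM)))

# Step 3: precomputed 3-bit (8-level) Lloyd-Max centroids for a standard Gaussian (approximate)
CENTROIDS = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])

def compress(v):
    """Rotate, normalize, quantize each coordinate to the nearest centroid (steps 1-4)."""
    norm = np.float16(np.linalg.norm(v))
    r = (Q @ (v / (float(norm) + 1e-8))) * np.sqrt(HEAD_DIM)   # ~unit-variance coordinates
    idx = np.abs(r[:, None] - CENTROIDS[None, :]).argmin(axis=1).astype(np.uint8)
    return idx, norm                                            # uint8 indices + float16 norm

def decompress(idx, norm):
    """Look up centroids, rotate back, rescale (step 5)."""
    r = CENTROIDS[idx] / np.sqrt(HEAD_DIM)
    return (Q.T @ r) * float(norm)

v = rng.standard_normal(HEAD_DIM)
v_hat = decompress(*compress(v))
print("relative MSE:", float(np.mean((v - v_hat) ** 2) / np.mean(v ** 2)))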

Results

Tested on Qwen3.5-VL-122B-A10B (MoE, head_dim 128/256):

  • 4x compression (3-bit indices stored as uint8)
  • MSE 0.034 (theoretical bound: 0.043)
  • Zero quality degradation on multi-turn conversations
  • Enabled by default, configurable via --kv-compress-bits 0/2/3/4

Benchmark Results

8-turn coding conversation on Qwen3.5-VL-122B-A10B-4bit-CRACK (M3 Ultra, 96GB):

Turn  Stateless          Session + TurboQuant
      time   prompt hit  time   ttft   prompt cached  hit
  1  18.13s     56   0%  13.89s 0.216s     56      0   0%
  2   2.65s    126   0%   2.64s 0.352s    126     54  42%
  3   2.74s    181   0%   2.58s 0.288s    181    124  68%
  4   2.83s    253   0%   2.63s 0.329s    253    179  70%
  5   2.91s    318   0%   2.62s 0.319s    318    251  78%
  6   3.01s    396   0%   2.64s 0.337s    396    316  79%
  7   3.08s    452   0%   2.60s 0.292s    452    394  87%
  8   3.16s    515   0%   2.74s 0.423s    515    450  87%
                -----                       -----
Total: 38.5s              32.3s (+16% faster)

Session memory: 45.4 MB (TurboQuant 3-bit compressed)
Park to SSD: 0.12s | Resume: instant

Cache hits compound every turn. By turn 8, 87% of the prompt is served from the session KV cache. Even at ~500 prompt tokens, sessions are already faster than stateless, and the gap widens with longer conversations.

Run the benchmark yourself:

python scripts/benchmark_sessions.py \
  --url http://localhost:8080 \
  --model YOUR_MODEL \
  --turns 10 --max-tokens 128

Features

Everything from upstream oMLX, plus sessions and TurboQuant.

From Upstream

  • Tiered KV Cache — hot (RAM) + cold (SSD) with prefix sharing and copy-on-write
  • Continuous Batching — concurrent request handling via mlx-lm BatchGenerator
  • Multi-Model Serving — LLMs, VLMs, embeddings, rerankers with LRU eviction
  • Vision-Language Models — multi-image chat, OCR auto-detection
  • Admin Dashboard — real-time monitoring, model management, chat UI, benchmarks
  • macOS Menubar App — native, not Electron
  • Per-Model Settings — sampling params, TTL, aliases, chat template kwargs
  • Tool Calling — function calling, JSON schema, MCP integration
  • API Compatibility — OpenAI and Anthropic drop-in replacement

Added in This Fork

  • Persistent Sessions — KV cache retained across turns, park/resume to SSD
  • TurboQuant Compression — 4x KV memory reduction with near-zero quality loss
  • Session-Aware Memory Enforcement — auto-parks LRU sessions under memory pressure
  • Thinking Model Support — handles generation prompt tokens correctly for reasoning models

CLI Configuration

# Basic: serve models with SSD cache
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache

# Memory limits
omlx serve --model-dir ~/models --max-model-memory 32GB --max-process-memory 80%

# Hot cache + large initial block pool
omlx serve --model-dir ~/models --hot-cache-max-size 4GB --initial-cache-blocks 512

# With MCP tools
omlx serve --model-dir ~/models --mcp-config mcp.json

# API key authentication
omlx serve --model-dir ~/models --api-key your-secret-key

Sessions and TurboQuant are enabled automatically. No additional flags needed.

Architecture
FastAPI Server (OpenAI / Anthropic / Sessions API)
    │
    ├── SessionManager (session lifecycle, KV store, park/resume)
    │   └── SessionKVStore (in-memory, TurboQuant compressed)
    │       └── TurboQuant (rotate → quantize → store, 4x compression)
    │
    ├── EnginePool (multi-model, LRU eviction, TTL)
    │   ├── BatchedEngine (LLMs, continuous batching)
    │   ├── VLMEngine (vision-language models)
    │   ├── EmbeddingEngine
    │   └── RerankerEngine
    │
    ├── ProcessMemoryEnforcer (memory limit + session auto-park)
    │
    ├── Scheduler (FCFS, session cache injection)
    │   └── mlx-lm BatchGenerator
    │
    └── Cache Stack
        ├── PagedCacheManager (block-based, CoW, prefix sharing)
        ├── Hot Cache (in-memory tier, write-back)
        └── PagedSSDCacheManager (SSD cold tier, safetensors)

Models

Point --model-dir at a directory containing MLX-format model subdirectories.
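For example (model names are illustrative; each subdirectory is one MLX-format model):

~/models/
├── Qwen3.5-32B-4bit/
├── bge-m3-mlx/
└── glm-4v-mlx/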

Type       Models
LLM        Any model supported by mlx-lm
VLM        Qwen3.5 Series, GLM-4V, Pixtral, and other mlx-vlm models
Embedding  BERT, BGE-M3, ModernBERT
Reranker   ModernBERT, XLM-RoBERTa

License

Apache 2.0

Acknowledgments

  • jundot/omlx — the foundation. Continuous batching, tiered KV caching, multi-model serving, admin dashboard, macOS app. This fork builds directly on their work.
  • MLX and mlx-lm by Apple
  • mlx-vlm — Vision-language model inference on Apple Silicon
  • TurboQuant (Zandieh et al., ICLR 2026) — KV cache quantization algorithm
  • PolarQuant (Han et al., AISTATS 2026) — random preconditioning for KV compression
  • vllm-mlx — where oMLX started
  • venvstacks — portable Python environment layering
  • mlx-embeddings — embedding model support
