LLM inference with persistent sessions, optimized for your Mac
Stateful KV caching, TurboQuant compression, continuous batching — for people who own their hardware.
What's Different · Install · Sessions · TurboQuant · Benchmarks · All Features · CLI Configuration
The benchmark suite caught a critical bug: TurboQuant was decompressing KV tensors to float32 instead of the model's native bfloat16. Every decode step ran at half speed with double memory. One-line fix — store the original dtype during compression, cast back on decompression.
Before fix (float32 KV):

```
8-turn session total: 50.6s (87% SLOWER than stateless)
```

After fix (bfloat16 KV):

```
8-turn session total: 32.3s (16% FASTER than stateless)
Turn 8:  87% cache hit, 2.74s vs 3.16s stateless
Memory:  45 MB compressed (was 155 MB uncompressed)
Park:    0.12s to SSD | Resume: instant
```

This is what benchmarks are for. Functional tests passed — every cache hit was correct, every response was valid. But the model was doing attention in float32 instead of bfloat16 and nobody would have noticed without wall-clock timing.
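A minimal, self-contained sketch of the dtype round-trip that the fix restores (illustrative only; it stands in for the real compress/decompress path, which also quantizes):

```python
import mlx.core as mx

kv = mx.random.normal((8, 128)).astype(mx.bfloat16)   # KV tensor in the model's native dtype

# Bug: decompression always produced float32, doubling memory and slowing attention.
restored_buggy = kv.astype(mx.float32)                 # stand-in for "decompress without casting back"
print(restored_buggy.dtype)                            # float32

# Fix: record the original dtype at compression time and cast back on decompression.
orig_dtype = kv.dtype
restored_fixed = restored_buggy.astype(orig_dtype)
print(restored_fixed.dtype)                            # bfloat16
```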
This is a fork of jundot/omlx — an excellent MLX inference server with continuous batching, tiered KV caching, multi-model serving, and a polished admin dashboard. All credit for the foundation goes to @jundot and the oMLX contributors. We built on top of their work.
Every cloud LLM provider is stateless because they have to be. They serve 200 million users — they can't hold your KV cache between turns. Every request re-sends the entire conversation. The server recomputes everything from scratch.
You're not at scale. You're one household. One small team. One mesh of machines. The constraint doesn't apply.
This fork adds two things:
After a chat completion, the KV cache stays in memory, tagged with a session ID. Next turn, only the new tokens are computed — everything prior is already there.
```
Turn 1:  0%  cache hit (cold start)
Turn 2:  56% cache hit (prior context reused)
Turn 3:  68% cache hit (compounding)
Turn 4:  74% cache hit (3/4 of prefill is free)
Turn 20: nearly everything cached
```
Park a session to SSD when you walk away. Resume it hours later — KV state loads from a safetensors file in seconds. Pick up exactly where you left off.
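A rough sketch of what parking amounts to, assuming the session's KV tensors are held as MLX arrays (function and key names here are illustrative, not the fork's actual API):

```python
import mlx.core as mx

def park(session_kv: dict[str, mx.array], path: str) -> None:
    # Serialize the raw KV tensors to a safetensors file on SSD; the in-memory copy can then be freed.
    # `path` should end in .safetensors so mx.load can infer the format on resume.
    mx.save_safetensors(path, session_kv)

def resume(path: str) -> dict[str, mx.array]:
    # Load the KV state back from SSD; generation continues from the cached prefix.
    return mx.load(path)
```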
Session KV cache is compressed 4x in memory using the algorithm from Zandieh et al. (ICLR 2026). Random rotation + Lloyd-Max scalar quantization at 3 bits per coordinate. MSE 0.034 — near-zero quality loss.
| | Without TurboQuant | With TurboQuant |
|---|---|---|
| Session memory (short conv) | ~155 MB | ~40 MB |
| Concurrent sessions (18GB headroom) | ~1 large | ~4 large |
| Park file size | ~155 MB | ~155 MB (raw on SSD) |
Compression is transparent — enabled by default, applied on store, reversed on retrieval.
The existing /v1/chat/completions endpoint, admin dashboard, multi-model serving, tiered KV cache, VLM support, embeddings, reranking — all unchanged. Sessions are additive.
```bash
git clone https://github.com/solatticus/omlx.git
cd omlx
pip install -e .
```

If you don't need sessions, use the original: github.com/jundot/omlx. It has a macOS app, Homebrew tap, and auto-updates.
- macOS 15.0+ (Sequoia)
- Python 3.10+
- Apple Silicon (M1/M2/M3/M4)
```bash
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache
```

The server discovers models from subdirectories automatically. Any OpenAI-compatible client can connect to http://localhost:8000/v1.
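For example, with the official `openai` Python client (the model name is just an example; use whatever subdirectory the server discovered, and any placeholder API key unless you started the server with `--api-key`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3.5-32B-4bit",   # any model directory under --model-dir
    messages=[{"role": "user", "content": "Hello from oMLX"}],
)
print(resp.choices[0].message.content)
```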
Sessions make the server remember. Create one, chat in it, and KV state persists between turns.
| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/sessions` | Create a session |
| `POST` | `/v1/sessions/{id}/chat` | Chat (KV retained) |
| `GET` | `/v1/sessions` | List all sessions |
| `GET` | `/v1/sessions/{id}` | Session details |
| `POST` | `/v1/sessions/{id}/park` | Serialize KV to SSD |
| `POST` | `/v1/sessions/{id}/resume` | Load KV from SSD |
| `DELETE` | `/v1/sessions/{id}` | Destroy session |
```bash
# Create a session
SESSION=$(curl -s http://localhost:8000/v1/sessions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.5-32B-4bit"}' | jq -r .session_id)

# Turn 1 — cold start, full prefill
curl http://localhost:8000/v1/sessions/$SESSION/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Explain KV caching in transformers."}], "max_tokens": 256}'

# Turn 2 — prior context cached, only new tokens computed
curl http://localhost:8000/v1/sessions/$SESSION/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [
    {"role": "user", "content": "Explain KV caching in transformers."},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "How does paged attention improve on this?"}
  ], "max_tokens": 256}'
```

The client sends full message history each turn (same as stateless). The server detects the shared prefix and skips recomputing it.
```bash
# Park — KV serialized to SSD, memory freed
curl -X POST http://localhost:8000/v1/sessions/$SESSION/park

# Hours later...
curl -X POST http://localhost:8000/v1/sessions/$SESSION/resume
# Chat continues with cached KV intact
```

To stream responses, pass `"stream": true`:

```bash
curl -N http://localhost:8000/v1/sessions/$SESSION/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [...], "stream": true}'
```

SSE chunks include `session_id`, `content`, `finished`. The final chunk has `usage`, `turn`, and `cached_tokens`.
- On generation complete: Scheduler extracts KV tensors from the batch and stores them in a `SessionKVStore` keyed by session ID (instead of freeing them) — see the sketch after this list
- On next turn: Scheduler detects the session, finds the longest matching token prefix, reconstructs cache objects from stored tensors, and injects them into the batch — skipping prefill for cached tokens
- TurboQuant: KV tensors are compressed 4x on store and decompressed on retrieval, transparent to the scheduler
- Park: Decompresses, serializes raw tensors to safetensors on SSD, frees memory
- Resume: Loads from SSD, recompresses, session continues
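Conceptually, the store and the prefix match look something like the sketch below (illustrative names and structure only, not the fork's internals):

```python
import mlx.core as mx

class SessionKVStoreSketch:
    """Keeps per-session KV tensors alive between turns instead of freeing them."""

    def __init__(self) -> None:
        # session_id -> (token ids covered by the cache, per-layer KV tensors)
        self._store: dict[str, tuple[list[int], list[mx.array]]] = {}

    def put(self, session_id: str, tokens: list[int], kv: list[mx.array]) -> None:
        self._store[session_id] = (tokens, kv)

    def longest_prefix(self, session_id: str, new_tokens: list[int]) -> int:
        """How many leading tokens of the new prompt are already covered by cached KV."""
        if session_id not in self._store:
            return 0
        cached_tokens, _ = self._store[session_id]
        matched = 0
        for cached, new in zip(cached_tokens, new_tokens):
            if cached != new:
                break
            matched += 1
        return matched   # prefill can skip this many tokens
```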
The existing /v1/chat/completions stateless endpoint is completely unchanged.
KV cache compression based on TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Zandieh, Daliri, Hadian, Mirrokni — Google Research / NYU, ICLR 2026).
- Rotate each KV vector by a random orthogonal matrix (QR decomposition, computed once per head dimension)
- After rotation, coordinates are approximately i.i.d. from a known distribution (Beta converging to Gaussian)
- Quantize each coordinate independently using precomputed Lloyd-Max centroids (3 bits = 8 levels)
- Store the quantized indices (uint8) + vector norms (float16)
- Dequantize: look up centroids, rotate back, rescale
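The steps above, as a rough NumPy sketch for a single KV vector. The centroids are approximate 3-bit Lloyd-Max levels for a unit Gaussian; the fork's actual tables and per-head handling will differ:

```python
import numpy as np

HEAD_DIM = 128
rng = np.random.default_rng(0)

# Random orthogonal rotation, computed once per head dimension (QR of a Gaussian matrix).
Q, _ = np.linalg.qr(rng.standard_normal((HEAD_DIM, HEAD_DIM)))

# Precomputed Lloyd-Max centroids for 3 bits (8 levels), approximate values for N(0, 1).
CENTROIDS = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])

def compress(v: np.ndarray) -> tuple[np.ndarray, np.float16]:
    norm = np.linalg.norm(v)
    rotated = (Q @ v) / norm * np.sqrt(HEAD_DIM)      # coordinates become roughly i.i.d. standard normal
    idx = np.abs(rotated[:, None] - CENTROIDS[None, :]).argmin(axis=1)   # nearest centroid per coordinate
    return idx.astype(np.uint8), np.float16(norm)     # store uint8 indices + float16 norm

def decompress(idx: np.ndarray, norm: np.float16) -> np.ndarray:
    rotated = CENTROIDS[idx]                          # look up centroids
    return (Q.T @ rotated) * (float(norm) / np.sqrt(HEAD_DIM))   # rotate back, rescale

v = rng.standard_normal(HEAD_DIM).astype(np.float32)
v_hat = decompress(*compress(v))
print("per-coordinate MSE:", float(np.mean((v - v_hat) ** 2)))   # ~0.034 for a unit-variance input
```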
Tested on Qwen3.5-VL-122B-A10B (MoE, head_dim 128/256):
- 4x compression (3-bit indices stored as uint8)
- MSE 0.034 (theoretical bound: 0.043)
- Zero quality degradation on multi-turn conversations
- Enabled by default, configurable via `--kv-compress-bits 0/2/3/4`
8-turn coding conversation on Qwen3.5-VL-122B-A10B-4bit-CRACK (M3 Ultra, 96GB):
```
        Stateless               Session + TurboQuant
Turn    time    prompt  hit     time    ttft    prompt  cached  hit
1       18.13s  56      0%      13.89s  0.216s  56      0       0%
2       2.65s   126     0%      2.64s   0.352s  126     54      42%
3       2.74s   181     0%      2.58s   0.288s  181     124     68%
4       2.83s   253     0%      2.63s   0.329s  253     179     70%
5       2.91s   318     0%      2.62s   0.319s  318     251     78%
6       3.01s   396     0%      2.64s   0.337s  396     316     79%
7       3.08s   452     0%      2.60s   0.292s  452     394     87%
8       3.16s   515     0%      2.74s   0.423s  515     450     87%
        ------                  ------
Total:  38.5s                   32.3s   (+16% faster)

Session memory: 45.4 MB (TurboQuant 3-bit compressed)
Park to SSD: 0.12s | Resume: instant
```
Cache hits compound every turn. By turn 8, 87% of the prompt is served from session KV cache. Sessions are faster than stateless at ~500 tokens — the gap widens with longer conversations.
Run the benchmark yourself:
```bash
python scripts/benchmark_sessions.py \
  --url http://localhost:8080 \
  --model YOUR_MODEL \
  --turns 10 --max-tokens 128
```

Everything from upstream oMLX, plus sessions and TurboQuant.
- Tiered KV Cache — hot (RAM) + cold (SSD) with prefix sharing and copy-on-write
- Continuous Batching — concurrent request handling via mlx-lm BatchGenerator
- Multi-Model Serving — LLMs, VLMs, embeddings, rerankers with LRU eviction
- Vision-Language Models — multi-image chat, OCR auto-detection
- Admin Dashboard — real-time monitoring, model management, chat UI, benchmarks
- macOS Menubar App — native, not Electron
- Per-Model Settings — sampling params, TTL, aliases, chat template kwargs
- Tool Calling — function calling, JSON schema, MCP integration
- API Compatibility — OpenAI and Anthropic drop-in replacement
- Persistent Sessions — KV cache retained across turns, park/resume to SSD
- TurboQuant Compression — 4x KV memory reduction with near-zero quality loss
- Session-Aware Memory Enforcement — auto-parks LRU sessions under memory pressure
- Thinking Model Support — handles generation prompt tokens correctly for reasoning models
```bash
# Basic: serve models with SSD cache
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache

# Memory limits
omlx serve --model-dir ~/models --max-model-memory 32GB --max-process-memory 80%

# Hot cache + large initial block pool
omlx serve --model-dir ~/models --hot-cache-max-size 4GB --initial-cache-blocks 512

# With MCP tools
omlx serve --model-dir ~/models --mcp-config mcp.json

# API key authentication
omlx serve --model-dir ~/models --api-key your-secret-key
```

Sessions and TurboQuant are enabled automatically. No additional flags needed.
Architecture
```
FastAPI Server (OpenAI / Anthropic / Sessions API)
│
├── SessionManager (session lifecycle, KV store, park/resume)
│   └── SessionKVStore (in-memory, TurboQuant compressed)
│       └── TurboQuant (rotate → quantize → store, 4x compression)
│
├── EnginePool (multi-model, LRU eviction, TTL)
│   ├── BatchedEngine (LLMs, continuous batching)
│   ├── VLMEngine (vision-language models)
│   ├── EmbeddingEngine
│   └── RerankerEngine
│
├── ProcessMemoryEnforcer (memory limit + session auto-park)
│
├── Scheduler (FCFS, session cache injection)
│   └── mlx-lm BatchGenerator
│
└── Cache Stack
    ├── PagedCacheManager (block-based, CoW, prefix sharing)
    ├── Hot Cache (in-memory tier, write-back)
    └── PagedSSDCacheManager (SSD cold tier, safetensors)
```
Point `--model-dir` at a directory containing MLX-format model subdirectories.
| Type | Models |
|---|---|
| LLM | Any model supported by mlx-lm |
| VLM | Qwen3.5 Series, GLM-4V, Pixtral, and other mlx-vlm models |
| Embedding | BERT, BGE-M3, ModernBERT |
| Reranker | ModernBERT, XLM-RoBERTa |
- jundot/omlx — the foundation. Continuous batching, tiered KV caching, multi-model serving, admin dashboard, macOS app. This fork builds directly on their work.
- MLX and mlx-lm by Apple
- mlx-vlm — Vision-language model inference on Apple Silicon
- TurboQuant (Zandieh et al., ICLR 2026) — KV cache quantization algorithm
- PolarQuant (Han et al., AISTATS 2026) — random preconditioning for KV compression
- vllm-mlx — where oMLX started
- venvstacks — portable Python environment layering
- mlx-embeddings — embedding model support