🚀 TurboQuant KV cache compression is now in vLLM (PR #38479, merged April 2026):
--kv-cache-dtype turboquant_k8v4and friends, with fused Triton store/decode kernels. The PR discussion drew on the asymmetric K/V findings from this repo. Upstream llama.cpp has merged the core idea too: Hadamard KV cache rotation (#21038, citing TurboQuant directly) with fast WHT kernels on CPU (#22631), CUDA (#23615), and Vulkan (#23687). Rotation + the stock q4_0 cache is essentially turbo4's rotation stage; the PolarQuant codebook, norm extraction, and asymmetric policies remain here and in the fork.
Implementation of TurboQuant (ICLR 2026) with implementation work, experiments, and follow-on findings beyond the base paper. Compresses transformer KV cache 3.8-6.4x using PolarQuant + Walsh-Hadamard rotation, at near q8_0 prefill speed and ~0.9x decode throughput at long context. Validated end-to-end from 1.5B to 104B at 128K context on a MacBook (turbo3, PPL 4.024, 74 GB peak memory).
This repository is the research home: the Python reference implementation, the validation papers, and the benchmark data. To run TurboQuant in an inference engine, pick from the table below. Pieces that prove useful and stable get upstreamed incrementally as small, reviewable patches.
| Engine | Platform | Status | Notes |
|---|---|---|---|
| vLLM | CUDA / ROCm, datacenter | Upstream, merged | --kv-cache-dtype turboquant_k8v4 and friends (PR #38479) |
| llama.cpp | All backends | Upstream, rotation merged | Hadamard KV cache rotation (#21038) + fast WHT kernels (CPU #22631, CUDA #23615, Vulkan #23687). Rotation + q4_0 cache approximates turbo4; full PolarQuant codec is in the fork below |
| llama-cpp-turboquant | Metal, CUDA, HIP, CPU | Production fork | turbo2/3/4 KV cache + TQ3_1S/TQ4_1S weight formats; prebuilt binaries for Mac (Metal) and Windows (CUDA) |
| mlx-swift-lm | Apple Silicon, Swift | Active collaboration | Fastest Apple path: ~2.5x faster decode than Python mlx-lm, full TQ+ support including turbo4v2. 144 tok/s on Qwen3.5-35B-A3B MoE at 4K on M5 Max |
| vllm-swift | Apple Silicon, Swift | Active | OpenAI-compatible serving built on mlx-swift-lm; no Python in the inference hot path |
| Atlas | Rust, Metal | Integrated | turbo4 KV cache append/decode kernels |
| mlxcel | Rust, MLX | Community port | Turbo KV cache docs. Thanks to @inureyes and the Lablup team for the careful port and upstream attribution |
| MLX Python fork | Apple Silicon, Python | Experimental | TurboKVCache drop-in for mlx-lm and mlx-vlm; see docs/mlx-port.md |
| turboquant-pytorch | PyTorch | Independent implementation | From-scratch PyTorch TurboQuant for KV cache; applies this repo's layer-adaptive compression and Sparse V findings |
| turboquant-vllm | CUDA, vLLM package | Community package | TQ+ KV cache for vLLM with fused CUDA kernels; builds on this repo's block_size=128 and K-dominance research |
| Quansloth | Local server product | Community product | Air-gapped local AI server built on the TurboQuant+ stack |
Three follow-on findings, independently validated by multiple researchers across different hardware and backends:
- V compression is free. Compressing the value cache (even down to 2 bits) has zero measurable effect on attention quality when key precision is maintained. Confirmed on Metal (M5 Max), CUDA RTX 4090 (@sztlink), and CUDA RTX 3090 (@HyperionMS2040). See asymmetric K/V paper.
- All quality degradation comes from K compression. This is why asymmetric configs (q8_0-K + turbo-V) rescue models where symmetric fails. Validated across Qwen, Llama, Mistral, and Command-R+ families. See M5 Max stress test.
- Boundary layers are disproportionately sensitive. Protecting the first 2 + last 2 layers at higher precision recovers 37-91% of the quality gap. See Boundary V paper.
Additional experiments and writeups: Sparse V dequant (+22.8% decode at 32K, not TurboQuant-specific), block size optimization (5.12x compression), turbo4 resurrection (QJL hurts, PolarQuant works), EDEN optimal-S response (rotation is first-order, scale is second-order).
| Cache Type | Bits/val | Compression | PPL (wikitext-2, 512c) | vs q8_0 |
|---|---|---|---|---|
| f16 | 16.0 | 1.0x | 6.121 | -0.16% |
| q8_0 | 8.5 | 1.9x | 6.111 | baseline |
| turbo4 | 4.25 | 3.8x | 6.125 | +0.23% |
| q4_0 | 4.5 | 3.6x | 6.142 | +0.52% |
| turbo3 | 3.5† | 4.6x† | 6.176 | +1.06% |
| turbo2 | 2.5 | 6.4x | 6.507 | +6.48% |
turbo4 (4-bit PolarQuant) has the best quality after q8_0 — closer to q8_0 than q4_0, at better compression. turbo3 trades quality for maximum compression. turbo2 (2-bit) trades more quality for extreme compression — best used asymmetrically.
†turbo3 at default block_size=32. At block_size=128, turbo3 achieves 3.125 bits/val and 5.12x compression with identical PPL. See block size study.
Important: choosing the right config for your model. TurboQuant quality depends on your base weight quantization. Models with Q8_0+ weights work well with symmetric turbo (e.g.,
-ctk turbo3 -ctv turbo3). Some low-bit models with Q4_K_M weights may benefit from asymmetric K/V: use-ctk q8_0 -ctv turbo4to keep K precision high while compressing V. K precision is the dominant quality factor because it controls attention routing via softmax. Bigger models absorb quantization stacking better (104B: +3.6% vs 70B: +11.4% for turbo3). Validate on your specific model. See Configuration Recommendations for the full tested matrix.
Everything else lives in docs/benchmarks.md: asymmetric K/V and Boundary V tables, prefill context scaling, MoE and dense decode speed, NIAH retrieval, KL divergence, 70B/104B stress tests, community results on RTX 3090, M1 Max, and AMD RX 9070 XT, and the speed optimization journey.
git clone https://github.com/TheTom/turboquant_plus.git
cd turboquant_plus
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
# Verify — should print "141 passed"
python3 -m pytest tests/ -v
# Quick compression demo (no model needed)
python3 benchmarks/demo.py
# Validate on real model KV tensors (downloads Qwen3-1.7B, ~4GB)
pip install transformers torch accelerate
python3 benchmarks/validate_real_model.pyRequires Python >= 3.10, NumPy >= 1.24, SciPy >= 1.10. torch / transformers / accelerate are optional (real-model validation only).
The fastest path is a prebuilt binary, or build from the fork — see the llama-cpp-turboquant quick start for build instructions, supported backends, and usage details.
# Server mode with TurboQuant KV cache
./build/bin/llama-server -m models/your-model.gguf \
--jinja -ngl 99 -c 262144 -fa on \
--cache-type-k turbo3 --cache-type-v turbo3 \
-np 1 --host 0.0.0.0 --port 8080| Flag | Bits/val | Compression vs fp16 | Description |
|---|---|---|---|
turbo3 |
3.5† | 4.6x† | 3-bit PolarQuant + WHT rotation. Best compression, q8_0 speed. |
turbo4 |
4.25 | 3.8x | 4-bit PolarQuant (16 centroids). Best quality. |
q8_0 |
8 | 2.0x | llama.cpp default quantized cache. |
q4_0 |
4 | 4.0x | llama.cpp 4-bit cache. |
For per-model configuration guidance (symmetric vs asymmetric, Boundary V, large-model memory caps), see the Getting Started Guide and Configuration Recommendations.
Input: KV cache vector x ∈ R^d (one attention head)
│
├── Extract norm: γ = ||x||, x̂ = x/γ
│
├── Random rotation: WHT + random sign flips
│ coordinates ~ N(0, 1/d) after rotation
│
├── Optimal scalar quantization (Lloyd-Max)
│ turbo4: 16 centroids (4-bit), turbo3: 8 centroids (3-bit), turbo2: 4 centroids (2-bit)
│
└── Output: quantized indices + norm per block
Compression: 3.8x (turbo4), 5.1x (turbo3), 7.5x (turbo2)
Note on QJL: reference only, not used in production.
The original TurboQuant paper (Zandieh et al. 2024, arXiv 2406.03482) includes a 1-bit QJL error-correction stage. The Python
qjl.pyhere implements it for paper reproducibility.Production drops QJL on both K and V. QJL eliminates reconstruction bias but amplifies variance, which softmax turns into attention noise. Five independent groups confirmed (buun, scos-lab, Arclabs001, +2). See turbo4-resurrection.md for the full ablation and mechanism.
If you're building on this repo: use
TurboQuantMSE(V cache), or implement straight 4 to 8 bit PolarQuant on K. Only enable theQJL/TurboQuant(with QJL) classes if you are reproducing the original paper or doing K-side research below 8-bit.
Project Structure
turboquant/
├── rotation.py # Walsh-Hadamard Transform + random sign flips
├── codebook.py # Lloyd-Max optimal centroid computation
├── polar_quant.py # PolarQuant — norm extraction + WHT rotation + scalar quantization
├── qjl.py # QJL 1-bit quantizer (paper-faithful reference, see README §QJL). Not used in production.
├── turboquant.py # Full TurboQuant pipeline
├── kv_cache.py # KV cache integration layer
├── outlier.py # Outlier channel strategy (2.5-bit, 3.5-bit)
├── lloyd_max.py # Lloyd-Max quantizer implementation
├── utils.py # Bit packing, memory measurement
├── isoquant.py # IsoQuant (quaternion SO(4)) experimental comparison
└── rotorquant.py # RotorQuant experimental comparison
tests/ # 14 test files, 500+ tests
benchmarks/
├── demo.py # Quick compression demo
├── run_benchmark.py # Server-based benchmark runner
├── benchmark_results.md # Full benchmark report
├── benchmark_llama.sh # llama.cpp benchmark script
├── benchmark_norm_correction.py # Norm correction validation
├── benchmark_ppl_tq_vs_rq.py # TurboQuant vs RotorQuant PPL comparison
├── temporal_decay_prototype.py # Temporal decay experiment
├── test_with_llama.py # Integration test at Qwen 3.5 dimensions
├── test_outlier_comparison.py # Outlier strategy comparison
└── validate_real_model.py # Real model KV tensor validation
docs/
├── benchmarks.md # Full benchmark and validation data
├── mlx-port.md # MLX framework port (Python)
├── changelog.md # v1 milestone history
├── turboquant-recommendations.md # Configuration guide (tested matrix)
├── windows-rdna4-setup.md # Windows + AMD RDNA 4 build guide
├── papers/ # Validation papers and experiment writeups
└── (25+ engineering docs, investigations, experiment logs)
| Phase | Status | Details |
|---|---|---|
| Core algorithms (NumPy) | ✅ | 500+ tests across 14 test files |
| Distortion validation | ✅ | Matches paper bounds (Table 2) |
| Real model validation | ✅ | Rotation validated on Qwen3 KV tensors (kurtosis 900→2.9) |
| llama.cpp C port | ✅ | Metal GPU inference working on M1 through M5 |
| Metal shader optimization | ✅ | q8_0 speed parity: prefill matches or beats q8_0 |
| CUDA backend | ✅ | Community-tested on RTX 3080 Ti/3090/4090/5090, DGX Spark Blackwell |
| HIP/AMD backend | ✅ | RX 9070 XT (RDNA 4) validated, gfx1201 native |
| Asymmetric K/V | ✅ | q8_0-K + turbo-V rescues Q4_K_M models |
| Boundary V | ✅ | Layer-aware V compression, 37-91% quality recovery |
| Sparse V | ✅ | Attention-gated dequant skip, +22.8% decode on MoE. Upstream PR #21119 |
| Block size optimization | ✅ | 32→128, 12% better compression, zero quality cost |
| vLLM upstream | ✅ | Merged as the TurboQuant attention backend (PR #38479) |
| llama.cpp upstream (rotation) | ✅ | Hadamard KV rotation + FWHT kernels merged upstream (#21038, #22631) |
| Upstream coordination | 🔄 | llama.cpp PR preparation for the full codec (#27) |
| TurboQuant+ extensions | ⏳ | Adaptive bits, temporal decay, MoE-aware compression |
| MLX Swift port | 🔄 | Active collaboration with @ekryski on mlx-swift-lm — turbo4v2 working |
- TurboQuant: arXiv 2504.19874 (ICLR 2026)
- PolarQuant: arXiv 2502.02617 (AISTATS 2026)
- QJL: arXiv 2406.03482
- Google Research Blog: TurboQuant: Redefining AI Efficiency
- Benchmarks — full quality/speed/retrieval data, community hardware results
- MLX Port — the experimental MLX Python port: results, quickstarts, MM-NIAH
- Changelog — v1 milestone history
- Quality Benchmarks — perplexity validation, bisection log
- Speed Investigation — Metal gotchas, fp16 WHT results, optimization history
- Speed Experiments — the full 739 → 2747 tok/s optimization journey
- Context Scaling Deep Dive — why turbo3 degraded at long context, how we fixed it
- Pre-Rotate-Queries Investigation — why graph-side WHT failed initially
- Quality + Speed Gate — pre-push script checking PPL AND context scaling ratio
Issues and PRs welcome. The main areas where help is needed:
- Upstream PR — prepare llama.cpp contribution (CONTRIBUTING.md requirements)
- CUDA kernel optimization — fused FA kernels, decode speed parity
- MLX memory recovery — implement FP16 KV drop + compressed-only attention for memory-constrained long context
- Quality metrics — multi-run statistics, additional task benchmarks (GSM8K, code gen, reasoning)
- Long context validation — 64K+ testing across architectures
If you find this work useful, you can support it via GitHub Sponsors or BTC:
BTC: bc1qsfaaf6mkz2yxx2vavg2n0zgsf3qj25uh94t83rwuq7de67dey05sc3tgjx
Commercial support: For inference optimization and KV cache tuning engagements, DM @no_stp_on_snek on X.
Apache License 2.0 — see LICENSE.
Copyright 2026 Tom Turney.
Based on Google Research's TurboQuant paper (arXiv 2504.19874).