Near-Linear Attention in Theory and in Practice
Tobias Schröder & Lester Mackey
WildCat (Weighted Iterative Low-rank Decomposition for Coreset ATtention) is a drop-in replacement for scaled dot-product attention that faithfully approximates exact attention in near-linear time. The core of WildCat is compress_kv, an efficient algorithm for compressing the key-value sequence into a small weighted coreset. WildCat can be used to either accelerate non-causal attention at inference time or to compress a pre-computed KV cache to near-constant size.
WildCat was tested with Python 3.12. Install directly from GitHub:
pip install git+https://github.com/microsoft/wildcat.gitOr clone and install locally:
git clone https://github.com/microsoft/wildcat.git
pip install -e wildcatThe torch and numpy dependencies will be installed automatically.
The WildCat module can be used as drop-in replacement for standard attention implementations at inference time. Causal masking is not yet supported.
import torch
from wildcat import WildCat
# Initialise module
attn = WildCat(
r=128, # coreset size (number of KV pairs to keep)
num_bins=1, # compression is distributed across bins
subsample_ratio=0.25, # fallback compression ratio when r is not set
precompile = True # use pre-compilation on GPU for additional performance gains
)
B, H, N, D = 2, 8, 4096, 64
queries = torch.randn(B, H, N, D)
keys = torch.randn(B, H, N, D)
values = torch.randn(B, H, N, D)
# Drop-in replacement for standard attention
output = attn(queries, keys, values) # shape: (B, H, N, D)The compress_kv function can also be used standalone to compress a KV cache:
from wildcat.compress_kv import compress_kv
cmpd_keys, cmpd_values, weights = compress_kv(keys, values, r=128)We tested WildCat on image generation, image classification, and KV cache compression for long context language understanding tasks. To replicate an experiment, navigate to the corresponding examples subfolder and follow the setup instructions.
- BigGAN Image Generation: examples/biggan
- T2T-ViT ImageNet Classification: examples/t2t
- KV Cache Compression for long-context language understanding: examples/kvcache
The goal of WildCat is the approximation of the softmax (or scaled dot-product) attention mechanism
for
The weights
The parameter
The coreset indices rp_nystrom. As a result, the compression is fast and numerically stable, requiring only
WildCat: Near-Linear Attention in Theory and Practice
@inproceedings{
schroder2026wildcat,
title={WildCat: Near-Linear Attention in Theory and Practice},
author={Schr{\"o}der, Tobias and Mackey, Lester},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=lfqyLp4hZm}
}This project is licensed under the MIT License.