A causal, streaming linear attention mechanism that realizes higher‑order interactions via compact prefix statistics, with exact masked identities and associative scans enabling parallel training that matches recurrent computations.
Authors: Yifan Zhang, Zhen Qin, Quanquan Gu
[Webpage] [Huggingface]
The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any n × n matrices.
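To make the constant-size-state idea concrete, here is a minimal NumPy sketch of a second-order streaming mechanism in the spirit of the abstract: each token's keys and value are folded into fixed-size prefix statistics, and every output is read directly from those statistics, so no T × T matrix is ever formed. The two query/key streams, the bilinear update `S += (k1 ⊗ k2) ⊗ v`, and the normalizer are illustrative assumptions for exposition, not the paper's exact recurrence.

```python
import numpy as np

def second_order_streaming_sketch(Q1, Q2, K1, K2, V, eps=1e-6):
    """Illustrative second-order streaming attention (not the paper's exact recurrence).

    Maintains constant-size prefix statistics
        S_t = sum_{i<=t} (k1_i outer k2_i) outer v_i    # shape (d, d, d_v)
        Z_t = sum_{i<=t} (k1_i outer k2_i)              # shape (d, d)
    and reads out each token as a (q1_t, q2_t)-contraction of S_t,
    normalized by q1_t^T Z_t q2_t. Per-token cost is O(d^2 d_v);
    no T x T matrix is materialized.
    """
    T, d = K1.shape
    d_v = V.shape[1]
    S = np.zeros((d, d, d_v))
    Z = np.zeros((d, d))
    Y = np.zeros((T, d_v))
    for t in range(T):
        kk = np.outer(K1[t], K2[t])                      # second-order key interaction
        S += kk[:, :, None] * V[t][None, None, :]        # fold token t into the statistic
        Z += kk
        num = np.einsum('i,ijv,j->v', Q1[t], S, Q2[t])   # query the prefix statistic
        den = Q1[t] @ Z @ Q2[t] + eps
        Y[t] = num / den
    return Y

# Tiny usage check with nonnegative features so the normalizer stays positive.
rng = np.random.default_rng(0)
T, d, d_v = 16, 8, 8
Q1, Q2, K1, K2 = (np.abs(rng.normal(size=(T, d))) for _ in range(4))
V = rng.normal(size=(T, d_v))
print(second_order_streaming_sketch(Q1, Q2, K1, K2, V).shape)   # (16, 8)
```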
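The claim that associative scans enable parallel training matching the recurrent computation can be illustrated on plain first-order linear attention, where the state update is a sum and therefore associative: per-chunk state contributions combine with a prefix scan, and the chunked computation reproduces the serial recurrence exactly. The sketch below is a simplified first-order stand-in for exposition, not HLA's second-order chunk-parallel scheme.

```python
import numpy as np

def serial_linear_attention(Q, K, V):
    """Reference serial recurrence: S_t = S_{t-1} + k_t v_t^T,  y_t = q_t^T S_t."""
    T, d = K.shape
    S = np.zeros((d, V.shape[1]))
    Y = np.zeros((T, V.shape[1]))
    for t in range(T):
        S += np.outer(K[t], V[t])
        Y[t] = Q[t] @ S
    return Y

def chunk_parallel_linear_attention(Q, K, V, chunk=4):
    """Chunked form of the same computation.

    The state update is a plain sum, hence associative: an exclusive prefix
    scan over per-chunk state sums gives each chunk its carried-in state, and
    the remaining intra-chunk terms use a small causal mask. Outputs match the
    serial recurrence up to floating-point rounding.
    """
    T, d = K.shape
    d_v = V.shape[1]
    n = T // chunk                                   # assume T divisible by chunk
    Qc, Kc = Q.reshape(n, chunk, d), K.reshape(n, chunk, d)
    Vc = V.reshape(n, chunk, d_v)

    chunk_states = np.einsum('ncd,ncv->ndv', Kc, Vc)     # per-chunk sum of k v^T
    carried = np.zeros_like(chunk_states)
    carried[1:] = np.cumsum(chunk_states, axis=0)[:-1]   # exclusive scan over chunks

    inter = np.einsum('ncd,ndv->ncv', Qc, carried)       # queries vs. carried-in state
    scores = np.einsum('ncd,nsd->ncs', Qc, Kc)           # intra-chunk interactions
    intra = np.einsum('ncs,nsv->ncv', scores * np.tril(np.ones((chunk, chunk))), Vc)
    return (inter + intra).reshape(T, d_v)

# Equivalence check: the chunk-parallel form reproduces the serial activations.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
assert np.allclose(serial_linear_attention(Q, K, V),
                   chunk_parallel_linear_attention(Q, K, V, chunk=4))
```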
@article{zhang2025higher,
title = {Higher-order Linear Attention},
author = {Zhang, Yifan and Qin, Zhen and Gu, Quanquan},
journal = {arXiv preprint arXiv:2510.27258},
year = {2025}
}