We highly recommend using the `spas_sage2_attn_meansim_topk_cuda` and `block_sparse_sage2_attn_cuda` APIs. They are plug-and-play and customizable:
```python
from spas_sage_attn import spas_sage2_attn_meansim_topk_cuda

attn_output = spas_sage2_attn_meansim_topk_cuda(q, k, v, topk=0.5, is_causal=False)
```

You can adjust `topk` to balance between attention accuracy (a higher `topk` is more accurate) and sparsity (a lower `topk` is more sparse).
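For reference, a minimal self-contained call might look like the sketch below. The `(batch, heads, seq_len, head_dim)` tensor layout, the half-precision dtype, and the concrete sizes are assumptions for illustration; adapt them to your model.

```python
import torch
from spas_sage_attn import spas_sage2_attn_meansim_topk_cuda

# Assumed example shapes: (batch, heads, seq_len, head_dim).
batch, heads, seq_len, head_dim = 2, 16, 4096, 128
q = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# topk trades accuracy (higher) against sparsity (lower), as described above.
attn_output = spas_sage2_attn_meansim_topk_cuda(q, k, v, topk=0.5, is_causal=False)
print(attn_output.shape)  # expected to match q's shape
```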
```python
from spas_sage_attn import block_sparse_sage2_attn_cuda

attn_output = block_sparse_sage2_attn_cuda(q, k, v, mask_id=None)
```

This API supports computing attention with an arbitrary block-sparse mask per attention head. Specifically, the per-head attention mask `mask_id` has shape `(batch_size, num_heads, ⌈seq_len / 128⌉, ⌈seq_len / 64⌉)` and consists of 0s and 1s. Currently, the block size is 128×64.
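As an illustration of the mask layout described above, the following sketch builds a random 0/1 block mask of the stated shape and passes it in. The tensor sizes and the integer dtype of the mask are assumptions, not part of the documented API.

```python
import math
import torch
from spas_sage_attn import block_sparse_sage2_attn_cuda

batch, heads, seq_len, head_dim = 2, 16, 4096, 128  # assumed example sizes
q = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Per-head block mask: 1 keeps a (128 query x 64 key) block, 0 skips it.
# The 0/1 integer dtype here is an assumption for illustration.
q_blocks = math.ceil(seq_len / 128)
k_blocks = math.ceil(seq_len / 64)
mask_id = (torch.rand(batch, heads, q_blocks, k_blocks, device="cuda") < 0.5).int()

attn_output = block_sparse_sage2_attn_cuda(q, k, v, mask_id=mask_id)
```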
This is the official implementation of SpargeAttn, a universal, training-free sparse attention method that accelerates language, image, and video models.
- Please use the `spas_sage2_attn_meansim_topk_cuda` and `block_sparse_sage2_attn_cuda` APIs.
- [2025-07]: Released a Triton kernel example.
- [2025-06]: SpargeAttn based on SageAttention2++ is released.
- [2025-05]: Added a very simple usage without tuning or calibration: `o = spas_sage2_attn_meansim_topk_cuda(q, k, v)`.
- [2025-05]: 🎉 SpargeAttn and SageAttention2 are accepted by ICML 2025!
- [2025-03]: Support high acceleration on more GPUs, e.g., H100.
- `python>=3.9`, `torch>=2.3.0`
- CUDA: `>=12.8` for Blackwell, `>=12.4` for fp8 support on Ada, `>=12.3` for fp8 support on Hopper, `>=12.0` for Ampere
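As a quick sanity check of the base environment (a small sketch; the version thresholds are the ones listed above):

```python
import torch

# Check the PyTorch and CUDA toolkit versions against the requirements above.
print("torch:", torch.__version__)           # needs >= 2.3.0
print("CUDA (build):", torch.version.cuda)   # e.g. 12.4 for fp8 support on Ada
if torch.cuda.is_available():
    print("GPU capability:", torch.cuda.get_device_capability())  # e.g. (8, 9) Ada, (9, 0) Hopper
```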
```bash
pip install ninja        # for parallel compilation
python setup.py install  # or pip install -e .
```
- `spas_sage2_attn_meansim_topk_cuda`: SpargeAttn based on SageAttention2, which we recommend using.
- `spas_sage2_attn_meansim_cuda`: SpargeAttn based on SageAttention2, which we do not recommend.
- `spas_sage_attn_meansim_topk_cuda`: SpargeAttn based on SageAttention, which we recommend using.
- `spas_sage_attn_meansim_cuda`: SpargeAttn based on SageAttention, which we do not recommend.
Just replace the `torch.nn.functional.scaled_dot_product_attention` API with `spas_sage2_attn_meansim_topk_cuda`:
```diff
  from spas_sage_attn import spas_sage2_attn_meansim_topk_cuda

- attn_output = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=False)  # is_causal can be True
+ attn_output = spas_sage2_attn_meansim_topk_cuda(q, k, v, topk=0.5, is_causal=False)       # is_causal can be True
```
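For example, a drop-in swap inside a typical attention forward pass might look like the sketch below. The surrounding function, the tensor layout, and the dense fallback are hypothetical; only the sparse attention call itself comes from the API above.

```python
import torch
import torch.nn.functional as F
from spas_sage_attn import spas_sage2_attn_meansim_topk_cuda

def attention_forward(q, k, v, use_sparge=True):
    """Hypothetical attention forward; q, k, v assumed to be (batch, heads, seq_len, head_dim)."""
    if use_sparge and q.is_cuda:
        # Sparse path: replaces the dense SDPA call below.
        return spas_sage2_attn_meansim_topk_cuda(q, k, v, topk=0.5, is_causal=False)
    # Dense fallback path.
    return F.scaled_dot_product_attention(q, k, v, is_causal=False)
```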
@inproceedings{zhang2025spargeattn,
title={Spargeattn: Accurate sparse attention accelerating any model inference},
author={Zhang, Jintao and Xiang, Chendong and Huang, Haofeng and Wei, Jia and Xi, Haocheng and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Machine Learning (ICML)},
year={2025}
}
@inproceedings{zhang2025sageattention,
title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Learning Representations (ICLR)},
year={2025}
}
@inproceedings{zhang2024sageattention2,
title={Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization},
author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Machine Learning (ICML)},
year={2025}
}