Stars

3 results for source (non-fork) starred repositories written in CUDA
[ICML 2025] SpargeAttention: a training-free sparse attention method that accelerates inference for any model.
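For context on what "sparse attention" computes: a minimal NumPy sketch of attention restricted to a sparsity mask. This is an illustrative dense reference only, not SpargeAttention's algorithm or kernels; the function name and causal mask here are assumptions for the example.

```python
import numpy as np

def sparse_attention(q, k, v, mask):
    """Attention where positions outside `mask` are excluded (scored -inf).

    q, k, v: (n, d) arrays; mask: (n, n) boolean, True = attend.
    Each row of `mask` must keep at least one position.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)       # drop masked positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)          # softmax over kept positions
    return w @ v

n, d = 4, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
mask = np.tri(n, dtype=bool)                       # causal mask as an example pattern
out = sparse_attention(q, k, v, mask)
```

With a causal mask, row 0 attends only to itself, so the first output row equals `v[0]`.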
FlashSparse significantly reduces computation redundancy for unstructured sparsity (SpMM and SDDMM) on Tensor Cores through a Swap-and-Transpose mapping strategy. FlashSparse is accepted by…
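SpMM and SDDMM are the two sparse kernels named above; a minimal NumPy sketch of what each computes (definitions only, not FlashSparse's Tensor Core implementation; the shapes and density are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unstructured sparse matrix A (roughly 90% zeros) and a dense matrix B.
A = rng.random((8, 8)) * (rng.random((8, 8)) < 0.1)
B = rng.random((8, 4))

# SpMM: sparse-times-dense matrix multiply, C = A @ B.
C_spmm = A @ B

# SDDMM: dense-times-dense product sampled at A's nonzero pattern,
# i.e. (B1 @ B2) kept only where A is nonzero.
B1 = rng.random((8, 4))
B2 = rng.random((4, 8))
mask = A != 0
C_sddmm = (B1 @ B2) * mask
```

A real kernel would store A in a compressed sparse format and skip the zero entries; the dense masking here only shows the operations' semantics.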
Fast and low-memory attention layer written in CUDA