A simple implementation of PagedAttention purely written in CUDA and C++.
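As a hedged conceptual sketch (not this repository's actual code), PagedAttention's core idea is to split the KV cache into fixed-size blocks and use a per-sequence block table to map logical token positions to non-contiguous physical blocks; the names `PagedKVCache`, `block_table`, and the block/head sizes below are illustrative assumptions.

```cpp
// Conceptual sketch of PagedAttention-style KV-cache indexing (illustrative only).
// The KV cache is split into fixed-size blocks; a per-sequence block table maps
// logical token positions to physical blocks, so a sequence's cache need not be
// contiguous in memory.
#include <cstdint>
#include <vector>

constexpr int kBlockSize = 16;   // tokens per KV-cache block (assumed)
constexpr int kHeadDim   = 128;  // head dimension (assumed)

struct PagedKVCache {
    // Flattened [num_physical_blocks][kBlockSize][kHeadDim] key storage.
    std::vector<float> key_blocks;

    // Returns a pointer to the key vector for logical token `pos` of a sequence
    // whose block table is `block_table`.
    const float* key_at(const std::vector<int32_t>& block_table, int pos) const {
        int physical_block  = block_table[pos / kBlockSize];  // logical -> physical indirection
        int offset_in_block = pos % kBlockSize;
        return key_blocks.data() +
               (static_cast<size_t>(physical_block) * kBlockSize + offset_in_block) * kHeadDim;
    }
};
```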
An easy, naive flash attention implementation without optimizations, based on the original paper.
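For reference, the key idea behind flash attention is the online-softmax recurrence from the original paper: scores are streamed and the running max, normalizer, and output accumulator are rescaled as each new score arrives. The sketch below is a minimal single-threaded C++ reference of that recurrence (processing one key/value at a time, i.e. a tile size of 1); it is not this repository's CUDA kernel, and the function name is hypothetical.

```cpp
// Minimal reference of the online-softmax recurrence used by flash attention
// (conceptual sketch, not an optimized kernel). For one query vector q, keys
// and values are streamed; the running max m, normalizer l, and output
// accumulator o are rescaled for each new score.
#include <cmath>
#include <vector>

std::vector<float> naive_flash_attention_row(const std::vector<float>& q,
                                             const std::vector<std::vector<float>>& K,
                                             const std::vector<std::vector<float>>& V) {
    const size_t d = q.size();
    float m = -INFINITY, l = 0.0f;             // running max and softmax normalizer
    std::vector<float> o(d, 0.0f);             // unnormalized output accumulator
    for (size_t j = 0; j < K.size(); ++j) {
        float s = 0.0f;                        // score s_j = (q . k_j) / sqrt(d)
        for (size_t t = 0; t < d; ++t) s += q[t] * K[j][t];
        s /= std::sqrt(static_cast<float>(d));
        float m_new = std::max(m, s);
        float scale = std::exp(m - m_new);     // rescale previous accumulators
        float p     = std::exp(s - m_new);     // weight of the current key (unnormalized)
        l = l * scale + p;
        for (size_t t = 0; t < d; ++t) o[t] = o[t] * scale + p * V[j][t];
        m = m_new;
    }
    for (size_t t = 0; t < d; ++t) o[t] /= l;  // final softmax normalization
    return o;
}
```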
Patch-Based Stochastic Attention (an efficient attention mechanism).
Code for the paper "Cottention: Linear Transformers With Cosine Attention"
🤖FFPA: Extends FlashAttention-2 with Split-D for ~O(1) SRAM complexity at large head dimensions, with a 1.8x~3x speedup 🎉 vs SDPA EA.
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates inference for any model.
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
FlashInfer: Kernel Library for LLM Serving