attention
Here are 10 public repositories matching this topic...
Easy, naive flash attention without optimizations, based on the original paper
Updated Jun 25, 2025 - Cuda
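For readers new to the topic: "naive flash attention" here means the tiled attention-with-online-softmax algorithm from the original FlashAttention paper, without the kernel-level optimizations. A minimal NumPy sketch of that algorithm follows (block size and names are illustrative; the repo itself is CUDA):

```python
import numpy as np

def flash_attention_naive(Q, K, V, block=64):
    """Tiled attention with an online softmax (the FlashAttention idea), unoptimized."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)              # unnormalized output accumulator
    m = np.full(n, -np.inf)           # running row-wise max of the scores
    l = np.zeros(n)                   # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                    # scores against this key block
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])            # block-local softmax numerator
        corr = np.exp(m - m_new)                  # rescale previously accumulated state
        l = l * corr + P.sum(axis=1)
        O = O * corr[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]

# Agrees with the reference softmax(Q K^T / sqrt(d)) V:
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
assert np.allclose(flash_attention_naive(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V)
```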
Code for the paper "Cottention: Linear Transformers With Cosine Attention"
Updated Oct 19, 2024 - Cuda
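Cosine attention replaces the softmax with cosine similarity between queries and keys, which lets the matmuls be reordered for cost linear in sequence length. A hedged NumPy sketch of that general idea, non-causal case only (the paper's exact formulation and normalization may differ):

```python
import numpy as np

def cosine_attention_linear(Q, K, V, eps=1e-6):
    """Softmax-free cosine attention computed in linear time.

    L2-normalize Q and K so Q K^T becomes cosine similarity, then reorder the
    matmuls as Q_n @ (K_n^T @ V): cost O(n * d * d_v) instead of O(n^2 * d).
    """
    Qn = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    return Qn @ (Kn.T @ V)
```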
Patch-Based Stochastic Attention (an efficient attention mechanism)
Updated Jan 16, 2023 - Cuda
A simple implementation of PagedAttention, written purely in CUDA and C++.
Updated Aug 24, 2025 - Cuda
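PagedAttention stores the KV cache in fixed-size physical blocks ("pages") and uses a per-sequence block table to map logical positions to pages, so cache memory can be allocated on demand. A NumPy sketch of a single decode step under that layout (names and shapes are illustrative, not this repo's or vLLM's actual API):

```python
import numpy as np

def paged_attention_decode(q, k_cache, v_cache, block_table, seq_len):
    """One decode step of attention over a paged KV cache.

    k_cache / v_cache have shape (num_physical_blocks, block_size, d); block_table
    lists the physical block index for each logical block of this sequence.
    """
    d = q.shape[-1]
    # Gather this sequence's pages into contiguous K/V and trim padding in the last page.
    K = k_cache[block_table].reshape(-1, d)[:seq_len]
    V = v_cache[block_table].reshape(-1, d)[:seq_len]
    s = (K @ q) / np.sqrt(d)          # scores for the single query token
    p = np.exp(s - s.max())
    p /= p.sum()
    return p @ V
```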
🤖FFPA: extends FlashAttention-2 with Split-D for ~O(1) SRAM complexity at large head dims; 1.8x-3x speedup vs SDPA EA.
Updated Aug 8, 2025 - Cuda
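The "Split-D" idea is to tile along the head dimension as well, accumulating partial Q K^T products so the on-chip (SRAM) working set per step does not grow with head dim. A NumPy sketch of that accumulation (illustrative only; FFPA's actual tiling is done in its CUDA kernels):

```python
import numpy as np

def attention_split_d(Q, K, V, d_chunk=64):
    """Attention with the score matmul accumulated over head-dimension chunks.

    S = Q K^T is built as a sum of partial products over d_chunk-wide slices,
    so per-step storage is independent of the full head dimension.
    """
    n, d = Q.shape
    S = np.zeros((n, K.shape[0]))
    for c in range(0, d, d_chunk):
        S += Q[:, c:c + d_chunk] @ K[:, c:c + d_chunk].T   # partial Q K^T
    S /= np.sqrt(d)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ V
```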
[ICML2025] SpargeAttention: a training-free sparse attention that accelerates inference for any model.
Updated Nov 11, 2025 - Cuda
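The general flavor of training-free sparse attention is to cheaply predict which key blocks matter for each query block and skip the rest. The NumPy sketch below uses a block-mean similarity heuristic purely for illustration; SpargeAttention's actual selection criterion and kernels differ:

```python
import numpy as np

def block_sparse_attention(Q, K, V, block=64, keep_ratio=0.25):
    """Block-sparse attention with a cheap, training-free block-selection heuristic.

    For each query block, score all key blocks by block-mean similarity, keep only
    the top fraction, and run dense attention on the kept keys. Assumes sequence
    lengths divisible by `block`.
    """
    n, d = Q.shape
    nq, nk = n // block, K.shape[0] // block
    q_mean = Q.reshape(nq, block, d).mean(axis=1)
    k_mean = K.reshape(nk, block, d).mean(axis=1)
    coarse = q_mean @ k_mean.T                    # coarse block-to-block relevance
    keep = max(1, int(nk * keep_ratio))
    O = np.zeros_like(Q)
    for qb in range(nq):
        top = np.argsort(coarse[qb])[-keep:]      # key blocks predicted to matter
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top])
        q = Q[qb * block:(qb + 1) * block]
        s = (q @ K[idx].T) / np.sqrt(d)
        p = np.exp(s - s.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        O[qb * block:(qb + 1) * block] = p @ V[idx]
    return O
```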
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without degrading end-to-end metrics across language, image, and video models.
Updated Nov 6, 2025 - Cuda
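The core trick in quantized attention is to run the Q K^T matmul in low precision (e.g. INT8) with per-row or per-block scales and dequantize before the softmax, keeping the softmax and P V in floating point. A NumPy sketch of only that quantize/dequantize step (real kernels add smoothing, per-block scales, and FP8 P V; this is not the paper's implementation):

```python
import numpy as np

def quantize_int8(X):
    """Symmetric per-row INT8 quantization: returns int8 values plus float scales."""
    scale = np.abs(X).max(axis=1, keepdims=True) / 127.0 + 1e-8
    return np.round(X / scale).astype(np.int8), scale

def int8_qk_attention(Q, K, V):
    """Attention with Q and K quantized to INT8 for the score matmul.

    The Q K^T product is done on int8 values (accumulated in int32) and then
    dequantized with the per-row scales; softmax and P V stay in floating point.
    """
    d = Q.shape[1]
    Qq, qs = quantize_int8(Q)
    Kq, ks = quantize_int8(K)
    S = (Qq.astype(np.int32) @ Kq.astype(np.int32).T) * qs * ks.T / np.sqrt(d)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ V
```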
FlashInfer: Kernel Library for LLM Serving
Updated Nov 12, 2025 - Cuda