attention
Here are 10 public repositories matching this topic...
Patch-Based Stochastic Attention (an efficient attention mechanism); see the sketch below.
Updated Jan 16, 2023 - Cuda
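The description suggests an attention mechanism that restricts each query to a stochastically sampled subset of keys. The NumPy sketch below illustrates only that general idea, not the repository's actual patch-based algorithm; `num_samples` and the uniform sampling scheme are assumptions for illustration.

```python
import numpy as np

def stochastic_subset_attention(q, k, v, num_samples=64, rng=None):
    """Attention where each query attends to a random subset of keys.

    Illustration of "stochastic attention" in general only; the repo's
    patch-based sampling scheme may differ. q: (n_q, d), k/v: (n_k, d).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_k, d = k.shape
    m = min(num_samples, n_k)
    # Sample a subset of key/value positions without replacement.
    idx = rng.choice(n_k, size=m, replace=False)
    k_s, v_s = k[idx], v[idx]
    # Scaled dot-product attention over the sampled subset only.
    scores = q @ k_s.T / np.sqrt(d)              # (n_q, m)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v_s                               # (n_q, d)
```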
Code for the paper "Cottention: Linear Transformers With Cosine Attention"; see the sketch below.
Updated Oct 19, 2024 - Cuda
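Cosine attention scores queries against L2-normalized keys instead of applying a softmax to scaled dot products, which lets the output be reassociated as Q(KᵀV) and computed in time linear in sequence length. The NumPy sketch below shows only that reassociation identity under simplified assumptions (no output normalization, no causal masking); the paper's exact formulation differs.

```python
import numpy as np

def l2_normalize(x, eps=1e-6):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_attention_quadratic(q, k, v):
    # Cosine-similarity weights form an (n_q, n_k) matrix: quadratic memory.
    qn, kn = l2_normalize(q), l2_normalize(k)
    return (qn @ kn.T) @ v

def cosine_attention_linear(q, k, v):
    # Same result, reassociated: K^T V is (d, d_v), so cost is linear in n.
    qn, kn = l2_normalize(q), l2_normalize(k)
    return qn @ (kn.T @ v)

# The two orderings are mathematically identical:
q, k, v = (np.random.randn(128, 64) for _ in range(3))
assert np.allclose(cosine_attention_quadratic(q, k, v),
                   cosine_attention_linear(q, k, v), atol=1e-8)
```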
A simple, unoptimized flash attention implementation based on the original paper; see the sketch below.
Updated Jun 25, 2025 - Cuda
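For reference, the core of FlashAttention's algorithm is a tiled pass over key/value blocks with an online-softmax rescaling of the running output. The NumPy sketch below reproduces that recurrence on the CPU; the block size and shapes are illustrative, and no actual memory hierarchy is modeled.

```python
import numpy as np

def naive_flash_attention(q, k, v, block_size=64):
    """Tiled attention using the online-softmax recurrence from the
    FlashAttention paper. q: (n_q, d), k/v: (n_k, d)."""
    n_q, d = q.shape
    n_k = k.shape[0]
    scale = 1.0 / np.sqrt(d)

    out = np.zeros((n_q, v.shape[1]))
    row_max = np.full((n_q, 1), -np.inf)   # running max m_i
    row_sum = np.zeros((n_q, 1))           # running denominator l_i

    for start in range(0, n_k, block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        s = (q @ kb.T) * scale                  # scores for this K/V block
        block_max = s.max(axis=-1, keepdims=True)
        new_max = np.maximum(row_max, block_max)
        p = np.exp(s - new_max)                 # unnormalized block probabilities
        correction = np.exp(row_max - new_max)  # rescale old accumulators
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ vb
        row_max = new_max

    return out / row_sum

# Matches standard softmax attention:
q, k, v = (np.random.randn(100, 32) for _ in range(3))
s = q @ k.T / np.sqrt(32)
ref = np.exp(s - s.max(-1, keepdims=True))
ref = (ref / ref.sum(-1, keepdims=True)) @ v
assert np.allclose(naive_flash_attention(q, k, v), ref)
```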
🤖FFPA: Extends FlashAttention-2 with Split-D tiling, achieving ~O(1) SRAM complexity for large head dimensions and a 1.8x~3x speedup vs. SDPA EA; see the Split-D sketch below.
Updated Aug 8, 2025 - Cuda
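The "Split-D" idea can be understood as accumulating the score matrix S = QKᵀ over chunks of the head dimension, so the on-chip working set stays bounded even for very large head dimensions. The sketch below shows only that accumulation identity; the chunk size and the plain-NumPy framing are assumptions, and FFPA's real kernels fuse this with the softmax and PV stages.

```python
import numpy as np

def qk_scores_split_d(q, k, d_chunk=64):
    """Accumulate S = Q @ K^T over chunks of the head dimension.

    Illustrates why splitting the head dimension bounds the working-set
    size for large headdim; FFPA's actual Split-D tiling and pipelining
    are far more involved.
    """
    n_q, d = q.shape
    s = np.zeros((n_q, k.shape[0]))
    for start in range(0, d, d_chunk):
        # Only Q[:, start:end] and K[:, start:end] would need to be
        # resident on-chip at a time.
        s += q[:, start:start + d_chunk] @ k[:, start:start + d_chunk].T
    return s / np.sqrt(d)

q = np.random.randn(32, 512)   # large head dimension, e.g. 512
k = np.random.randn(48, 512)
assert np.allclose(qk_scores_split_d(q, k), (q @ k.T) / np.sqrt(512))
```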
A simple implementation of PagedAttention written purely in CUDA and C++; see the sketch below.
Updated Aug 24, 2025 - Cuda
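PagedAttention stores the KV cache in fixed-size physical blocks and uses a per-sequence block table to map logical positions to physical blocks, so sequences do not need contiguous cache memory. The sketch below gathers one sequence's keys/values through a block table and then runs ordinary attention for a single decode-step query; the block size, shapes, and explicit gather are illustrative assumptions, not the repository's kernel.

```python
import numpy as np

def paged_attention_single_query(q, k_cache, v_cache, block_table, seq_len,
                                 block_size=16):
    """Decode-time attention for one query over a paged KV cache.

    k_cache/v_cache: (num_physical_blocks, block_size, d) pools.
    block_table: logical block index -> physical block index for this sequence.
    """
    d = q.shape[-1]
    keys, values = [], []
    for logical_block in range((seq_len + block_size - 1) // block_size):
        phys = block_table[logical_block]
        taken = min(block_size, seq_len - logical_block * block_size)
        keys.append(k_cache[phys, :taken])
        values.append(v_cache[phys, :taken])
    k = np.concatenate(keys)          # (seq_len, d)
    v = np.concatenate(values)
    scores = k @ q / np.sqrt(d)       # (seq_len,)
    scores -= scores.max()
    w = np.exp(scores)
    w /= w.sum()
    return w @ v                      # (d,)

# Usage with a toy cache: blocks 7, 2, 5 hold this sequence's 40 tokens.
d, block_size = 8, 16
k_cache = np.random.randn(10, block_size, d)
v_cache = np.random.randn(10, block_size, d)
out = paged_attention_single_query(np.random.randn(d), k_cache, v_cache,
                                   block_table=[7, 2, 5], seq_len=40,
                                   block_size=block_size)
```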
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models; see the sketch below.
Updated Nov 6, 2025 - Cuda
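The key idea in this line of work is to quantize Q and K to INT8 so QKᵀ can run on integer tensor cores, after smoothing K by removing its per-channel mean (a shift that only adds a per-query constant to the scores, which softmax ignores). The sketch below uses per-tensor scales and keeps PV in full precision; those simplifications, and the NumPy framing, are assumptions for illustration rather than the repository's method.

```python
import numpy as np

def int8_quantize(x):
    """Symmetric INT8 quantization (per-tensor here; per-block in real kernels)."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def quantized_attention(q, k, v):
    """Sketch of INT8-quantized QK^T attention."""
    d = q.shape[-1]
    # Remove the channel-wise mean of K; softmax is invariant to the
    # resulting per-query constant shift in the scores.
    k_smooth = k - k.mean(axis=0, keepdims=True)
    q_i8, q_scale = int8_quantize(q)
    k_i8, k_scale = int8_quantize(k_smooth)
    # Integer matmul, then dequantize with the product of the scales.
    s = (q_i8.astype(np.int32) @ k_i8.astype(np.int32).T) * (q_scale * k_scale)
    s = s / np.sqrt(d)
    s -= s.max(axis=-1, keepdims=True)
    p = np.exp(s)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v          # PV kept in full precision in this sketch
```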
[ICML2025] SpargeAttention: training-free sparse attention that accelerates inference for any model; see the sketch below.
Updated Nov 10, 2025 - Cuda
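A training-free sparse attention scheme typically scores key/value blocks with a cheap proxy and runs exact attention only on the blocks it keeps. The sketch below uses block-mean dot products as a hypothetical proxy and a fixed keep ratio; SpargeAttention's actual selection criterion, thresholds, and fused online-softmax kernel differ.

```python
import numpy as np

def blockwise_sparse_attention(q, k, v, block=32, keep_ratio=0.25):
    """Training-free block-sparse attention sketch.

    For each query block, score key blocks by a cheap proxy, keep only the
    top fraction, and run exact attention over the kept blocks.
    """
    n_q, d = q.shape
    n_k = k.shape[0]
    n_qb = (n_q + block - 1) // block
    n_kb = (n_k + block - 1) // block
    q_means = np.stack([q[i*block:(i+1)*block].mean(0) for i in range(n_qb)])
    k_means = np.stack([k[j*block:(j+1)*block].mean(0) for j in range(n_kb)])
    proxy = q_means @ k_means.T                   # (n_qb, n_kb) block scores
    n_keep = max(1, int(np.ceil(keep_ratio * n_kb)))

    out = np.zeros((n_q, v.shape[1]))
    for i in range(n_qb):
        qi = q[i*block:(i+1)*block]
        keep = np.argsort(proxy[i])[-n_keep:]     # indices of kept key blocks
        idx = np.concatenate(
            [np.arange(j*block, min((j+1)*block, n_k)) for j in keep])
        s = qi @ k[idx].T / np.sqrt(d)
        s -= s.max(axis=-1, keepdims=True)
        p = np.exp(s)
        p /= p.sum(axis=-1, keepdims=True)
        out[i*block:(i+1)*block] = p @ v[idx]
    return out
```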
FlashInfer: a kernel library for LLM serving.
Updated Nov 10, 2025 - Cuda