# flash-attention
Here are 5 public repositories matching this topic...
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference.
Topics: gpu, cuda, inference, nvidia, mha, mla, multi-head-attention, gqa, mqa, llm, large-language-model, flash-attention, cuda-core, decoding-attention, flashinfer, flashmla
Updated Jun 11, 2025 - C++

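For context, the operation such decoding kernels accelerate is a single query token attending over the full KV cache of a head. The CPU reference below is a minimal sketch of that step; the function name, arguments and row-major layout are illustrative assumptions, not the repository's actual API.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Decode-stage attention for one head on the CPU: a single query vector
// attends over the whole KV cache. Names and layout are illustrative only.
std::vector<float> decode_attention(const std::vector<float>& q,        // [head_dim]
                                    const std::vector<float>& k_cache,  // [seq_len * head_dim], row-major
                                    const std::vector<float>& v_cache,  // [seq_len * head_dim], row-major
                                    int seq_len, int head_dim) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));

    // Dot product of the query with every cached key, tracking the max for stability.
    std::vector<float> scores(seq_len);
    float max_score = -INFINITY;
    for (int t = 0; t < seq_len; ++t) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; ++d)
            s += q[d] * k_cache[t * head_dim + d];
        scores[t] = s * scale;
        max_score = std::max(max_score, scores[t]);
    }

    // Numerically stable softmax over the cache length.
    float denom = 0.0f;
    for (int t = 0; t < seq_len; ++t) {
        scores[t] = std::exp(scores[t] - max_score);
        denom += scores[t];
    }

    // Weighted sum of cached values gives the output for the new token.
    std::vector<float> out(head_dim, 0.0f);
    for (int t = 0; t < seq_len; ++t) {
        const float w = scores[t] / denom;
        for (int d = 0; d < head_dim; ++d)
            out[d] += w * v_cache[t * head_dim + d];
    }
    return out;
}
```

GPU implementations parallelize the score and reduction loops across threads per head and batch element, but the arithmetic is the same; MQA, GQA and MLA mainly change how keys and values are shared or compressed across heads.
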
Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Topics: gpu, cuda, inference, nvidia, cutlass, mha, multi-head-attention, llm, tensor-core, large-language-model, flash-attention, flash-attention-2
Updated Feb 27, 2025 - C++

Vulkan & GLSL implementation of FlashAttention-2.
Topics: vulkan, glsl, artificial-intelligence, gpu-acceleration, attention, gpu-computing, deel-learning, tensor-cores, large-language-models, llm, flash-attention, flash-attention-2
Updated Jan 19, 2025 - C++

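The key idea FlashAttention-2 ports to compute shaders is the online softmax: keys and values are streamed in tiles, and a running maximum and denominator rescale partial outputs so the full attention-score matrix is never stored. The sketch below shows that recurrence on the CPU for a single query row; the names and tile size are illustrative assumptions, not taken from the Vulkan/GLSL kernels.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Online-softmax recurrence at the heart of FlashAttention-2, for one query
// row. Keys/values are visited in tiles; a running max (m), running
// denominator (l) and an output accumulator are rescaled on the fly, so the
// full row of attention scores is never materialized. Illustrative sketch only.
std::vector<float> flash_attention_row(const std::vector<float>& q,  // [d]
                                       const std::vector<float>& K,  // [n * d], row-major
                                       const std::vector<float>& V,  // [n * d], row-major
                                       int n, int d, int tile = 64) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    float m = -INFINITY;              // running max of scores seen so far
    float l = 0.0f;                   // running softmax denominator
    std::vector<float> acc(d, 0.0f);  // unnormalized output accumulator

    for (int start = 0; start < n; start += tile) {
        const int end = std::min(start + tile, n);
        for (int t = start; t < end; ++t) {
            float s = 0.0f;
            for (int i = 0; i < d; ++i) s += q[i] * K[t * d + i];
            s *= scale;

            const float m_new = std::max(m, s);
            const float correction = std::exp(m - m_new);  // rescales earlier partial sums
            const float p = std::exp(s - m_new);

            for (int i = 0; i < d; ++i)
                acc[i] = acc[i] * correction + p * V[t * d + i];
            l = l * correction + p;
            m = m_new;
        }
    }

    for (int i = 0; i < d; ++i) acc[i] /= l;  // final normalization by the softmax denominator
    return acc;
}
```

In GPU kernels each workgroup typically handles a block of query rows and the inner loops are vectorized, but the rescaling recurrence is the same.
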
Poplar implementation of FlashAttention for IPU.
Updated Mar 12, 2024 - C++