Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference.
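To make concrete what a decode-stage attention kernel computes, here is a minimal CPU reference sketch of one grouped-query attention (GQA) decode step: a single new query token attends over the cached K/V of all previous positions. The function name `decode_attention_ref`, the tensor layouts, and all parameters are illustrative assumptions, not the repository's actual API; MHA corresponds to `num_kv_heads == num_q_heads` and MQA to `num_kv_heads == 1`.

```cpp
// Hypothetical CPU reference for one decode step of grouped-query attention.
// This is a sketch of the math only, not the optimized CUDA kernel.
#include <cmath>
#include <vector>

// q:       [num_q_heads, head_dim]            query of the single new token
// k_cache: [num_kv_heads, seq_len, head_dim]  cached keys
// v_cache: [num_kv_heads, seq_len, head_dim]  cached values
// out:     [num_q_heads, head_dim]            attention output
void decode_attention_ref(const std::vector<float>& q,
                          const std::vector<float>& k_cache,
                          const std::vector<float>& v_cache,
                          std::vector<float>& out,
                          int num_q_heads, int num_kv_heads,
                          int seq_len, int head_dim) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
    const int group = num_q_heads / num_kv_heads;  // query heads per KV head
    for (int h = 0; h < num_q_heads; ++h) {
        const int kvh = h / group;                 // KV head shared by this query head
        const float* qh = &q[h * head_dim];
        // Scaled dot-product scores against every cached position.
        std::vector<float> scores(seq_len);
        float max_s = -1e30f;
        for (int t = 0; t < seq_len; ++t) {
            const float* kt = &k_cache[(kvh * seq_len + t) * head_dim];
            float s = 0.0f;
            for (int d = 0; d < head_dim; ++d) s += qh[d] * kt[d];
            scores[t] = s * scale;
            if (scores[t] > max_s) max_s = scores[t];
        }
        // Numerically stable softmax over the sequence dimension.
        float denom = 0.0f;
        for (int t = 0; t < seq_len; ++t) {
            scores[t] = std::exp(scores[t] - max_s);
            denom += scores[t];
        }
        // Weighted sum of cached values.
        for (int d = 0; d < head_dim; ++d) {
            float acc = 0.0f;
            for (int t = 0; t < seq_len; ++t)
                acc += scores[t] * v_cache[(kvh * seq_len + t) * head_dim + d];
            out[h * head_dim + d] = acc / denom;
        }
    }
}
```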
Benchmarks the performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
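A sketch of how such a benchmark might time an attention call from C++. The `time_ms` helper below is hypothetical and not part of the repository; for GPU kernels, the callable passed in would also need to synchronize the device (e.g. `cudaDeviceSynchronize()`) before the clock is read, otherwise only the launch latency is measured.

```cpp
// Hypothetical timing harness: averages wall-clock time of a callable in ms.
#include <chrono>
#include <functional>

double time_ms(const std::function<void()>& fn, int warmup = 3, int iters = 20) {
    for (int i = 0; i < warmup; ++i) fn();  // warm up caches, allocators, GPU clocks
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}
```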