Starred repositories, filtered to those written in CUDA
A massively parallel, optimal functional runtime in Rust
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models (see the quantized-attention sketch after this list).
Reference implementation of the Megalodon 7B model
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
Implementation of fused cosine-similarity attention in the same style as Flash Attention (see the cosine-attention sketch after this list)
Code for the paper "Cottention: Linear Transformers With Cosine Attention"
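
A minimal sketch of the general idea behind quantized attention, as referenced in the SageAttention entry above: quantize Q and K to INT8 with per-tensor scales, compute the Q·K^T matmul in integer arithmetic, and dequantize the logits before the softmax. This is an illustration only, not the repository's actual fused CUDA kernels; the function names and the NumPy reference below are assumptions of mine.

```python
# Illustrative sketch of quantized attention (not SageAttention's actual kernels).
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization; returns int8 values and a scale."""
    scale = max(np.abs(x).max() / 127.0, 1e-8)
    return np.round(x / scale).astype(np.int8), scale

def quantized_attention(q, k, v):
    """q, k, v: (seq_len, head_dim) float arrays."""
    d = q.shape[-1]
    q_i8, q_scale = quantize_int8(q)
    k_i8, k_scale = quantize_int8(k)
    # Integer matmul (accumulated in int32), then dequantize the logits.
    logits = (q_i8.astype(np.int32) @ k_i8.astype(np.int32).T) * (q_scale * k_scale) / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

# Toy usage
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
out = quantized_attention(q, k, v)   # (8, 16)
```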
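
A similar sketch for the last two entries: cosine attention L2-normalizes queries and keys so the attention logits become cosine similarities, scaled by a temperature. This is hypothetical NumPy code, not taken from either repository, and the `scale` temperature is an assumed parameter; the Cottention paper further rearranges cosine attention into a linear-attention form, which is not shown here.

```python
# Illustrative sketch of cosine-similarity attention (softmax form).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cosine_attention(q, k, v, scale=10.0):
    """q, k, v: (seq_len, head_dim); `scale` is a hypothetical temperature."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    scores = scale * qn @ kn.T          # cosine similarities in [-1, 1], rescaled
    return softmax(scores, axis=-1) @ v

# Toy usage
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
out = cosine_attention(q, k, v)        # (8, 16)
```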