Stars: 9 repositories written in CUDA
A massively parallel, optimal functional runtime in Rust
Reference implementation of Megalodon 7B model
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
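The selection criterion behind Quest can be sketched in a few lines: each KV-cache page keeps element-wise min/max summaries of its keys, pages are scored by an upper bound on their attention logit for the current query, and only the top-scoring pages are attended. A minimal NumPy sketch of that selection step, with illustrative names and shapes (the repository implements this as CUDA kernels):

```python
import numpy as np

def select_pages(q, k_min, k_max, top_k):
    # Per-page upper bound on the attention logit: for each channel take
    # whichever of q*min or q*max is larger, then sum over channels.
    upper = np.maximum(q * k_min, q * k_max).sum(axis=-1)   # (num_pages,)
    return np.argsort(upper)[-top_k:]                       # best page indices

rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 16, 64))   # 8 pages x 16 keys x head dim 64
q = rng.standard_normal(64)               # current query
print(select_pages(q, keys.min(axis=1), keys.max(axis=1), top_k=2))
```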
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
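For context, FP6-style quantization snaps each weight to the nearest point of a 6-bit floating-point grid. Below is a hedged NumPy sketch assuming one common 6-bit layout (1 sign, 3 exponent, 2 mantissa bits, bias 3); the repository ships fused GPU dequantization kernels rather than a reference lookup like this:

```python
import numpy as np

def fp6_e3m2_decode(code):
    # Decode a 6-bit code (1 sign, 3 exponent, 2 mantissa, bias 3) to float.
    sign = -1.0 if (code >> 5) & 1 else 1.0
    exp = (code >> 2) & 0b111
    man = code & 0b11
    if exp == 0:                                    # subnormal range
        return sign * (man / 4.0) * 2.0 ** (1 - 3)
    return sign * (1.0 + man / 4.0) * 2.0 ** (exp - 3)

# Enumerate the full 64-entry codebook, then quantize by nearest neighbor.
codebook = np.array([fp6_e3m2_decode(c) for c in range(64)])

def quantize_fp6(x):
    idx = np.abs(x[..., None] - codebook).argmin(axis=-1)
    return codebook[idx]

w = np.array([0.17, -1.9, 3.3, 25.0])
print(quantize_fp6(w))   # weights snapped to the nearest FP6 grid point
```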
Implementation of fused cosine similarity attention in the same style as Flash Attention
Code for the paper "Cottention: Linear Transformers With Cosine Attention"
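The last two entries both build on cosine-similarity attention: queries and keys are l2-normalized so the logits are bounded cosine similarities under a fixed scale (Cottention additionally drops the softmax to make the cost linear in sequence length). A minimal single-head NumPy sketch of the softmax variant, with an illustrative scale rather than either repository's default:

```python
import numpy as np

def l2norm(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_sim_attention(q, k, v, scale=10.0):
    # Replace the usual 1/sqrt(d) dot-product logits with cosine
    # similarities: l2-normalize q and k, then apply a fixed scale.
    sim = scale * (l2norm(q) @ l2norm(k).T)    # logits in [-scale, scale]
    sim -= sim.max(axis=-1, keepdims=True)     # numerically stable softmax
    attn = np.exp(sim)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
print(cosine_sim_attention(q, k, v).shape)     # (4, 8)
```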