[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

Cuda 2,861 286 Updated Dec 11, 2025

yassa9 / qwen600

Static suckless single batch CUDA-only qwen3-0.6B mini inference engine

Cuda 534 44 Updated Sep 8, 2025

XuezheMax / megalodon

Reference implementation of Megalodon 7B model

Cuda 527 54 Updated May 17, 2025

66RING / tiny-flash-attention

flash attention tutorial written in python, triton, cuda, cutlass

Cuda 459 50 Updated May 14, 2025

likejazz / llama3.cuda

llama3.cuda is a pure C/CUDA implementation for Llama 3 model.

Cuda 349 26 Updated Apr 27, 2025

ByteDance-Seed / decoupleQ

A quantization algorithm for LLM

Cuda 146 8 Updated Jun 21, 2024

casper-hansen / AutoAWQ_kernels

Cuda 78 23 Updated Nov 26, 2024

l1351868270 / ld_qwen2.c

Cuda 5 Updated Jun 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

leixy76

Block or report leixy76

Lists (9)

agent

BI

fine-tuning

function

Inference

llama

orchestration

prompt

rag

Starred repositories

karpathy / llm.c

deepseek-ai / DeepEP

deepseek-ai / DeepGEMM

flashinfer-ai / flashinfer

thu-ml / SageAttention