
ZanePoe's starred repositories

8 starred repositories, written in CUDA

FlashInfer: Kernel Library for LLM Serving

CUDA · 4,019 stars · 558 forks · Updated Nov 6, 2025

[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention that achieves a 2-5x speedup compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

CUDA · 2,622 stars · 258 forks · Updated Oct 28, 2025

Flash Attention in ~100 lines of CUDA (forward pass only); a minimal sketch of the core online-softmax idea appears after this list.

CUDA · 961 stars · 97 forks · Updated Dec 30, 2024

[ICML 2025] SpargeAttention: training-free sparse attention that accelerates inference for any model.

CUDA · 758 stars · 65 forks · Updated Oct 31, 2025

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

CUDA · 270 stars · 22 forks · Updated Jul 16, 2025

CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…

CUDA · 201 stars · 20 forks · Updated Oct 10, 2025

Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

CUDA · 75 stars · 4 forks · Updated Oct 16, 2025

SageAttention for Turing GPUs.

CUDA · 25 stars · 3 forks · Updated Sep 5, 2025
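As referenced in the "Flash Attention in ~100 lines" entry above, the trick that keeps a flash-attention forward kernel so small is the online-softmax update: scores are streamed key by key while a running max, running denominator, and unnormalized output accumulator are rescaled on the fly, so the full N x N score matrix is never materialized. The sketch below illustrates only that idea; it is not code from any repository listed here, it omits the tiling and shared-memory staging a real kernel uses, and the names and limits in it (naive_flash_fwd, the d <= 128 cap) are illustrative assumptions.

```cuda
// Minimal sketch of the online-softmax core of a flash-attention forward
// pass: one thread owns one query row and streams over all keys. Not a
// tuned kernel -- no tiling, no shared memory, fp32 only.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void naive_flash_fwd(const float* Q, const float* K, const float* V,
                                float* O, int N, int d) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one query row per thread
    if (row >= N) return;

    const float scale = rsqrtf((float)d);
    float m = -INFINITY;   // running max of scores seen so far
    float l = 0.0f;        // running softmax denominator
    float acc[128];        // running unnormalized output (assumes d <= 128)
    for (int k = 0; k < d; ++k) acc[k] = 0.0f;

    for (int j = 0; j < N; ++j) {
        // s = (q . k_j) * scale
        float s = 0.0f;
        for (int k = 0; k < d; ++k) s += Q[row * d + k] * K[j * d + k];
        s *= scale;

        // Online-softmax update: rescale prior state by exp(m_old - m_new)
        // so earlier contributions stay consistent with the new max.
        float m_new = fmaxf(m, s);
        float correction = __expf(m - m_new);
        float p = __expf(s - m_new);
        l = l * correction + p;
        for (int k = 0; k < d; ++k)
            acc[k] = acc[k] * correction + p * V[j * d + k];
        m = m_new;
    }
    for (int k = 0; k < d; ++k) O[row * d + k] = acc[k] / l;  // normalize once at the end
}

int main() {
    const int N = 64, d = 32;
    size_t bytes = (size_t)N * d * sizeof(float);
    float *Q, *K, *V, *O;
    cudaMallocManaged(&Q, bytes); cudaMallocManaged(&K, bytes);
    cudaMallocManaged(&V, bytes); cudaMallocManaged(&O, bytes);
    for (int i = 0; i < N * d; ++i) {  // arbitrary deterministic test data
        Q[i] = 0.01f * (i % 7); K[i] = 0.02f * (i % 5); V[i] = 0.03f * (i % 3);
    }
    naive_flash_fwd<<<(N + 127) / 128, 128>>>(Q, K, V, O, N, d);
    cudaDeviceSynchronize();
    printf("O[0] = %f\n", O[0]);
    return 0;
}
```

Because the accumulator is rescaled by exp(m_old - m_new) before each new key is folded in, the final normalized result matches an exact softmax over all N scores; production kernels add tiling over key/value blocks and shared-memory staging to make this streaming memory-efficient.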