Stars
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
FlashInfer: Kernel Library for LLM Serving
CUDA-accelerated rasterization of Gaussian splatting
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
GPU-accelerated t-SNE in CUDA with Python bindings
Causal depthwise conv1d in CUDA, with a PyTorch interface
Reference implementation of the Megalodon 7B model
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Implementation of fused cosine similarity attention in the same style as Flash Attention (a plain-PyTorch sketch of the underlying computation follows this list)
Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks", containing the code for the paper's experiments.
Lightweight Llama 3 8B Inference Engine in CUDA C
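For reference on the fused cosine similarity attention entry above: the repo fuses the computation into a single kernel, but the underlying idea is simply to l2-normalize queries and keys and replace the usual 1/sqrt(d) scaling with a fixed (or learned) temperature. The sketch below is a minimal, unfused plain-PyTorch illustration of that computation, not the repo's API; the `scale` value, tensor layout, and function name are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def cosine_sim_attention(q, k, v, scale=10.0, causal=False):
    # q, k, v: (batch, heads, seq_len, head_dim). Hypothetical helper,
    # not the repo's interface.
    q = F.normalize(q, dim=-1)          # unit-norm queries
    k = F.normalize(k, dim=-1)          # unit-norm keys
    # Cosine similarities, scaled by a fixed temperature instead of 1/sqrt(d).
    sim = torch.einsum('bhid,bhjd->bhij', q, k) * scale
    if causal:
        i, j = sim.shape[-2:]
        mask = torch.ones(i, j, dtype=torch.bool, device=sim.device).triu(j - i + 1)
        sim = sim.masked_fill(mask, float('-inf'))
    attn = sim.softmax(dim=-1)
    return torch.einsum('bhij,bhjd->bhid', attn, v)

# Usage example:
# q = k = v = torch.randn(2, 8, 128, 64)
# out = cosine_sim_attention(q, k, v, causal=True)   # -> (2, 8, 128, 64)
```

Because the normalized logits are bounded by the scale, this variant tends to be numerically tamer than standard dot-product attention; the fused CUDA version in the repo avoids materializing the full `sim` matrix, in the same spirit as Flash Attention.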