Stars
A dynamic binary instrumentation tool for tracing and analyzing CUDA kernel instructions.
🎬 3.7× faster video generation E2E · 🖼️ 1.6× faster image generation E2E · ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 · 💨 ColumnSparseGEMM 2.5× vs cuBLAS
Causal depthwise conv1d in CUDA, with a PyTorch interface
Ongoing research training transformer language models at scale, including: BERT & GPT-2
[AAAI 2026] Official implementation of "FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models". If you find this repository helpful, please consider starring 🌟 it to support the p…
Zonos2 is a leading open-weight text-to-speech MoE.
Implementation of 2-simplicial attention, proposed by Clift et al. (2019), and of the recent attempt to make it practical in "Fast and Simplex" (Roy et al., 2025)
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
AdaSplash: Adaptive Sparse Flash Attention (aka Flash Entmax Attention)
Triton kernels for dynamic causal short convolutions.
LM engine is a library for pretraining/finetuning LLMs
Official PyTorch Implementation of Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
CUDA kernels for linear attention variants, written in CuTe DSL and CUTLASS C++.
Official repository for Parallax (Parameterized Local Linear Attention)
The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.
Delta Attention Residuals - supplementary code and pretrained models
SpectralQuant: Calibrated Eigenbasis Rotation and Water-Filled Bit Allocation for KV-Cache Compression
A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
A kernel library written in tilelang
Experimental GPU language with meta-programming
Sequential Monte Carlo Speculative Decoding