Stars
[CVPR 2026 Highlight] Official implementation of Log-linear Sparse Attention (LLSA).
mKernel: fast multi-node, multi-GPU fused kernels
Official native C++ client SDK for LiveKit: build realtime audio, video, and data applications using the LiveKit protocol.
A simple, performant and scalable JAX LLM!
Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.
AI agents running research on single-GPU nanochat training automatically
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
A place to store reusable transformer components of my own creation or found on the interwebs
RL agent fusing real-time Binance futures data into Polymarket prediction markets. On-device training with MLX on Apple Silicon.
Triton-based Symmetric Memory operators and examples
Tensors and dynamic neural networks in Python with strong GPU acceleration
NSA Triton Kernels written with GPT5 and Opus 4.1
Cog wrapper of Black Forest Labs' FLUX.1 Kontext [dev]
Interactive visualizations of the geometric intuition behind diffusion models.
A PyTorch native platform for training generative AI models
depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.
FlashMLA: Efficient Multi-head Latent Attention Kernels
Context parallel attention that accelerates DiT model inference with dynamic caching (https://wavespeed.ai/)
A bunch of kernels that might make stuff slower 😉
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)