Stars
A kernel library written in tilelang
A CUDA kernel optimization toolkit for validation, benchmarking, Nsight Compute profiling, bottleneck analysis, and iterative tuning. It helps improve custom GPU operators with reproducible workflo…
Accelerating MoE with IO and Tile-aware Optimizations
Persistent file-based planning for AI coding agents and long-running agentic tasks. Crash-proof markdown plans that survive context loss and /clear, plus a deterministic completion gate and multi-a…
An agentic skills framework & software development methodology that works.
AI agents running research on single-GPU nanochat training automatically
High-performance sparse mask attention CUDA operator for short sequences, optimized for sequences under 1K with 75% sparsity
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
A Next-Generation Training Engine Built for Ultra-Large MoE Models
KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
DefTruth / CUDA-Learn-Notes
Forked from xlite-dev/LeetCUDA
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
Triton implementation of FlashAttention2 that adds Custom Masks.
[DEPRECATED] Moved to ROCm/rocm-libraries repo. NOTE: develop branch is maintained as a read-only mirror
Efficient Triton Kernels for LLM Training
Collection of kernels written in Triton language
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
Repository hosting code for "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (https://arxiv.org/abs/2402.17152).
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.