Efficient Triton Kernels for LLM Training
FlagGems is an operator library for large language models implemented in the Triton Language.
Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
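For orientation, here is a minimal sketch of what such a kernel looks like in practice: an elementwise add written with OpenAI Triton and launched from PyTorch. The function names and block size below are chosen for illustration and are not taken from any of the libraries listed here.

```python
# A minimal Triton kernel: elementwise add, launched from PyTorch.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                   # one program instance per block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```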
Trainable, fast, and memory-efficient sparse attention
Automatic ROP Chain Generation
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
SymGDB - a symbolic execution plugin for GDB
TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels
FlashSinkhorn: IO-Aware Entropic Optimal Transport in PyTorch + Triton. Streaming Sinkhorn with O(nd) memory.
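FlashSinkhorn's contribution is the IO-aware streaming formulation; purely as a point of reference, a naive log-domain Sinkhorn for entropic optimal transport between uniform point clouds looks like the sketch below. Unlike the streaming version, it materializes the full n×m cost matrix, so its memory is O(nm) rather than O(nd); all names here are illustrative.

```python
# Dense reference sketch of entropic-OT Sinkhorn iterations in PyTorch.
# Materializes the full cost matrix, so memory is O(nm), not O(nd).
import math
import torch

def sinkhorn_logdomain(x, y, eps=0.05, n_iters=50):
    """Entropic OT between uniform measures on point clouds x (n,d) and y (m,d)."""
    n, m = x.shape[0], y.shape[0]
    C = torch.cdist(x, y) ** 2                 # (n, m) squared-distance cost
    log_a = torch.full((n,), -math.log(n))     # log of uniform source weights
    log_b = torch.full((m,), -math.log(m))     # log of uniform target weights
    f = torch.zeros(n)                         # dual potentials
    g = torch.zeros(m)
    for _ in range(n_iters):
        # alternate dual updates; logsumexp keeps the iteration numerically stable
        f = -eps * torch.logsumexp((g[None, :] - C) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - C) / eps + log_a[:, None], dim=0)
    # recover the transport plan P_ij = a_i b_j exp((f_i + g_j - C_ij) / eps)
    P = torch.exp((f[:, None] + g[None, :] - C) / eps + log_a[:, None] + log_b[None, :])
    return P
```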
AMD RAD's Triton-based framework for seamless multi-GPU programming
A performance library for machine learning applications.
nanoRLHF: a from-scratch journey into how LLMs and RLHF really work.
Triton implementation of FlashAttention2 that adds custom masks.
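What "custom masks" means is easiest to see in plain PyTorch: a boolean (seq, seq) matrix decides which query/key pairs may attend. The repo's value is fusing this into a FlashAttention2-style Triton kernel so that neither the score matrix nor the mask is ever fully materialized; the sketch below is only a naive reference with illustrative names.

```python
# Naive reference for attention with an arbitrary boolean mask.
import math
import torch

def masked_attention(q, k, v, custom_mask):
    # q, k, v: (batch, heads, seq, head_dim); custom_mask: (seq, seq) bool,
    # True where attention is allowed (e.g. causal, block-sparse, per-document).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    scores = scores.masked_fill(~custom_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```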
ClearML - Model-Serving Orchestration and Repository Solution
Efficient sub-quadratic attention applied post-training, with no additional training required. Implemented with OpenAI Triton.
LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, MoE expert parallelism, OpenAI-compatible serving
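As a toy illustration of the first feature in that list, a paged KV cache boils down to a shared pool of fixed-size blocks plus a per-sequence block table mapping logical token positions to physical blocks, so sequences can grow and release memory in block-sized units. All class names and sizes below are made up for the sketch and are not the engine's actual API.

```python
# Toy sketch of paged-KV-cache bookkeeping; names and sizes are illustrative.
import torch

BLOCK_SIZE = 16  # tokens per physical cache block

class PagedKVCache:
    def __init__(self, num_blocks, num_heads, head_dim):
        # physical pool shared by all sequences: (blocks, block_size, heads, dim)
        self.k_pool = torch.empty(num_blocks, BLOCK_SIZE, num_heads, head_dim)
        self.v_pool = torch.empty_like(self.k_pool)
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}          # seq_id -> list of physical block ids
        self.seq_lens = {}              # seq_id -> tokens written so far

    def append(self, seq_id, k, v):
        """Write one token's K/V tensors of shape (heads, dim) for a sequence."""
        pos = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:       # current block is full: grab a fresh one
            table.append(self.free_blocks.pop())
        block, slot = table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
        self.k_pool[block, slot] = k
        self.v_pool[block, slot] = v
        self.seq_lens[seq_id] = pos + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```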
(WIP) A simple, lightweight, fast, pipelined deployment framework for algorithm services, designed for reliability, high concurrency, and scalability.