Stars
Official inference repo for FLUX.2 models
Super basic implementation (gist-like) of RLMs with REPL environments.
Official PyTorch Implementation of "Diffusion Transformers with Representation Autoencoders"
A framework for few-shot evaluation of language models.
AMD RAD's Triton-based framework for seamless multi-GPU programming
Ship correct and fast LLM kernels to PyTorch
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
Qwen-Image is a powerful image generation foundation model capable of complex text rendering and precise image editing.
Python tool for converting files and office documents to Markdown.
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
Super fast FP32 matrix multiplication on RDNA3
Simple high-throughput inference library
ROCm / triton
Forked from triton-lang/triton
Development repository for the Triton language and compiler
Official implementation of Half-Quadratic Quantization (HQQ)
A TTS model capable of generating ultra-realistic dialogue in one pass.
Framework to reduce autotune overhead to zero for well-known deployments.
Distributed Compiler based on Triton for Parallel Systems
Gzip Decompression and Random Access for Modern Multi-Core Machines
mingfeima / sglang
Forked from sgl-project/sglang
SGLang is a fast serving framework for large language models and vision language models.
FlashMLA: Efficient Multi-head Latent Attention Kernels
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
Cost-efficient and pluggable infrastructure components for GenAI inference