View awgu's full-sized avatar

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

Python 5,432 486 Updated Mar 27, 2026

Supporting code for the blog post on modular manifolds.

Python 121 13 Updated Sep 26, 2025

A Quirky Assortment of CuTe Kernels

Python 866 100 Updated Mar 27, 2026

🚀 Efficient implementations of state-of-the-art linear attention models

Python 4,713 461 Updated Mar 27, 2026

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

C++ 1,548 730 Updated Mar 27, 2026

Distributed Compiler based on Triton for Parallel Systems

Python 1,396 134 Updated Mar 11, 2026

Fast and memory-efficient exact attention

Python 23,002 2,559 Updated Mar 26, 2026

Ongoing research training transformer models at scale

Python 15,822 3,762 Updated Mar 26, 2026

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance…

Python 3,246 678 Updated Mar 25, 2026

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 25,086 5,027 Updated Mar 27, 2026

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

7,972 288 Updated May 15, 2025

Analyze computation-communication overlap in V3/R1.

1,150 145 Updated Mar 21, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.

Python 2,936 318 Updated Jan 14, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 6,289 842 Updated Mar 22, 2026

DeepEP: an efficient expert-parallel communication library

Cuda 9,075 1,130 Updated Feb 9, 2026

Python 164 16 Updated Dec 27, 2024

Tile primitives for speedy kernels

Cuda 3,276 267 Updated Mar 25, 2026

Important concepts in numerical linear algebra and related areas

812 68 Updated Jan 13, 2024

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 9,496 1,751 Updated Mar 24, 2026

A PyTorch native platform for training generative AI models

Python 5,191 760 Updated Mar 27, 2026

The Python programming language

Python 72,117 34,315 Updated Mar 27, 2026

Development repository for the Triton language and compiler

MLIR 18,777 2,708 Updated Mar 27, 2026

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Python 98,609 27,317 Updated Mar 27, 2026

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more

Python 35,235 3,489 Updated Mar 27, 2026